=Paper=
{{Paper
|id=Vol-3878/50_main_long
|storemode=property
|title=The Self-Contained Italian Negation Test (SCIN)
|pdfUrl=https://ceur-ws.org/Vol-3878/50_main_long.pdf
|volume=Vol-3878
|authors=Viola Gullace,David Kletz,Thierry Poibeau,Alessandro Lenci,Pascal Amsili
|dblpUrl=https://dblp.org/rec/conf/clic-it/GullaceKPLA24
}}
==The Self-Contained Italian Negation Test (SCIN)==
Viola Gullace (1,2,3,†), David Kletz (1,4,*,†), Thierry Poibeau (1), Alessandro Lenci (2), Pascal Amsili (1)

1 Lattice, CNRS & ENS-PSL & U. Sorbonne-Nouvelle, 1 rue Maurice Arnoux, F-92120 Montrouge, France
2 CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa, Via Santa Maria, Pisa, 56126, Italy
3 Scuola Normale Superiore, Piazza dei Cavalieri 7, Pisa, 56126, Italy
4 LLF, CNRS & Université Paris Cité, 8 Rue Albert Einstein, 75013 Paris, France
Abstract
Recent research has focused extensively on state-of-the-art pretrained language models, particularly those based on Transformer architectures, and how well they account for negation and other linguistic phenomena in various tasks. This study aims to evaluate the understanding of negation in Italian bert- and roberta-based models, in contrast with the predominantly English-focused prior research. We develop the SCIN Set, an Italian dataset designed to model the influence of polarity constraints on models in a masked prediction task. Applying the SCIN Set reveals that these models do not adjust their behaviour based on sentence polarity, even when the resulting sentence is contradictory. We conclude that the tested models lack a clear understanding of how negation alters sentence meaning.
Keywords
negation, Italian PLMs, testing, self-contained
* Corresponding author.
† These authors contributed equally.
viola.gullace@sns.it (V. Gullace); david.kletz@sorbonne-nouvelle.fr (D. Kletz); thierry.poibeau@ens.psl.eu (T. Poibeau); alessandro.lenci@unipi.it (A. Lenci); Pascal.Amsili@ens.fr (P. Amsili)
https://people.unipi.it/alessandro_lenci/ (A. Lenci); https://lattice.cnrs.fr/amsili/ (P. Amsili)
ORCID: 0000-0003-3669-4051 (T. Poibeau); 0000-0001-5790-4308 (A. Lenci); 0000-0002-5901-5050 (P. Amsili)
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Compositionality is a fundamental feature of human language, based on the principle that the meaning of a complex expression derives from its parts and their respective arrangements.

One notable compositional phenomenon is negation, formally defined as a semantic operator (or function) that reverses the truth-value of a sentence [1].

Given its importance, understanding how well pretrained language models (PLMs) can grasp and apply this principle is crucial. These models achieve impressive performance across a wide array of language modeling tasks. Nonetheless, they often turn out to rely on shallow heuristics or exhibit other issues in handling specific aspects of language.

A prominent bias in the body of research is that the vast majority of work on language models has concentrated on English. This focus raises concerns about the generalizability of findings to other languages, which may be structurally different from English. Conducting similar experiments in other languages could provide valuable context and material for comparison, potentially highlighting language-specific effects or revealing new generalizations. We therefore decided to undertake a new experiment focusing on Italian negation.

Thus, in this article, we aim to explore whether the behavior of PLMs accurately models the polarity of sentences. We will investigate how the addition of negation to a sentence can alter its overall meaning (demonstrating the models' capability to handle shifts in meaning due to structural changes).

Given the limitations explained above, our work deliberately concentrates on Italian. This choice not only addresses the need to explore how these models perform with languages other than English but also serves as a critical test for PLMs dedicated to Italian. We suspect that these models may not be as advanced or effective as their English counterparts, highlighting the need for further developments outside English.

We adapt the test set developed for English by Kletz et al. [2] to Italian, creating the Self-Contained Italian Neg Set (SCIN Set). Using the dataset to evaluate bert- and roberta-based models for Italian, we find that these models are unable to adjust their predictions in response to constraints posed by negation, often generating contradictory text.

The article is structured as follows. The rest of Section 1 will introduce compositional phenomena and Italian negation in particular. Section 2 will briefly review related work. Section 3 will detail the composition of the SCIN Set. Section 4 will present the tests conducted on several bert-based Italian models using the SCIN Set; in particular, we tested the following bert-base-cased models:

• bert-base for Italian, both in its basic and its XXL versions (bert-base-italian-cased,
bert-base-italian-xxl-cased)¹ [3],
• m-bert (multilingual bert)² [4],
• alb3rt0³ [5], and
• UmBERTo⁴ [6].

¹ https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
² https://huggingface.co/bert-base-multilingual-cased
³ https://github.com/marcopoli/AlBERTo-it
⁴ https://github.com/musixmatchresearch/umberto

Section 5 will discuss the results, followed by a final section containing our general conclusions and ideas for further research.

2. Related work

Although negation plays an essential role in human communication, it appears to present challenges for PLMs. In recent years, much research has focused on this topic.

2.1. Effect of negation on the model's prediction

Kassner and Schütze [7] and Ettinger [8] analyzed to what extent Transformer-based language models' predictions are sensitive to the presence or absence of negation in sentences involving factual knowledge, such as (1-a-b):

(1) a. Birds can [MASK].
    b. Birds cannot [MASK].

They found that in such pairs the top-1 predictions are unchanged most of the time: models do not seem to take into account the polarity of the environment (presence or absence of a negation in the surrounding sentence) to adapt their predictions. They concluded that models do not deal correctly with negation.

Gubelmann and Handschuh [9] criticized such studies, noting in particular that the pragmatic component was overlooked in Ettinger's experiments. They noted that a statement containing a negation stating a false fact (for example, Birds cannot fly) can be more plausible than a formally true but unusual statement (say, Birds cannot breastfeed). In fact, a vast number of words could potentially fit the negative statement, making it true, many of them with little association with the rest of the sentence. This makes it challenging for any single word to become the top prediction in the negative case.

Gubelmann and Handschuh [9] developed a more pragmatically informed test set, in which each instance is (in [2]'s terms) self-contained. This means that each item in the set includes some context information, allowing direct evaluation of the model's completion. Building on this work, [2] developed the Self-Contained Neg Test, which aimed to address some issues in the test set from [9] and more accurately determine the model's handling of negation without interference of world knowledge.

2.2. The Self-Contained Neg Test

The Self-Contained Neg Test, developed by Kletz et al. [2], is a set of pairs of sentences consisting of a context (C) and a target (T) sentence, either positive (p) or negative (n). The target sentence contains a masked position, syntactically constrained to be filled by a verb (2).

(2) Jessica is an architect who likes to dance. She isn't happy to [MASK].

The instances are designed in such a way that a model that predicts (in the masked position of T) the last verb of C will produce a semantically well-formed paragraph only if C and T have the same polarity. For instance, in (2), the context is positive (Cp), the target is negative (Tn), and as a consequence a model predicting dance in the masked position produces an ill-formed paragraph:

(3) #Jessica is an architect who likes to dance. She isn't happy to dance.

In contrast, a CnTn version of (3) would accept the verb dance in the same position:

(4) Jessica is an architect who doesn't like to dance. She isn't happy to dance.

To produce the sentences of the set, the pattern (5) is taken as a starting point, where NAME and PRON are substituted with a proper noun and a compatible third person pronoun, PROF is substituted with a profession name, and ACT is substituted with an action verb.

(5) NAME is a PROF who likes/doesn't like to ACT. PRON is/isn't happy to [MASK].

A large number of triplets (NAME, PROF, ACT) are tested with each model, and the ones that are retained are the ones such that the model's top-1 prediction is the ACT verb itself when C and T are both positive (CpTp). Here for instance, assuming that (6) are a model's predictions, the triplet (Jessica, architect, dance) would be retained while the triplet (Luke, janitor, swim) would not.

(6) a. Jessica is an architect who likes to dance. She is happy to dance.
    b. Luke is a janitor who likes to swim. He is happy to ski.

Once triplets have been selected (the set of all triplets such that the ACT verb is repeated in CpTp instances), CpTn and CnTp instances can be formed, and the expectation is that a model that "understands" negation should not predict the ACT verb in those cases since it would lead to contradictory instances. As a control, two additional configurations are considered: CnTn, where it is expected that the repetition of ACT is possible (though
not required), and CpTv, in which an adverb (very) is inserted in the positive target, which should not change the preferred prediction of ACT since both sentences are positive. The different configurations are illustrated below.

(7) CpTp Jessica is an architect who likes to dance. She is happy to [MASK].
    CpTn Jessica is an architect who likes to dance. She isn't happy to [MASK].
    CnTp Jessica is an architect who doesn't like to dance. She is happy to [MASK].
    CnTn Jessica is an architect who doesn't like to dance. She isn't happy to [MASK].
    CpTv Jessica is an architect who likes to dance. She is very happy to [MASK].

3. SCIN construction

In Italian, negation is most commonly expressed by the negative invariable proclitic non (not) [10]. It is this expression of negation that we use for the Italian adaptation of the Self-Contained Neg Test that we present in this section: the SCIN set.

3.1. Italian patterns

Following the preparation of the Self-Contained Neg Test, we collect a list of Italian verbs, professions and names that will be used to create the triplets to be tested. The verbs are taken from the Dizionario Italiano Sabatini Coletti 2022 (online version); only the intransitive verbs (3138) are retained; among these, for each of the tested models we further exclude the verbs that are not tokenized as a single token. The selected names are the 100 most popular in Italy in 2024⁵. Lastly, the professions are taken from a site specializing in job searches in Italy⁶; of those present on the site, only those consisting of a single word have been selected.

⁵ https://www.nostrofiglio.it/gravidanza/nomi-per-bambini/i-100-nomi-per-bambini-piu-amati-dai-genitori-di-nostrofiglio-it
⁶ https://www.wecanjob.it/pagina9_elenco-professioni.html

The patterns cannot simply be a direct translation of the English patterns into Italian. In fact, for the test to be adequate for evaluating models, we need the masked position to be syntactically constrained to be a verb. This would not be the case if we used a direct translation of the original sentences: for example, the sequence (8) can be completed with the token "questo" (= PRON is happy to do this).

(8) NAME è un PROF che ama ACT. È felice di [MASK].
    NAME is a PROF who loves to ACT. (PRON) is happy to [MASK].

We choose instead to rely on the pair (9), involving a semantic inference relation.

(9) ha l'abitudine di / molto spesso
    is used to / very often

The final form of the SCIN set is shown in Table 1. The shape of the contexts is given in row 1, that of the targets in row 2, and the test target Tv is added in row 3.

Our assumption is that, if the model repeats the ACT token in the CpTp configuration, it is proof that the model has resolved the ha l'abitudine di / molto spesso inference. When confronted with the CpTn or CnTp configuration, the model should have the addition of the negation as the only element that can explain the modification of its predictions. Finally, the CpTv control allows us to check the extent to which the addition of a different, non-negative adverb in the sequence modifies the model's predictions; we can assume that any modification of greater magnitude than that associated with CpTv is due to the influence of negation.

3.2. Pattern selection

The triplets (name, profession, verb) used for testing are selected by testing them on the CpTp configuration: only triplets leading to a repetition of the ACT token are retained (see Table 2). This ensures that only patterns for which the model is already biased towards repetition are tested, and the model has to understand the influence of negation on sentence semantics to reverse this tendency.

All available triplets are tested, i.e. all combinations of the verbs monotokenized by the model, first names and occupations selected in Subsection 3.1. As tokenization is model-dependent, the number of verbs tested is not the same for each model: details are available in the first row of Table 3.

The results of this test are available in Table 3. The results are highly model-dependent: while the bert-base-italian-cased model predicts the ACT token in almost 25% of cases, this happens in only 0.03% of cases for alb3rt0.

4. Testing

4.1. Setup

Tests are performed as in Kletz et al. [11]. Contexts (C) and targets (T) are combined to create two test patterns, CpTn and CnTp; in addition to these two, the test includes two control patterns, CnTn and CpTv, where the repetition of the ACT verb is not contradictory.

All selected triplets are then used to saturate the patterns, and the resulting patterns are provided as inputs to
    pol. | C(ontext)                                         | T(arget)
1   p    | NAME è un(a) PROF che ha l'abitudine di ACT.      | PRON [MASK] molto spesso.
         | (NAME is a PROF who is used to ACT-ing.)          | (PRON [MASK] often.)
2   n    | NAME è un(a) PROF che non ha l'abitudine di ACT.  | PRON non [MASK] molto spesso.
         | (NAME is a PROF who is not used to ACT-ing.)      | (PRON doesn't [MASK] often.)
3   v    | -                                                 | PRON [MASK] davvero molto spesso.
         |                                                   | (PRON [MASK] really often.)

Table 1
Complete list of contexts and targets used to build masked sequences in the SCIN dataset. Masks are always in the target. Contexts and targets can be either positive or negative, and the target can also have an adverb added which is not a negation cue. Patterns are made up of a context and a target, i.e. 5 possible patterns.
the models. Predictions at masked positions are collected. We use drop as a measure of the models' performance: for each pattern, given the rate t_r of repetitions of the ACT token in the predictions, the drop is defined as 100 − t_r. The higher the drop for the CpTn and CnTp patterns and the lower for the CnTn and CpTv controls, the better the model has understood the negation.

Instantiated NAME/PROF: Jessica / Ballerina (Dancer)
Tested verb: Fumare (To smoke)
Tested example: Jessica è una ballerina che ha l'abitudine di fumare. Lei [MASK] spesso.

Model              | Top-1 pred. | Retained?
b-b-italian-c      | fuma        | ✓
b-b-italian-xxl-c  | fuma        | ✓
m-bert             | balla       | no
alb3rt0            | parla       | no

Table 2
An example of selecting a triplet for testing. A NAME/PROF/VERB triplet is used to saturate the CpTp pattern of SCIN. The sequence contains a mask and is used as input to a PLM. If the model prediction is the ACT token, the triplet is retained (indicated by the ✓ symbol). In the names of the models given as examples, "b-b" means bert-base, "it" stands for italian and "c" for cased.

4.2. Results and Discussion

Results are shown in Table 4. In contrast with the observations made by [8] and [7], the models are not insensitive to the presence of negation in a sentence: all the models show a drop in both configurations CpTn and CnTp, showing an adaptation of their predictions to the presence of a negation cue. This observation is confirmed by the fact that the drops in the CpTv control are always lower than those observed in CpTn or CnTp. This shows that simply adding an adverb is not sufficient to change the model's predictions. While we cannot definitively attribute this to its logical function, the negation marker does exert a distinct influence.

Nevertheless, it is important to emphasize the very clear limitations of these results. Firstly, the drops never exceed 25%, meaning that 75% of the time the model predicts a semantically prohibited token. On the other hand, with the exception of m-bert, all the models have a higher drop for the CnTn control than for the CnTp configuration, thus indicating that even though the models have acquired a certain understanding of negation, this remains superficial and does not, for example, clearly include an understanding of the positive value of a double negation.

A broader examination of the results reveals that while the drops in the CpTn and CnTp configurations increase together, the CnTn controls also show a corresponding increase.

Finally, the training corpus of the models seems to have an influence on their performance. For example, note that the alb3rt0 model is the model obtaining the results least in line with our expectations, while bert-base-italian-xxl-cased and bert-base-italian-cased had better drop values, with the former performing better than the latter. However, these three models have identical numbers of layers, attention heads and hidden sizes, the difference between them consisting only in their training data. The alb3rt0 model was trained exclusively on tweets, which likely limits the diversity of its data, particularly with respect to negation. In contrast, the bert-base-italian-cased and bert-base-italian-xxl-cased models were trained on more varied corpora, with the latter featuring a larger dataset.

In the future, this should lead us to study the correlation between the performance of the models and the fine-grained distribution of negative and affirmative contexts in their training corpus.

5. Comparison with English

In this section we compare the results obtained with the SCIN Set with those observed by [2] in English.
Model               | b-b-it-c | b-b-it-xxl | m-bert | alb3rt0  | UmBERTo
# tested contexts   | 5880000  | 5880000    | 780000 | 18800000 | 280000
Repetitions         | 1498456  | 1236899    | 141609 | 5464     | 93284
%                   | 25.48    | 21.03      | 18.16  | 0.03     | 33.31
# retained contexts | 20000    | 20000      | 19973  | 2088     | 20000

Table 3
Details of the verb sets created for each model. The first line shows the number of triplets available per model, the second the number of these triplets which, in a CpTp configuration, led to a repetition (prediction of the ACT token by the model), and the third line the percentage of triplets this represents. The last line shows how many of the triplets leading to a repetition were retained, the maximum for one model being 20,000. In the column titles, "b-b" means bert-base, "it" stands for italian and "c" for cased.
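The retention step summarized in Table 3 amounts to keeping only the triplets whose CpTp top-1 prediction repeats the (inflected) ACT verb. A minimal sketch follows; here `top1_prediction` is a hypothetical stand-in for a real masked-LM query, and `build_cptp` for the pattern instantiation, neither being the authors' actual code:

```python
# Sketch of the CpTp filtering step: a triplet is retained only if the model's
# top-1 prediction at the masked position is the (inflected) ACT verb itself.

def top1_prediction(sequence):
    # Hypothetical stub: a real implementation would query a masked LM here.
    return "fuma" if "fumare" in sequence else "parla"

def build_cptp(name, prof, act):
    # Hypothetical helper: instantiate the positive-context/positive-target pattern.
    return f"{name} è una {prof} che ha l'abitudine di {act}. Lei [MASK] molto spesso."

def retain_triplets(triplets):
    """Keep (name, prof, act) triplets whose CpTp top-1 prediction is the inflected verb."""
    retained = []
    for name, prof, act, inflected in triplets:
        if top1_prediction(build_cptp(name, prof, act)) == inflected:
            retained.append((name, prof, act))
    return retained

triplets = [("Jessica", "ballerina", "fumare", "fuma"),
            ("Luca", "ballerina", "nuotare", "nuota")]
print(retain_triplets(triplets))  # [('Jessica', 'ballerina', 'fumare')]
```

With a real model behind `top1_prediction`, this loop produces the repetition counts reported in the second line of Table 3.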
Pattern | b-b-it-c | b-b-it-xxl | m-bert | alb3rt0 | UmBERTo
CpTn    | 16.5     | 22.1       | 23.0*  | 9.7     | 9.9
CnTp    | 11.0     | 14.5       | 19.7*  | 4.4     | 11.9
CnTn    | 11.6     | 14.6       | 18.6   | 9.3     | 20.6
CpTv    | 1.3      | 14.3       | 1.0    | 0.2     | 1.7

Table 4
Drops of Italian pretrained language models on the SCIN Set, for each pattern type. In the first two rows a high number is expected (the highest number of each row is starred); in the last two rows a lower number is expected. In the column titles, "b-b" means bert-base, "it" stands for italian and "c" for cased.
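The drop values in Table 4 follow the definition given in Section 4.1 (100 minus the percentage of ACT-token repetitions). As a minimal illustration over hypothetical prediction lists:

```python
# Sketch of the drop metric: given the model's top-1 predictions for one pattern
# type and the expected ACT tokens, drop = 100 - repetition rate (in percent).

def drop(predictions, act_tokens):
    """Percentage of instances where the ACT token is NOT repeated."""
    assert len(predictions) == len(act_tokens) and predictions
    repeats = sum(p == a for p, a in zip(predictions, act_tokens))
    t_r = 100.0 * repeats / len(predictions)  # repetition rate
    return 100.0 - t_r

# Hypothetical CpTn predictions for four instances: the ACT token is repeated
# three times out of four, so the drop is 25.0.
preds = ["fuma", "fuma", "parla", "fuma"]
acts = ["fuma", "fuma", "fuma", "fuma"]
print(drop(preds, acts))  # 25.0
```

A model that fully adjusted to negation would score a drop near 100 on CpTn and CnTp while staying near 0 on CpTv.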
The scale of the drops in the two articles is notably different: the maximum drop observed in Italian is 23% (CpTn, m-bert), while in English it is 82.8%. Similarly, the CpTv drops of the Italian models hardly exceed 15%, while those of the English models are never less than 25%.

On the other hand, model architecture and type of training do not seem to have a major influence: UmBERTo has the same architecture as roberta-base, but while the latter is the best performing model in [2], the former's drops are the lowest for all configurations of the SCIN Set. Conversely, the other Italian models are built with the same architecture as bert-base-cased, i.e. the worst performing model for English; however, even the worst performing Italian model, namely alb3rt0, features higher drops than bert-base-cased. This confirms the observation from the previous section that, while architecture is indeed a limiting criterion, training data probably plays a significant role.

In general, we note that none of these models, neither for Italian nor for English, shows drops compatible with a full understanding of the semantic constraints of negation.

6. Conclusion

In this paper, we investigated the ability of several Italian PLMs to take negation into account in their predictions. To do this, we adapted to Italian the Self-Contained Neg Test proposed by Kletz et al. [2], which is based on minimal pairs of aligned sentences.

Applying this test to six models enabled us to show that negation modifies their predictions, but that this does not happen consistently or in a way that is always coherent with the semantic effect that we expect negation to have on sentences. These results suggest a strong need to adapt these models to make them more sensitive to negation and its semantic consequences.

Nevertheless, we also noted a fairly marked difference in performance from one model to another, correlated with the different corpora used to train them. We thus suggest that a lexical and statistical study of these corpora could shed further light on the behavior of the models.

Lastly, it would be interesting to compare these results with the performance of generative models, in order to study the relative importance of the number of model parameters in relation to their architecture.

Acknowledgments

We would like to express our gratitude to Marie Candito for her valuable assistance and guidance throughout the course of this study.

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA0001 (PRAIRIE 3IA Institute). This research was also partially funded by the Labex EFL (ANR-10-LABX-0083) and by PNRR–M4C2–Investimento 1.3, Partenariato Esteso PE00000013–"FAIR—Future Artificial Intelligence Research"–Spoke 1 "Human-centered AI," funded by the European Commission under the NextGeneration EU programme.
References

[1] L. R. Horn, H. Wansing, Negation, in: E. N. Zalta, U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy, Winter 2022 ed., Metaphysics Research Lab, Stanford University, 2022.
[2] D. Kletz, P. Amsili, M. Candito, The self-contained negation test set, in: Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, H. Mohebbi (Eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Singapore, 2023, pp. 212–221. URL: https://aclanthology.org/2023.blackboxnlp-1.16. doi:10.18653/v1/2023.blackboxnlp-1.16.
[3] S. Schweter, Italian bert and electra models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[5] M. Polignano, V. Basile, P. Basile, M. de Gemmis, G. Semeraro, AlBERTo: Modeling italian social media language with bert, IJCoL 25 (1984) 11–31. URL: https://doi.org/10.4000/ijcol.472.
[6] L. Parisi, S. Francia, P. Magnani, Umberto: an italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
[7] N. Kassner, H. Schütze, Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly (2020). URL: https://aclanthology.org/2020.acl-main.698.
[8] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2019) 34–48. URL: https://doi.org/10.1162/tacl_a_00298.
[9] R. Gubelmann, S. Handschuh, Context matters: A pragmatic study of PLMs' negation understanding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4602–4621. URL: https://aclanthology.org/2022.acl-long.315.
[10] L. Renzi, L. G. Salvi, A. Cardinaletti, Grande grammatica italiana di consultazione, volume 2, Il Mulino, Bologna, 2001.
[11] D. Kletz, M. Candito, P. Amsili, Probing structural constraints of negation in pretrained language models, in: The 24th Nordic Conference on Computational Linguistics, 2023. URL: https://openreview.net/forum?id=_7VPETQwnPX.

A. Verb statistics by PLM

Details of the number of monotokenized intransitive verbs available for each PLM tested are given in Table 5.

model                       | monotokenized verbs
bert-base-italian-cased     | 294
bert-base-italian-xxl-cased | 294
m-bert                      | 39
alb3rt0                     | 940
UmBERTo                     | 14

Table 5
Number of Italian intransitive verbs tokenised as a single token for each of the Italian models tested.