Assessing the Asymmetric Behaviour of Italian Large Language Models across Different Syntactic Structures

Elena Sofia Ruzzetti¹,*, Federico Ranaldi¹, Dario Onorati², Davide Venditti¹, Leonardo Ranaldi³, Tommaso Caselli⁴ and Fabio Massimo Zanzotto¹

¹ University of Rome Tor Vergata, Italy
² Sapienza University of Rome, Italy
³ School of Informatics, University of Edinburgh, UK
⁴ University of Groningen, The Netherlands

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author: elena.sofia.ruzzetti@uniroma2.it (E. S. Ruzzetti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract
While LLMs get more proficient at solving tasks and generating sentences, we investigate the role that different syntactic structures play in models' performance on a battery of Natural Language Understanding tasks. We analyze the performance of five LLMs on semantically equivalent sentences that are characterized by different syntactic structures. To correctly solve the tasks, a model is implicitly required to correctly parse the sentence. We find that LLMs struggle when more complex syntactic structures are present, with an average drop of 16.13(±11.14) points in accuracy on the Q&A tasks. Additionally, we propose a method based on token attribution to spot which areas of an LLM encode syntactic knowledge, by identifying the model heads and layers responsible for the generation of a correct answer.

Keywords
LLMs, Natural Language Understanding, Syntax, Attributions, Localization



1. Introduction

Large Language Models (LLMs) excel at understanding and generating text that appears human-written. Thus, it is intriguing to determine whether the models' text comprehension aligns in some way with human cognitive processes. A peculiarity of natural languages is that the same meaning can be encoded by multiple syntactic constructions. In Italian, for instance, the unmarked sentence follows a subject-verb-object (SVO) word order. However, inversions of this ordering do not necessarily lead to ungrammatical sentences. A case in point is represented by cleft sentences, i.e., sentences where the unmarked SVO sequence is violated. This corresponds to specific communicative functions, namely emphasizing a component, and it is obtained by putting one element in a separate clause. In particular, Object Relative Clauses – where the element that is emphasized is the object of the sentence – are difficult to understand [1, 2]. For example, the sentence "Sono i professori che i presidi hanno elogiato alla riunione d'istituto" is more challenging for an Italian speaker than its semantically equivalent unmarked version "I presidi hanno elogiato i professori alla riunione d'istituto", where the SVO order is restored. Similarly, in Nominal Copular constructions, the inversion of subject and verb clause is documented to cause difficulties in understanding the meaning of the sentence [3].

Hence, syntax plays a crucial role not only in the general construction of language but also in native speakers' ability to comprehend sentences: in fact, a correct syntactic parsing of the sentences is necessary to understand their meaning, and some syntactic structures are preferred over others. To what extent this preference is replicated by LLMs needs to be further explored.

If the model shows some knowledge about syntax, there should be an area of the model responsible for it. We aim to detect the area of a model responsible for its syntactic knowledge. Extensive work has been devoted to understanding how Transformer-based architectures encode information, and one main objective is to localize which area of the model is responsible for a certain behavior [4, 5]. Despite its usage as an explanation mechanism being debated [6, 7], the attention mechanism is an interesting starting point given its wide use in the Transformer architecture. While the attention weights alone cannot be used as an explanation of a model's behavior [8, 9], an analysis that includes multiple components of the attention module has been shown to be beneficial to obtain an interpretation of how a model processes an input sentence [10, 11].

Probing is a common method used to detect the presence of linguistic properties in models [12]. Probing consists of training an auxiliary classifier on top of a model's internal representation, which could be the output of a specific layer, to determine which linguistic property the model has learned and encoded. In particular, it has been proposed to probe Transformer-based models to reconstruct syntactic representations
like dependency parse trees from their hidden states [13]. Probing tasks concluded that syntactic features are encoded in the middle layers [14]. Correlation analysis on the weight matrices of monolingual BERT models confirmed the localization of syntactic information in the middle layers, showing that models trained on syntactically similar languages were similar in their middle layers [15]. While an altered word order seems to play a crucial role in Transformer-based models' ability to process language [16, 17], the correlation between LLMs' downstream performance and the encoding of syntax needs to be further explored.

In this paper, we initially examine how syntax influences the LLMs' capability of understanding language. To achieve this, we analyze five open-weight LLMs – trained on the Italian language either from scratch or during a finetuning phase – and measure their performance in question-answering (Q&A) tasks that require an implicit parsing of the roles of words in the sentence to provide the correct answer. We use an available set of Q&A tasks designed for Italian speakers [1] and propose similar template-based questions for two other datasets of Italian sentences characterized by different syntactic structures (Section 2.1). The results show that the models are affected by the different syntactic structures in solving the proposed tasks (Section 3.1): LLMs struggle when more complex syntactic structures are present, with an average drop in accuracy of 16.13(±11.14) points.

We then propose a method – based on norm-based attribution [10] – to localize where syntactic knowledge is encoded by identifying the models' attention heads and layers that are responsible for the generation of a correct answer (Section 2.2). Although some differences can be observed across the five LLMs, we notice that syntactic information is more widely included in the middle and top layers of the models.

2. Methods and Data

2.1. Question-answering Tasks to assess LLMs' Syntactic Abilities

In this Section, we introduce the dataset we collected – largely extracted from the AcCompl-It task [18] in EVALITA 2020 [19] – to assess LLMs' syntactic abilities. The dataset is split into three subdatasets. Each subdataset is composed of pairs of sentences that share the same meaning but differ in word order. One of the sentences in each pair is characterized by a simpler structure, easier to understand also for humans, while the second is characterized by an alternative – but still correct – syntactic structure. We aim to understand whether a different structure can influence model performance in processing those similar sentences. We define, for each subdataset, a Q&A task to assess the LLMs' capabilities in understanding sentences when their syntactic structure makes them more complex. The Q&A task requires the model to implicitly parse the role of the words in the sentence to get the correct answer: for this reason, we identify some important words that the model should attend to while producing the correct answer.

Object Clefts constructions. The first subdataset is derived from Chesi and Canal [1]: this dataset contains 128 sentences characterized by Object Cleft (OC) constructions. The OC sentences in this dataset all share the same structure (see Table 1): the object and subject are words indicating either a person or a group of people, and the predicate describes an action that the subject performs towards the object. The object is always introduced as the first element of the sentence, in a left-peripheral position. The displacement of the object in the left-peripheral position makes the OC harder to understand [2]. We compare those sentences with semantically equivalent ones that preserve the unmarked SVO word order.

To assess whether the difficulty humans have in understanding Object Cleft sentences can also be registered in LLMs for the Italian language, we tested them on the same Q&A task that Chesi and Canal [1] proposed to human subjects. Given one OC sentence, the model is prompted with a yes or no question asking whether one of the participants (subject or object) was involved in the action described by the predicate (see Table 1 for an example). The ability of a model to comprehend cleft sentences can be measured as the accuracy it obtains on this Q&A task. Moreover, we perform the same Q&A task on SVO sentences that we directly derived from the OC clauses in Chesi and Canal [1]: in this case, we restored the SVO order and produced sentences that are semantically equivalent to the corresponding OC (see Table 1).

To correctly solve the task, the model must interpret the role of the nouns of the sentence playing the roles of subject and object to answer the comprehension question. Hence, the model should implicitly parse the sentence and focus on those relevant words during the generation of the answer.

The Copular Constructions. The second subdataset – which includes 64 pairs of sentences – is derived from a study involving Nominal Copular constructions (NC) from Greco et al. [20]. The NC sentences are composed of two main constituents: a Determiner Phrase (DP_subj) and a Verbal Phrase (VP). The verbal phrase contains a copula and another Determiner Phrase that acts as the nominal part of the predicate (DP_pred). In this dataset, the effect of the position of the subject with respect to the copular predicate is studied.

  OC            Sono i professori     che i presidi     hanno elogiato    alla riunione d'istituto
                Copula + Obj          Subj              Predicate         PP
  SVO           I presidi             hanno elogiato    i professori      alla riunione d'istituto
                Subj                  Predicate         Obj               PP
  Question      Qualcuno ha elogiato i professori alla riunione? or I presidi hanno elogiato qualcuno alla riunione?

  NC inverse    La causa              della rivolta     sono              le foto            del muro
                noun of DP_pred       PP_pred           Copula            Subject            PP_subj
  NC canonical  Le foto               del muro          sono              la causa           della rivolta
                Subject               PP_subj           Copula            noun of DP_pred    PP_pred
  Question      Di che cosa le foto sono la causa?

  MVP post      Hanno mangiato        le bambine        il dolce
                Predicate             Subj              Obj
  MVP pre       Le bambine            hanno mangiato    il dolce
                Subj                  Predicate         Obj
  Question      Chi ha mangiato qualcosa? or Cosa è stato mangiato?

Table 1
Examples from the dataset under investigation. For each subdataset, an example is composed of two semantically equivalent sentences that differ from the syntactic point of view, and a comprehension question on them.
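To make the shape of the data concrete, one OC/SVO pair from Table 1 can be represented as a plain record; a minimal Python sketch, where the field names and the gold answer shown are our own illustrative choices, not the dataset's:

```python
# One OC/SVO example pair with its comprehension question. Field names and
# the gold label are illustrative assumptions, not the dataset's own schema.
example = {
    "oc": "Sono i professori che i presidi hanno elogiato alla riunione d'istituto",
    "svo": "I presidi hanno elogiato i professori alla riunione d'istituto",
    "question": "Qualcuno ha elogiato i professori alla riunione?",
    "gold": "SI",  # assumed: the professors were indeed praised
    "relevant_words": ["presidi", "professori"],  # subject and object nouns
}
```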



Two semantically equivalent sentences are presented for each example. In one case, the sentence presents a canonical structure (NC canonical), with the subject (DP_subj) preceding the copular predicate. In the second case, an inverse structure (NC inverse) – with the subject following the predicate and the DP_pred introduced as the first element of the sentence – is presented (see Table 1). NC inverse sentences are syntactically correct but are harder to understand for humans than the NC canonical ones [3].

The structure of the sentences in this dataset is enriched by two Prepositional Phrases, one in each of the Determiner Phrases. The DP_subj includes a subject accompanied by an article and augmented with a Prepositional Phrase (PP_subj) that features a complement referring to the subject. Similarly, the DP_pred consists not only of a noun and an article but is further enriched with another Prepositional Phrase, PP_pred. The PP_pred gives more information about the relation between the subject noun and the nominal part of the predicate.

We exploit the different roles of the two Prepositional Phrases to design a Q&A task on NC canonical and NC inverse sentences and hence assess whether a more complex syntactic structure can influence LLMs' capabilities. Given an NC sentence, the model is asked to correctly interpret the meaning of the sentence by examining its predicate: in particular, the model is asked to predict the additional information related to the nominal predicate – which is included in the PP_pred – by answering a "wh-" question (in Italian, "Di cosa", see the example in Table 1). While both Prepositional Phrases answer a wh-question, only the PP_pred is related to the predicate of the sentence; hence the model should be able to predict the PP_pred and ignore the PP_subj.

To solve the proposed task and to properly understand NC sentences, humans and LLMs are required to implicitly parse the sentence and accurately identify the nominal part of the verbal phrase and, in particular, the Prepositional Phrase that it contains (PP_pred).

Minimal Verbal Structure with Inversion of Subject and Verb. Finally, the last subdataset we investigate is derived from Greco et al. [20] and contains sentences characterized by a minimal verbal structure (MVP). MVP sentences are composed of a subject, a predicate and – for sentences with transitive predicates – an object (see Table 1). In this subdataset, the inversion of the subject and the verb is studied: the pairs of sentences under investigation have the same meaning (and lexicon), but in one case the subject of the sentence follows the predicate (MVP post) while in the other the subject precedes the predicate (MVP pre). The latter configuration, in Italian, is more common than the former: we aim to investigate whether this syntactic variation can alter the performance of an LLM.

We define, for each pair of sentences, a question that asks the model to predict which element of the sentence is involved in a certain action, either as the subject entity or the object. In particular, for sentences that contain intransitive verbs, the model is always asked to predict the subject of the sentence, while in transitive cases (like the one in Table 1) the model is asked to predict either the subject or the object of the sentence. For this subdataset, while the original data included both declarative and interrogative sentences, we retained only the declarative ones: we test the models with a total of 192 sentence pairs.

To answer those questions, the relevant words – both for humans and LLMs – are the nouns that play the role of subject, or object if present, in the sentences. In the next Section, we describe how it is possible to quantify whether a model is able to identify the role of those words during the generation of the answer.
                     Qwen2-7B      LLaMAntino-3-ANITA-8B           Llama-2-7b      modello-italia-9b      Meta-Llama-3-8B
     OC                75.78               76.56                      57.81             56.25                  64.84
     SVO               89.06               83.59                      66.41             71.09                   80.4
     NC inverse        62.50               78.12                      15.62             82.81                  81.25
     NC canonical      81.25               84.38                      62.50             93.75                  87.50
     MVP post          72.92               77.6                       70.31             50.52                  69.79
     MVP pre           97.92               98.44                      92.19             53.12                  95.83
Table 2
Models' accuracy on the different subdatasets for the proposed Q&A tasks. Models tend to produce less accurate answers when exposed to rarer syntactic structures.
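The aggregate figures quoted in Section 3.1 can be recomputed directly from Table 2; a minimal sketch (using the rounded table values, so the results match the reported 11.88(±3.84) only up to rounding):

```python
import statistics

# Accuracies from Table 2 (model order: Qwen2-7B, LLaMAntino-3-ANITA-8B,
# Llama-2-7b, modello-italia-9b, Meta-Llama-3-8B).
acc = {
    "OC":  [75.78, 76.56, 57.81, 56.25, 64.84],
    "SVO": [89.06, 83.59, 66.41, 71.09, 80.40],
}

# Per-model accuracy drop from the unmarked (SVO) to the marked (OC) order.
drops = [svo - oc for svo, oc in zip(acc["SVO"], acc["OC"])]
print(round(statistics.mean(drops), 2),   # ~11.86 with the rounded values
      round(statistics.stdev(drops), 2))  # ~3.83, sample standard deviation
```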



2.2. Localizing Syntactic Knowledge via Attributions

Knowing which sentence structures are easier or more difficult for a model to analyze is not enough. Considering the black-box nature of these models, it is essential to understand which layers are responsible for encoding syntax, thus making the models more interpretable.

We hypothesize that there is an area of the model responsible for correctly analyzing the sentence from the syntactic point of view in order to get the answer to the Q&A task. In fact, as discussed in the previous Section, to answer correctly, the model needs to implicitly parse the roles of the words in the sentence and identify the relevant words for the response (subjects and objects in the questions on OC, SVO and MVP sentences, and the correct prepositional phrases in NC sentences). Hence, knowledge of syntax is required to identify the relevant words and, consequently, generate the correct answer.

In generating the answer, we expect the model to "focus" on those relevant words. We can identify which tokens the model focuses on during generation by measuring token-to-token attributions [8, 10]. In fact, token-to-token attribution methods quantify the influence of one token on the generation of another. We argue that the part of the model architecture most aware of syntax is the one that systematically focuses on relevant words when the model is prompted to answer syntax-related questions. Kobayashi et al. [10] demonstrate that a mechanism – called norm-based attribution – that also incorporates the dense output layer of the attention mechanism is an accurate metric for token-to-token attribution. We refer to the matrix A^h(X) – computed for the attention head h on a sequence X – as an attribution matrix. Some examples and a more detailed description of norm-based attribution can be found in the Appendix (A.1). The attribution matrix A^h(X), for each sequence of tokens X, describes where the model focuses during the generation of each token. By examining all the attention heads, some of them may focus more often on the subject, the object, or the prepositional phrase in the predicate while generating the answer for the task. In particular, for each attention head h, we consider the tokens attributed for the answer produced by the model: for each correct answer generated by the model, we count the number of times the tokens with the largest attribution value are the relevant ones. This measures the accuracy of the attention head h in recognizing the relevant words to generate the answer.

The more often an attention head focuses on the relevant words, the more syntactic knowledge the head encodes. For each downstream task presented in Section 2.1, we collect the accuracy of all heads at all layers. Then, we identify a head as "responsible" for generating the target word in a task if its score is significantly higher than the average score for that task. Specifically, we assume a Gaussian distribution of the scores for each task and identify a head as responsible if the probability of observing a value at least as extreme as the one observed is below a threshold α = 0.05. We also consider responsible all heads that obtain an excellent accuracy score (greater than 0.9) in focusing on the relevant words. With this procedure, for each layer and task, we can localize the responsible heads and determine where the model encodes syntax the most (a code sketch of this selection procedure is given after Section 2.3).

2.3. Models and Prompting Method

We focus on instruction-tuned LLMs, all of comparable size, and trained – either from scratch or only fine-tuned – on the Italian language. The models¹ under investigation are Qwen2-7B [22], LLaMAntino-3-ANITA-8B [23], Llama-2-7b [24], modello-italia [25], and Meta-Llama-3-8B [26]. To solve the Q&A tasks, we prompted each model with 4 different – but semantically equivalent – instructions. The complete list of the prompts is in Appendix A.2. All prompts ask the model to solve the task in zero-shot by answering only with one or two words. At most 128 tokens are generated, with greedy decoding. Once the generation is completed, a manual check of the responses is performed to obtain a simplified response to be compared with the gold answer. For the subsequent analysis, for each model and task, only the prompt for which the highest accuracy is obtained is considered.

¹ All model parameters are available through Huggingface's transformers library [21].
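The head-selection procedure of Section 2.2 can be summarized in a few lines; a minimal sketch under our reading of that Section, where `head_scores` is assumed to already hold the per-head accuracies defined above:

```python
import numpy as np
from scipy.stats import norm

def responsible_heads(head_scores: np.ndarray, alpha: float = 0.05,
                      excellent: float = 0.9) -> np.ndarray:
    """Flag 'responsible' attention heads for one task.

    head_scores[l, h] is the fraction of correct answers for which head h
    of layer l placed its largest attribution on a relevant token.
    """
    mu, sigma = head_scores.mean(), head_scores.std()
    # One-sided p-value under the Gaussian assumption of Section 2.2:
    # probability of a score at least as extreme as the observed one.
    p_values = norm.sf(head_scores, loc=mu, scale=sigma)
    # Responsible if significantly above average, or simply excellent.
    return (p_values < alpha) | (head_scores > excellent)

# Per-layer counts, as plotted in Figure 1:
# counts_per_layer = responsible_heads(head_scores).sum(axis=1)
```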
Figure 1: Number of responsible heads per layer in the Q&A task defined over NC sentences. The higher the number of responsible heads, the more the layer as a whole focuses on syntax.



3. Experiments and Results

We initially review the models' accuracy on the question comprehension task and assess the models' capabilities when different syntactic structures are involved (Section 3.1). Then, we aim to spot the layers responsible for the correct syntactic understanding of the sentences (Section 3.2).

3.1. Models' accuracy on the question-answering tasks

Results on each of the subdatasets show that the syntactic structure of a sentence influences the models' understanding of that sentence (see Table 2): across all tasks, LLMs tend to obtain higher accuracy on sentences characterized by an unmarked syntactic structure.

On the first task, involving OC and SVO sentences, the models tend to struggle, especially on the OC sentences. On OC sentences, some models, in fact, do not perform far from the random baseline of 50% accuracy ("yes" and "no" answers are balanced). When comparing OC and SVO sentences, on average, the model accuracy drops by 11.88(±3.84) points when the sentence presents the object in the left-peripheral position. This result aligns with the difficulty that humans encounter in understanding those sentences. The model that achieves the highest accuracy on OC sentences is LLaMAntino-3-ANITA-8B, with an accuracy of 76.56. It is important to note that this model's performance increases by 11.72 points with respect to the corresponding Meta-Llama-3-8B (which achieves an accuracy of 64.84): these results stress the effectiveness of finetuning for the Italian language. Across the LLaMA-based models, LLaMAntino-3-ANITA-8B is still the best performing model, followed by Meta-Llama-3-8B and, with a larger gap, by Llama-2-7b. The Qwen2-7B model is the best at answering the task on unmarked (SVO) sentences.

On the NC sentences, patterns similar to the ones observed in the previous subdataset emerge. In particular, the NC inverse sentences are harder than the corresponding NC canonical ones: the average model accuracy is 81.88(±11.78) on NC canonical sentences, while the accuracy on NC inverse sentences is much lower, with an average value of 64.06(±28.26). Also in this case, the results demonstrate that models are affected by different syntactic patterns. The model that best captures the right information to extract is modello-italia-9b, on both NC inverse and NC canonical sentences. Although the performance of Llama-2-7b is rather low on inverse NC sentences (the model tends to generate the PP_subj very often), the remaining LLaMA-based models achieve better performance on both tasks.

Finally, results on the MVP task further confirm the models' behavior observed on the previous two tasks: the inversion of the subject and verb positions causes the models to perform worse on MVP post sentences (68.23(±10.37) average accuracy) than on MVP pre sentences (87.5(±19.38) average accuracy). The average drop in performance is larger than in the previous subtasks: these results confirm that the inversion of the subject, even in basic sentences, can degrade models' understanding. Modello-italia-9b – probably due to the limited length of the input sentences – tends to replicate the input sentences. The other models solve the task with excellent accuracy on MVP pre sentences.

3.2. Localizing Layers responsible for Syntax

After quantifying the impact of different syntactic structures on model performance, we can identify the attention heads and layers of the models that mostly encode syntax. In Figure 1, the number of responsible heads at each layer of the models is reported for the Q&A task on NC sentences (the remaining tasks are in Appendix A.3).

The general trend is that the layers most active in identifying
relevant words during response generation are those between layers 19 and 25. Moreover, for all models, the layers we identify as responsible often handle multiple syntactic structures. The most noticeable result is that, for the same task, the same activation trend emerges across all sentences.

A large number of responsible attention heads appears around layers 19 to 27 in LLaMAntino-3-ANITA-8B and Meta-Llama-3-8B. Layer 21, in particular, is the layer with the most responsible heads in both the NC and MVP tasks. This layer is predominant also in the OC task, together with layers 19 and 22 (Figure 3a). For Llama-2, we observe the same pattern, as the most active layers are between 18 and 25. On the Qwen2-7B model and modello-italia-9b, active layers are higher in the architecture: from layer 18 to 24 for Qwen2-7B (with layer 23 being the most active in the NC and MVP tasks) and from layer 21 to 31 on NC and MVP sentences for modello-italia-9b. This finding suggests a different interpretation of LLM layers from what was previously observed in BERT [27].

While we could expect some correlation between the accuracy on the task and the capability of the model to identify the correct word in the sentence, the responsible heads appear to be shared across different syntactic structures. Those results suggest that some layers, more than others, encode syntactic information about the role of a word in a sentence. Moreover, different models and architectures seem to share a rather similar organization.

4. Conclusions

In this paper, we have investigated how semantically equivalent sentences are processed by LLMs in Italian when their syntax differs. We tested LLMs trained on Italian – or with Italian data in the pre-training material – and measured their capabilities in a battery of Q&A tasks that rely on parsing the correct role of words in a sentence to be solved. Our findings confirm that cleft sentences and constructions with an inversion of subject and verb are difficult to understand also for LLMs – similarly to what is observed for humans. Furthermore, using token-to-token attribution, we have systematically identified that syntactic information tends to be encoded in the middle and top layers of LLMs.

References

[1] C. Chesi, P. Canal, Person features and lexical restrictions in Italian clefts, Frontiers in Psychology 10 (2019) 2105.
[2] J. King, M. A. Just, Individual differences in syntactic processing: The role of working memory, Journal of Memory and Language 30 (1991) 580–602.
[3] P. Lorusso, M. P. Greco, C. Chesi, A. Moro, et al., Asymmetries in extraction from nominal copular sentences: a challenging case study for NLP tools, in: Proceedings of the Sixth Italian Conference on Computational Linguistics CLiC-it 2019 (Bari, November 13-15, 2019), CEUR, 2019.
[4] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
[5] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A primer on the inner workings of transformer-based language models, 2024. URL: https://arxiv.org/abs/2405.00208. arXiv:2405.00208.
[6] S. Jain, B. C. Wallace, Attention is not Explanation, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3543–3556. URL: https://aclanthology.org/N19-1357. doi:10.18653/v1/N19-1357.
[7] S. Wiegreffe, Y. Pinter, Attention is not not explanation, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 11–20. URL: https://aclanthology.org/D19-1002. doi:10.18653/v1/D19-1002.
[8] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, in: T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes (Eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Florence, Italy, 2019, pp. 276–286. URL: https://aclanthology.org/W19-4828. doi:10.18653/v1/W19-4828.
[9] S. Serrano, N. A. Smith, Is attention interpretable?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2931–2951. URL: https://aclanthology.org/P19-1282. doi:10.18653/v1/P19-1282.
[10] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Attention is not only a weight: Analyzing transformers with vector norms, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7057–7075. URL: https://aclanthology.org/2020.emnlp-main.574. doi:10.18653/v1/2020.emnlp-main.574.
[11] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Incorporating Residual and Normalization Layers into Analysis of Masked Language Models, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 4547–4568. URL: https://aclanthology.org/2021.emnlp-main.373. doi:10.18653/v1/2021.emnlp-main.373.
[12] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49–72. URL: https://aclanthology.org/Q19-1004. doi:10.1162/tacl_a_00254.
[13] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4129–4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[14] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651–3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.
[15] E. S. Ruzzetti, F. Ranaldi, F. Logozzo, M. Mastromattei, L. Ranaldi, F. M. Zanzotto, Exploring linguistic properties of monolingual BERTs with typological classification among languages, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 14447–14461. URL: https://aclanthology.org/2023.findings-emnlp.963. doi:10.18653/v1/2023.findings-emnlp.963.
[16] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, D. Kiela, Masked language modeling and the distributional hypothesis: Order word matters pre-training for little, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 2888–2913. URL: https://aclanthology.org/2021.emnlp-main.230. doi:10.18653/v1/2021.emnlp-main.230.
[17] M. Abdou, V. Ravishankar, A. Kulmizev, A. Søgaard, Word order does matter and shuffled language models know it, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 6907–6919. URL: https://aclanthology.org/2022.acl-long.476. doi:10.18653/v1/2022.acl-long.476.
[18] D. Brunato, C. Chesi, F. Dell'Orletta, S. Montemagni, G. Venturi, R. Zamparelli, et al., AcCompl-it@EVALITA2020: Overview of the acceptability & complexity evaluation task for Italian, in: CEUR Workshop Proceedings, CEUR-WS.org, 2020.
[19] EVALITA 2020 — evalita.it, https://www.evalita.it/campaigns/evalita-2020/, 2020.
[20] M. Greco, P. Lorusso, C. Chesi, A. Moro, Asymmetries in nominal copular sentences: Psycholinguistic evidence in favor of the raising analysis, Lingua 245 (2020) 102926.
[21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art Natural Language Processing, ArXiv abs/1910.0 (2019).
[22] Qwen Team, Qwen2 technical report (2024).
[23] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.
[24] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. URL: https://arxiv.org/abs/2307.09288. arXiv:2307.09288.
[25] iGenius | Large Language Model — igenius.ai, https://www.igenius.ai/it/language-models, 2024.
[26] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[27] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593–4601. URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.

A. Appendix

A.1. Token-to-token norm-based attribution

As described in Section 2.2, we adopt norm-based token-to-token attribution to spot the most relevant words during the generation of the answer in LLMs on our tasks. The norm-based approach is proposed in Kobayashi et al. [10]. Given the query weight matrix W_Q^h, key weight matrix W_K^h, value weight matrix W_V^h and the attention output weight matrix W_O^h of an attention head h, the norm-based attribution for each token of a sequence X is calculated as the product of the attention weights and the norm of the projected token representation X W_V^h W_O^h (see the original work, Kobayashi et al. [10], for a detailed discussion):

$$A^h(X) := \mathrm{softmax}\!\left(\frac{XW_Q^h\,(XW_K^h)^\top}{\sqrt{d_v}}\right) \cdot \left\| XW_V^h W_O^h \right\|$$

For our analysis, we consider all rows relative to tokens in the answer generated by the model. To assess whether a model understands the syntactic relationship between words, it must focus on relevant words during the generation. In particular, the token with the highest attribution should be one belonging to the relevant word. For example, in Figure 2, the attribution of Meta-Llama-3-8B on one NC sentence is presented. During the generation of the answer (the tokens of the answer index the rows in the figure), the most attributed tokens belong to the relevant words in the input (the tokens of the input index the columns). A code sketch of this computation is given below.
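A minimal PyTorch sketch of A^h(X) for a single head, under our reading of the formula above, with per-head weight slices passed in as plain tensors; the causal mask is our addition, appropriate for the decoder-only models studied here but not part of the formula as stated:

```python
import torch
import torch.nn.functional as F

def attribution_matrix(X, W_Q, W_K, W_V, W_O, d_v):
    """Norm-based attribution A^h(X) for one attention head.

    X:             (seq_len, d_model) inputs to the attention block.
    W_Q, W_K, W_V: (d_model, d_head) per-head projection matrices.
    W_O:           (d_head, d_model) this head's slice of the output layer.
    Entry (i, j) estimates the influence of token j on token i.
    """
    scores = (X @ W_Q) @ (X @ W_K).T / d_v ** 0.5
    # Causal mask (our assumption): token i only attends to tokens j <= i.
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    # ||x_j W_V W_O||: one scalar per source token j.
    norms = (X @ W_V @ W_O).norm(dim=-1)
    return attn * norms.unsqueeze(0)

# Toy usage: the top-attributed source token for each generated row.
torch.manual_seed(0)
X = torch.randn(10, 64)
W_Q, W_K, W_V = (torch.randn(64, 16) for _ in range(3))
W_O = torch.randn(16, 64)
A = attribution_matrix(X, W_Q, W_K, W_V, W_O, d_v=16)
print(A.argmax(dim=-1))
```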
A.2. Prompts to Instruction-Tuned LLMs for the Italian Language

Each model has been prompted with four different prompts for each Q&A task (as described in Section 2.1). Here is the complete list of the prompt templates used in our experiments: in the templates, {Item} is the sentence to be analyzed and {Question} is replaced with the corresponding comprehension question (a sketch of how these templates are fed to a model follows the lists).

OC and SVO sentences:
  • Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi SOLAMENTE con SI o NO.
  • Considera la frase: "{Item}". Rispondi con 'SI' o 'NO' alla seguente domanda:"{Question}"
  • Considera la frase: "{Item}". {Question} Rispondi brevemente, SOLAMENTE con 'SI' o 'NO'.
  • Considera la frase: "{Item}". Rispondi con 'SI' o 'NO'. {Question}

NC sentences:
  • Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi in due parole.
  • Considera la frase: "{Item}". Rispondi solo con le due parole che rispondono alla seguente domanda:"{Question}"
  • Considera la frase: "{Item}". {Question} Rispondi SOLO con le due parole che rispondono alla seguente domanda.
  • Considera la frase: "{Item}". Rispondi solo con due parole. {Question}

MVP sentences:
  • Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi solo con un nome.
  • Considera la frase: "{Item}". Rispondi solo con il nome che risponde alla seguente domanda:"{Question}"
  • Considera la frase: "{Item}". {Question} Rispondi SOLO con il nome che risponde alla domanda.
  • Considera la frase: "{Item}". Rispondi solo con un nome. {Question}
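For concreteness, a minimal sketch of the zero-shot setup of Section 2.3 (greedy decoding, at most 128 new tokens) with one of these templates; the checkpoint name is a placeholder for any of the five instruction-tuned models studied, and we assume the model ships a chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in any of the models under study.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")

item = "I presidi hanno elogiato i professori alla riunione d'istituto"
question = "Qualcuno ha elogiato i professori alla riunione?"
prompt = (f'Data la frase "{item}", rispondi alla seguente '
          f'domanda:"{question}" Rispondi SOLAMENTE con SI o NO.')

# Instruction-tuned models expect their own chat template.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)

# Zero-shot, greedy decoding, at most 128 new tokens (Section 2.3).
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:],
                       skip_special_tokens=True))
```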
Figure 2: Norm-based attribution matrix of Meta-Llama-3-8B on one example of the task presented in Section 2.1 on NC sentences.

A.3. Responsible Attention Heads per Layer in each subtask

In Figure 3, the responsible attention heads per layer are depicted. As described in Section 3.2, some layers tend to exhibit a high number of attention heads responsible for the generation. In particular, layers around layer 20 seem to focus more on relevant words for the correct generation of the answer than the others. Since correct generation implies the capability of a model to understand the role of different words, we claim that those layers encode some kind of syntactic information. It is worth noticing that similar layers are responsible for the different subtasks, in particular for the LLaMA-based models and for the Qwen2-7B model.

Figure 3: Number of responsible heads per layer in the Q&A tasks defined over (a) OC and SVO sentences and (b) MVP sentences.