<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Assessing the Asymmetric Behaviour of Italian Large Language Models across Different Syntactic Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Onorati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Venditti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>While LLMs get more proficient at solving tasks and generating sentences, we aim to investigate the role that different syntactic structures have on models' performance on a battery of Natural Language Understanding tasks. We analyze the performance of five LLMs on semantically equivalent sentences that are characterized by different syntactic structures. To correctly solve the tasks, a model is implicitly required to correctly parse the sentence. We found that LLMs struggle when more complex syntactic structures are present, with an average drop of 16.13 (± 11.14) points in accuracy on the Q&amp;A task. Additionally, we propose a method based on token attribution to identify which areas of the LLMs encode syntactic knowledge, by identifying the model heads and layers responsible for the generation of a correct answer.</p>
      </abstract>
      <kwd-group>
<kwd>LLMs</kwd>
        <kwd>Natural Language Understanding</kwd>
        <kwd>Syntax</kwd>
        <kwd>Attributions</kwd>
        <kwd>Localization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) excel at understanding
and generating text that appears human-written. Thus,
it is intriguing to determine whether the models’ text
comprehension aligns in some way with human
cognitive processes. A peculiarity of natural languages is that
the same meaning can be encoded by multiple
syntactic constructions. In Italian, for instance, the unmarked
sentence follows a subject-verb-object (SVO) word order.</p>
      <p>
        However, inversions of this ordering do not necessarily lead to ungrammatical sentences. A case in point is represented by cleft sentences, i.e., sentences where the unmarked SVO sequence is violated. This serves a specific communicative function, namely emphasizing a constituent, and it is obtained by putting one element in a separate clause. In particular, Object Relative Clauses – where the element that is emphasized is the object of the sentence – are difficult to understand [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. For example, the sentence “Sono i professori che i presidi hanno elogiato alla riunione d’istituto” (“It is the professors that the principals praised at the school meeting”) is more challenging for an Italian speaker than its semantically equivalent unmarked version “I presidi hanno elogiato i professori alla riunione d’istituto”, where the SVO order is restored. Similarly, in Nominal Copular constructions, the inversion of subject and verb is documented to cause difficulties in understanding the meaning of the sentence [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Hence, syntax plays a crucial role not only in the general construction of language but also in native speakers' ability to comprehend sentences: a correct syntactic parsing of a sentence is necessary to understand its meaning, and some syntactic structures are preferred over others. To what extent this preference is replicated by LLMs needs to be further explored.</p>
      <p>If a model shows some knowledge of syntax, there should be an area of the model responsible for it.</p>
      <p>
        We aim to detect the area of a model responsible for its syntactic knowledge. Extensive work has been devoted to understanding how Transformer-based architectures encode information, and one main objective is to localize which area of the model is responsible for a certain behavior [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Despite its usage as an explanation mechanism being debated [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], the attention mechanism is an interesting starting point given its wide use in the Transformer architecture. While the attention weights alone cannot be used as an explanation of a model's behavior [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], an analysis that includes multiple components of the attention module has been shown to be beneficial for obtaining an interpretation of how a model processes an input sentence [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>Probing is a common method used to detect the presence of linguistic properties in language models [12].</p>
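      <p>As an illustration of the probing setup described above, the following minimal sketch – our own, with synthetic data standing in for real hidden states – trains a logistic-regression probe to decode a binary syntactic property from vector representations; in practice, the features would be hidden states extracted from one layer of the model under analysis.</p>

```python
import numpy as np

# Synthetic stand-in for hidden states: in a real probe these vectors
# would be one layer's hidden states for a batch of sentences.
rng = np.random.RandomState(0)
n, d = 400, 8
labels = rng.randint(0, 2, size=n)   # e.g. a binary syntactic property
states = rng.randn(n, d)
states[:, 0] += 2.0 * labels         # one direction encodes the property

# Minimal logistic-regression probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(states @ w + b)))
    w -= 0.5 * (states.T @ (p - labels)) / n
    b -= 0.5 * float(np.mean(p - labels))

p = 1.0 / (1.0 + np.exp(-(states @ w + b)))
probe_accuracy = float(np.mean((p > 0.5) == labels))
```

      <p>A probe accuracy well above the 0.5 chance level suggests that the property is linearly decodable from those representations.</p>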
      <p>Probing consists of training an auxiliary classifier on top of a model’s internal representation, which could be the output of a specific layer, to determine which linguistic properties the model has learned and encoded. In particular, it has been proposed to probe Transformer-based models to reconstruct syntactic representations, like dependency parse trees, from their hidden states [13]. Probing tasks concluded that syntactic features are encoded in the middle layers [14]. Correlation analysis on the weight matrices of monolingual BERT models confirmed the localization of syntactic information in the middle layers, showing that models trained on syntactically similar languages were similar in their middle layers [15]. While an altered word order seems to play a crucial role in Transformer-based models’ ability to process language [16, 17], the correlation between LLMs’ downstream performance and the encoding of syntax needs to be further explored.</p>
      <p>In this paper, we initially examine how syntax influences the LLMs’ capability of understanding language. To achieve this, we analyze five open-weights LLMs – trained on the Italian language either from scratch or during a finetuning phase – and measure their performance in question-answering (Q&amp;A) tasks that require an implicit parsing of the roles of words in the sentence to provide the correct answer. We use an available set of Q&amp;A tasks designed for Italian speakers [<xref ref-type="bibr" rid="ref1">1</xref>] and propose similar template-based questions for two other datasets of Italian sentences characterized by different syntactic structures (Section 2.1). The results show that the models are affected by the different syntactic structures in solving the proposed tasks (Section 3.1): LLMs struggle when more complex syntactic structures are present, with an average drop in accuracy of 16.13 (± 11.14) points.</p>
      <p>We then propose a method – based on norm-based attribution [<xref ref-type="bibr" rid="ref10">10</xref>] – to localize where syntactic knowledge is encoded by identifying the models’ attention heads and layers that are responsible for the generation of a correct answer (Section 2.2). Although some differences can be observed across the five LLMs, we notice that syntactic information is more widely included in the middle and top layers of the models.</p>
      <sec id="sec-1-1">
        <title>2. Methods and Data</title>
        <sec id="sec-1-1-1">
          <title>2.1. Question-answering Tasks to assess LLMs Syntactic Abilities</title>
          <p>In this Section, we introduce the dataset we collected – largely extracted from the AcCompl-It task [18] at EVALITA 2020 [19] – to assess LLMs’ syntactic abilities. The dataset is split into three subdatasets. Each subdataset is composed of pairs of sentences that share the same meaning but have a different word order. One of the sentences in each pair is characterized by a simpler structure, easier to understand also for humans, while the second is characterized by an alternative – but still correct – syntactic structure. We aim to understand whether a different structure can influence the model performance in processing those similar sentences. We define, for each subdataset, a Q&amp;A task to assess the LLMs’ capabilities in understanding sentences when their syntactic structure makes them more complex. The Q&amp;A task requires the model to implicitly parse the role of the words in the sentence to get the correct answer: for this reason, we identify some important words that the model should attend to while generating the correct answer.</p>
          <p>(Table 1: example sentences and comprehension questions for the OC, SVO, NC inverse, NC canonical, MVP post, and MVP pre conditions.)</p>
          <p>Object Cleft constructions. The first subdataset is derived from Chesi and Canal [<xref ref-type="bibr" rid="ref1">1</xref>]: this dataset contains 128 sentences characterized by Object Cleft (OC) constructions. The OC sentences in this dataset all share the same structure (see Table 1): the object and the subject are words indicating either a person or a group of people, and the predicate describes an action that the subject performs towards the object. The object is always introduced as the first element of the sentence, in a left-peripheral position. The displacement of the object to the left-peripheral position makes the OC harder to understand [<xref ref-type="bibr" rid="ref2">2</xref>]. We compare those sentences with semantically equivalent ones that preserve the unmarked SVO word order.</p>
          <p>To assess whether the difficulty humans have in understanding Object Cleft sentences can also be registered in LLMs for the Italian language, we tested them on the same Q&amp;A task that Chesi and Canal [<xref ref-type="bibr" rid="ref1">1</xref>] proposed to human subjects. Given one OC sentence, the model is prompted with a yes or no question asking whether one of the participants (subject or object) was involved in the action described by the predicate (see Table 1 for an example). The ability of a model to comprehend cleft sentences can be measured as the accuracy it obtains on this Q&amp;A task. Moreover, we perform the same Q&amp;A task on SVO sentences that we directly derived from the OC clauses in Chesi and Canal [<xref ref-type="bibr" rid="ref1">1</xref>]: in this case, we restored the SVO order and produced sentences that are semantically equivalent to the corresponding OC (see Table 1).</p>
          <p>To correctly solve the task, the model must interpret the role of the nouns of the sentence playing the role of subject and object to answer the comprehension question. Hence, the model should implicitly parse the sentences and focus on those relevant words during the generation of the answer.</p>
          <p>The Copular constructions. The second subdataset – which includes 64 pairs of sentences – is derived from a study involving Nominal Copular (NC) constructions from Greco et al. [20]. The NC sentences are composed of two main constituents: a Determiner Phrase and a Verbal Phrase. The Verbal Phrase contains a copula and another Determiner Phrase that acts as the nominal part of the predicate. In this dataset, the effect of the position of the subject with respect to the copular predicate is studied. Two semantically equivalent sentences are presented for each example. In one case, the sentence presents a canonical structure (NC canonical), with the subject preceding the copular predicate. In the second case, an inverse structure (NC inverse) – with the subject following the predicate and the nominal part of the predicate introduced as the first element of the sentence – is presented (see Table 1). NC inverse sentences are syntactically correct but are harder for humans to understand than the NC canonical ones [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
          <p>The structure of the sentences in this dataset is enriched by two Prepositional Phrases, one in each of the Determiner Phrases. The subject Determiner Phrase includes a subject accompanied by an article and augmented with a Prepositional Phrase that features a complement referring to the subject. Similarly, the Determiner Phrase in the predicate consists not only of a noun and an article but is further enriched with another Prepositional Phrase, which gives more information about the relation between the subject noun and the nominal part of the predicate.</p>
          <p>We exploit the different roles of the two Prepositional Phrases to design a Q&amp;A task on NC canonical and NC inverse sentences and hence assess whether a more complex syntactic structure can influence LLMs’ capabilities. Given an NC sentence, the model is asked to correctly interpret the meaning of the sentence by examining its predicate: in particular, the model is asked to predict the additional information related to the nominal predicate – which is included in its Prepositional Phrase – by answering a “wh-” question (in Italian, “Di cosa”, see the example in Table 1). While both Prepositional Phrases answer a wh-question, only the one in the predicate is related to the predicate of the sentence; hence the model should be able to predict the Prepositional Phrase of the predicate and ignore the one attached to the subject.</p>
          <p>To solve the proposed task and to properly understand NC sentences, humans and LLMs are required to implicitly parse the sentence and accurately identify the nominal part of the Verbal Phrase and, in particular, the Prepositional Phrase that it contains.</p>
          <p>Minimal Verbal structure with inversion of Subject and Verb. Finally, the last subdataset we investigate is derived from Greco et al. [20] and contains sentences characterized by a minimal verbal structure (MVP). MVP sentences are composed of a subject, a predicate and – for sentences with transitive predicates – an object (see Table 1). In this subdataset, the inversion of the subject and the verb is studied: the pairs of sentences under investigation have the same meaning (and lexicon), but in one case the subject of the sentence follows the predicate (MVP post) while in the other the subject precedes the predicate (MVP pre). The latter configuration, in Italian, is more common than the former: we aim to investigate whether this syntactic variation can alter the performance of an LLM.</p>
          <p>We define, for each pair of sentences, a question that asks the model to predict which element of the sentence is involved in a certain action, either as the subject entity or as the object. In particular, for sentences that contain intransitive verbs, the model is always asked to predict the subject of the sentence, while in transitive cases (like the one in Table 1) the model is asked to predict either the subject or the object of the sentence. For this subdataset, while the original data included both declarative and interrogative sentences, we retained only the declarative ones: we test the models on a total of 192 sentence pairs.</p>
          <p>To answer those questions, the relevant words – both for humans and LLMs – are the nouns that play the role of subject, or of object if present, in the sentences. In the next Section, we describe how it is possible to quantify whether a model is able to identify the role of those words during the generation of the answer.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Localizing Syntactic Knowledge via Attributions</title>
          <p>Knowing which sentence structures are easier or more difficult for a model to analyze is not enough. Considering the black-box nature of these models, it is essential to understand which layers are responsible for encoding syntax, thus making the models more interpretable.</p>
          <p>We hypothesize that there is an area of the model responsible for correctly analyzing the sentence from the syntactic point of view in order to get the answer to the Q&amp;A task. In fact, as discussed in the previous Section, to answer correctly, the model needs to implicitly parse the roles of the words in the sentence and identify the relevant words for the response (subjects and objects in the questions on OC, SVO and MVP sentences, and the correct prepositional phrases in NC sentences). Hence, knowledge of syntax is required to identify the relevant words and, consequently, generate the correct answer.</p>
          <p>In generating the answer, we expect the model to “focus” on those relevant words. We can identify which tokens the model focuses on during generation by measuring token-to-token attributions [<xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>]. In fact, token-to-token attribution methods quantify the influence of a token on the generation of another. We argue that the part of the model architecture most aware of syntax is the one that systematically focuses on relevant words when the model is prompted to answer syntax-related questions. Kobayashi et al. [<xref ref-type="bibr" rid="ref10">10</xref>] demonstrate that a mechanism – called norm-based attribution – that also incorporates the dense output layer of the attention mechanism is an accurate metric for token-to-token attribution. We refer to the matrix computed for an attention head h over a sequence of tokens as its attribution matrix. Some examples and a more detailed description of norm-based attribution can be found in the Appendix (A.1). The attribution matrix describes, for each sequence of tokens, where the model focuses during the generation of each token. By examining all the attention heads, some of them may focus more often on the subject, the object, or the prepositional phrase in the predicate while generating the answer for the task.</p>
          <p>In particular, for each attention head h, we consider the tokens to be attributed for the answer produced by the model: for each correct answer generated by the model, we count the number of times the tokens with the largest attribution value are the relevant ones. This measures the accuracy of the attention head h in recognizing the relevant words needed to generate the answer.</p>
          <p>The more often the attention head focuses on the relevant words, the more syntactic knowledge the head encodes. For each downstream task presented in Section 2.1, we collect the accuracy of all heads at all layers. Then, we identify a head as “responsible” for generating the target word in a task if its score is higher than the average score for that task. Specifically, we assume a Gaussian distribution of scores for each task and identify a head as responsible if the probability of observing a value at least as extreme as the one observed is below a threshold of 0.05. We also consider as responsible all heads that obtain an excellent accuracy score (greater than 0.9) in focusing on the relevant words. With this procedure, for each layer and task, we can localize the responsible heads and determine where the model encodes syntax the most.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Models and Prompting Method</title>
          <p>We focus on instruction-tuned LLMs, all of comparable size and trained – either from scratch or only fine-tuned – on the Italian language. The models<sup>1</sup> under investigation are Qwen2-7B [22], LLaMAntino-3-ANITA-8B [23], Llama-2-7b [24], modello-italia [25], and Meta-Llama-3-8B [26]. To solve the Q&amp;A task, we prompted each model with 4 different – but semantically equivalent – instructions. The complete list of prompts is in Appendix A.2. All prompts ask the model to solve the task in zero-shot by answering with only one or two words. At most 128 tokens are generated, with greedy decoding. Once the generation is completed, a manual check of the responses is performed to obtain a simplified response to be compared with the gold answer. For the subsequent analysis, for each model and task, only the prompt for which the highest accuracy is obtained is considered.</p>
        </sec>
      </sec>
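      <p>The norm-based attribution used in Section 2.2 can be sketched, under our simplifying assumptions, as follows: for one head, the attribution of input token j to position i is roughly the attention weight alpha_ij scaled by the norm of the value-transformed vector f(x_j) = (x_j W_V) W_O, with bias terms omitted. All shapes and variable names below are illustrative, computed on random matrices rather than a real forward pass.</p>

```python
import numpy as np

rng = np.random.RandomState(1)
seq_len, d_model, d_head = 5, 16, 4

# Random stand-ins for one head's quantities; in practice these come
# from a forward pass of the model being analyzed.
x = rng.randn(seq_len, d_model)           # input token representations
attn = rng.rand(seq_len, seq_len)
attn /= attn.sum(axis=1, keepdims=True)   # attention weights, rows sum to 1
w_v = rng.randn(d_model, d_head)          # value projection of the head
w_o = rng.randn(d_head, d_model)          # this head's slice of the output projection

# Norm-based attribution: scale each attention weight by the magnitude
# of the vector the attended token actually contributes.
f = x @ w_v @ w_o                         # transformed token vectors f(x_j)
norms = np.linalg.norm(f, axis=1)         # norm of f(x_j)
attribution = attn * norms[None, :]       # entry (i, j): influence of token j on position i

# The most influential input token for the last generated position:
top_token = int(np.argmax(attribution[-1]))
```

      <p>Compared with raw attention weights, the scaling by the transformed-vector norm down-weights tokens that receive attention but carry little signal.</p>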
      <sec id="sec-1-2">
        <title>1All models parameters are available on Huggingface’s transformers</title>
        <p>library [21]</p>
      </sec>
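      <p>The head-selection rule of Section 2.2 can be sketched as follows (function names and the synthetic scores are ours): a head’s score is the fraction of correct answers in which its most-attributed token is one of the relevant words, and a head is flagged as responsible when the Gaussian upper-tail probability of its score falls below 0.05 or the score itself exceeds 0.9.</p>

```python
import math
import numpy as np

def head_score(attribution_rows, relevant_positions):
    """Fraction of correct answers in which this head's most-attributed
    input token is one of the relevant words (subject, object, PP)."""
    hits = sum(1 for row in attribution_rows
               if int(np.argmax(row)) in relevant_positions)
    return hits / len(attribution_rows)

def responsible_heads(scores, alpha=0.05, excellent=0.9):
    """Flag heads whose score is an upper-tail outlier under a Gaussian
    fit of all head scores for the task, or exceeds an absolute bar."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std()
    # Upper-tail probability of each z-score under a standard Gaussian.
    tail = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])
    return np.less(tail, alpha) | np.greater(scores, excellent)

# Example: five ordinary heads and one that almost always attends to
# the relevant words; only the last one is flagged as responsible.
flags = responsible_heads([0.30, 0.28, 0.32, 0.31, 0.29, 0.95])
```

      <p>Collecting such flags for every head and layer yields the per-layer counts discussed in Section 3.2.</p>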
    </sec>
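      <p>The prompting protocol of Section 2.3 – four semantically equivalent zero-shot instructions, with only the best-scoring prompt retained per model and task – can be mocked as below; generate_answer is a hypothetical placeholder for a real model call (greedy decoding, at most 128 new tokens), and the prompts and examples are invented for illustration.</p>

```python
# Sketch of the prompt-selection protocol: score each instruction
# variant on a task and keep the best-performing one.

def generate_answer(prompt, sentence, question):
    # Mock: a real implementation would call the model here
    # (greedy decoding, max 128 new tokens).
    return "sì"

def accuracy_for_prompt(prompt, examples):
    correct = 0
    for sentence, question, gold in examples:
        answer = generate_answer(prompt, sentence, question).strip().lower()
        if answer == gold:
            correct += 1
    return correct / len(examples)

def best_prompt(prompts, examples):
    scored = [(accuracy_for_prompt(p, examples), p) for p in prompts]
    return max(scored)[1]  # keep only the highest-accuracy prompt

prompts = ["Rispondi solo sì o no.", "Answer only yes or no."]  # illustrative
examples = [("frase", "domanda", "sì"), ("frase", "domanda", "no")]
chosen = best_prompt(prompts, examples)
```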
    <sec id="sec-2">
      <title>3. Experiments and Results</title>
      <p>We initially review the models' accuracy on the question comprehension task and assess their capabilities when different syntactic structures are involved (Section 3.1). Then, we aim to identify the layers responsible for the correct syntactic understanding of the sentences (Section 3.2).</p>
      <sec id="sec-2-1">
        <title>3.1. Models accuracy on the question-answering task</title>
        <p>Results on each of the subdatasets show that the syntactic structure of a sentence influences the models' understanding of that sentence (see Table 2): across all tasks, LLMs tend to obtain higher accuracy on sentences characterized by an unmarked syntactic structure.</p>
        <p>On the first task, on OC and SVO sentences, the models tend to struggle, especially on the OC sentences. On OC sentences, some models, in fact, do not perform far from the random baseline of 50% accuracy ("yes" and "no" answers are balanced). When comparing OC and SVO sentences, on average, the model accuracy drops by 11.88 (± 3.84) points when the sentence presents the object in the left-peripheral position. This result aligns with the difficulty that humans encounter in understanding those sentences. The model that achieves the highest accuracy on OC sentences is LLaMAntino-3-ANITA-8B, with an accuracy of 76.56. It is important to note that this model's performance increases by 11.72 points with respect to the corresponding Meta-Llama-3-8B (which achieves an accuracy of 64.84): these results stress the effectiveness of the finetuning for the Italian language. Across the LLaMA-based models, LLaMAntino-3-ANITA-8B is still the best performing model, followed by Meta-Llama-3-8B and, with a larger gap, by Llama-2-7b. The Qwen2-7B model is the best at answering the task on unmarked sentences.</p>
        <p>On the NC sentences, patterns similar to the ones observed in the previous subdataset emerge. In particular, the NC inverse sentences are harder than the corresponding NC canonical ones: the average model accuracy is 81.88 (± 11.78) on NC canonical sentences, while the accuracy on NC inverse sentences is much lower, with an average value of 64.06 (± 28.26). Also in this case, the results demonstrate that models are affected by different syntactic patterns. The model that best captures the right information to extract is modello-italia-9b, on both NC inverse and NC canonical sentences. Although the performance of Llama-2-7b is rather low on inverse NC sentences (the model very often tends to generate the Prepositional Phrase of the subject instead), the remaining LLaMA-based models achieve better performance on both tasks.</p>
        <p>Finally, results on the MVP task further confirm the models' behavior observed on the previous two tasks: the inversion of the subject and verb positions causes the models to perform worse on MVP post sentences (68.23 (± 10.37) average accuracy) than on MVP pre sentences (87.5 (± 19.38) average accuracy). The average drop in performance is larger than in the previous subtasks: these results confirm that the inversion of the subject, even in basic sentences, can degrade models' understanding. Modello-italia-9b – probably due to the limited length of the input sentences – tends to replicate the input sentences. The other models solve the task with excellent accuracy on the MVP pre sentences.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Localizing Layers responsible for Syntax</title>
        <p>After quantifying the impact of different syntactic structures on model performance, we can identify the attention heads and layers of the models that mostly encode syntax. In Figure 1, the number of responsible heads at each layer of the models is reported for the Q&amp;A task on NC sentences (the remaining tasks are in Appendix A.3).</p>
        <p>The general trend is that the layers most active in identifying relevant words during response generation lie between layers 19 and 25. Moreover, for all models, the layers we identify as responsible often handle multiple syntactic structures. The most noticeable result is that, for the same task, the same activation trend emerges across all sentences.</p>
        <p>A large number of responsible attention heads appears around layers 19 to 27 in LLaMAntino-3-ANITA-8B and Meta-Llama-3-8B. Layer 21, in particular, is the layer with the most responsible heads in both the NC and MVP tasks. This layer is predominant also in the OC task, concomitant with layers 19 and 22 (Figure 3a). For Llama-2, we observe the same pattern, as the most active layers are between 18 and 25. On the Qwen2-7B model and modello-italia-9b, active layers are higher in the architecture: from layer 18 to 24 for Qwen2-7B (with layer 23 being the most active in the NC and MVP tasks) and from layer 21 to 31 on NC and MVP sentences for modello-italia-9b. This finding suggests a different interpretation of LLMs' layers from what was previously observed in BERT [<xref ref-type="bibr" rid="ref11">27</xref>].</p>
        <p>While we could expect some correlation between the accuracy on the task and the capability of the model to identify the correct word in the sentence, the responsible heads appear to be shared across different syntactic structures. Those results suggest that some layers, more than others, encode syntactic information about the role of a word in a sentence. Moreover, different models and architectures seem to share a rather similar organization.</p>
      </sec>
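        <p>The layer-level view reported in Figure 1 amounts to counting, for each layer, the heads flagged as responsible; a minimal sketch with an invented responsibility matrix (layer and head counts are illustrative, not the models' actual configuration):</p>

```python
import numpy as np

# Invented example: rows are layers, columns are heads; True marks a
# head flagged as responsible for a task (cf. Section 2.2).
responsible = np.zeros((32, 32), dtype=bool)
responsible[19:26, :8] = True                  # a band of active middle layers

per_layer = responsible.sum(axis=1)            # responsible heads per layer
most_active_layer = int(np.argmax(per_layer))  # layer with the most responsible heads
```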
    </sec>
    <sec id="sec-3">
      <title>4. Conclusions</title>
      <sec id="sec-3-1">
        <title>In this paper, we have investigated how semantically</title>
        <p>equivalent sentences are processed by LLMs in Italian
when their syntax difers. We tested LLMs trained on the
Italian - or with Italian data in the pre-trainig material
- and measured how their capabilities in a battery of
Q&amp;A tasks that rely on parsing the correct role of words
in a sentence to be solved. Our findings confirm that
cleft sentences and construction with an inversion of
subject and verb are dificult to understand also for LLMs
- similarly to what observed for humans. Furthermore,
we have identified systematically using token-to-token
attribution that syntactic information tends to be encoded
in the middle and top layers of LLMs.
cessing (EMNLP), Association for Computational tational Linguistics, Online and Punta Cana,
DoLinguistics, Online, 2020, pp. 7057–7075. URL: https: minican Republic, 2021, pp. 2888–2913. URL: https:
//aclanthology.org/2020.emnlp-main.574. doi:10. //aclanthology.org/2021.emnlp-main.230. doi:10.
18653/v1/2020.emnlp-main.574. 18653/v1/2021.emnlp-main.230.
[11] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, [17] M. Abdou, V. Ravishankar, A. Kulmizev, A. Søgaard,
Incorporating Residual and Normalization Lay- Word order does matter and shufled language
moders into Analysis of Masked Language Mod- els know it, in: S. Muresan, P. Nakov, A.
Villavicenels, in: M.-F. Moens, X. Huang, L. Specia, cio (Eds.), Proceedings of the 60th Annual Meeting
S. W.-t. Yih (Eds.), Proceedings of the 2021 Con- of the Association for Computational Linguistics
ference on Empirical Methods in Natural Lan- (Volume 1: Long Papers), Association for
Computaguage Processing, Association for Computational tional Linguistics, Dublin, Ireland, 2022, pp. 6907–
Linguistics, Online and Punta Cana, Domini- 6919. URL: https://aclanthology.org/2022.acl-long.
can Republic, 2021, pp. 4547–4568. URL: https: 476. doi:10.18653/v1/2022.acl-long.476.
//aclanthology.org/2021.emnlp-main.373. doi:10. [18] D. Brunato, C. Chesi, F. Dell’Orletta, S. Montemagni,
18653/v1/2021.emnlp-main.373. G. Venturi, R. Zamparelli, et al., Accompl-it@
[12] Y. Belinkov, J. Glass, Analysis methods in neural lan- evalita2020: Overview of the acceptability &amp;
comguage processing: A survey, Transactions of the As- plexity evaluation task for italian, in: CEUR
WORKsociation for Computational Linguistics 7 (2019) 49– SHOP PROCEEDINGS, CEUR Workshop
Proceed72. URL: https://aclanthology.org/Q19-1004. doi:10. ings (CEUR-WS. org), 2020.</p>
        <p>1162/tacl_a_00254.
[13] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4129–4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[14] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651–3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.
[15] E. S. Ruzzetti, F. Ranaldi, F. Logozzo, M. Mastromattei, L. Ranaldi, F. M. Zanzotto, Exploring linguistic properties of monolingual BERTs with typological classification among languages, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 14447–14461. URL: https://aclanthology.org/2023.findings-emnlp.963. doi:10.18653/v1/2023.findings-emnlp.963.
[16] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, D. Kiela, Masked language modeling and the distributional hypothesis: Order word matters pre-training for little, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2021.
[19] EVALITA 2020 — evalita.it, https://www.evalita.it/campaigns/evalita-2020/, 2020.
[20] M. Greco, P. Lorusso, C. Chesi, A. Moro, Asymmetries in nominal copular sentences: Psycholinguistic evidence in favor of the raising analysis, Lingua 245 (2020) 102926.
[21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace’s Transformers: State-of-the-art Natural Language Processing, ArXiv abs/1910.0 (2019).
[22] Qwen2 technical report (2024).
[23] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: Llamantino-3-anita, 2024. arXiv:2405.07101.
[24] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. URL: https://arxiv.org/abs/2307.09288. arXiv:2307.09288.
[25] iGenius | Large Language Model — igenius.ai, https://www.igenius.ai/it/language-models, 2024.
[26] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Appendix</title>
      <sec id="sec-4-1">
        <title>A.1. Token-to-token norm-based attribution</title>
        <sec id="sec-4-1-1">
          <p>
            As described in Section 2.2, we adopt norm-based token-to-token attribution to identify the most relevant word during the generation of the answer by LLMs on our task. The norm-based approach is proposed in Kobayashi et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Given the query weight matrix W^Q_h, the key weight matrix W^K_h, the value weight matrix W^V_h, and the attention output weight matrix W^O_h of an attention head h, the norm-based attribution for each token of a sequence x is calculated as the product of the attention weights and the norm of the projected token representation x W^V_h W^O_h (see the original work, Kobayashi et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], for a detailed discussion):
A_h(x) := softmax( (x W^Q_h)(x W^K_h)^T / sqrt(d) ) · ‖x W^V_h W^O_h‖
For our analysis, we consider all rows relative to a token in the answer generated by the model. To assess whether a model understands the syntactic relationship between words, it must focus on the relevant words during generation; in particular, the token with the highest attribution should belong to the relevant word. For example, Figure 2 presents the attribution of Meta-Llama-3-8B on one NC sentence: during the generation of the answer (whose tokens index the rows in the figure), the most attributed tokens belong to the relevant words in the input (whose tokens index the columns).
          </p>
          <p>OC and SVO sentences:
• Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi SOLAMENTE con SI o NO.
• Considera la frase: "{Item}". Rispondi con ’SI’ o ’NO’ alla seguente domanda:"{Question}"
• Considera la frase: "{Item}". {Question} Rispondi brevemente, SOLAMENTE con ’SI’ o ’NO’.
• Considera la frase: "{Item}". Rispondi con ’SI’ o ’NO’. {Question}
          </p>
          <p>NC sentences:
• Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi in due parole.
• Considera la frase: "{Item}". Rispondi solo con le due parole che rispondono alla seguente domanda:"{Question}"
• Considera la frase: "{Item}". {Question} Rispondi SOLO con le due parole che rispondono alla seguente domanda.
• Considera la frase: "{Item}". Rispondi solo con due parole. {Question}
          </p>
          <p>MVP sentences:
• Data la frase "{Item}", rispondi alla seguente domanda:"{Question}" Rispondi solo con un nome.
• Considera la frase: "{Item}". Rispondi solo con il nome che risponde alla seguente domanda:"{Question}"
• Considera la frase: "{Item}". {Question} Rispondi SOLO con il nome che risponde alla domanda.
• Considera la frase: "{Item}". Rispondi solo con un nome. {Question}
          </p>
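          <p>
            For clarity, the placeholder substitution described for these templates can be sketched in Python; the helper name and the example sentence below are purely illustrative assumptions, not part of the experimental setup:
          </p>

```python
# One of the OC/SVO prompt templates listed above, as a Python string with named placeholders
TEMPLATE = ('Data la frase "{Item}", rispondi alla seguente '
            'domanda:"{Question}" Rispondi SOLAMENTE con SI o NO.')

def fill_prompt(template: str, item: str, question: str) -> str:
    """Replace the {Item} and {Question} placeholders with the sentence and its question."""
    return template.format(Item=item, Question=question)

prompt = fill_prompt(TEMPLATE,
                     "Il ragazzo ha mangiato la mela",
                     "Il ragazzo ha mangiato la mela?")
print(prompt)
```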
        </sec>
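          <p>
            A minimal numerical sketch of the attribution formula above (illustrative only: the random matrices stand in for a trained head’s weights W^Q_h, W^K_h, W^V_h, W^O_h, and the toy dimensions are assumptions):
          </p>

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def norm_based_attribution(x, W_Q, W_K, W_V, W_O):
    """Norm-based attribution A_h(x) of one attention head, following the formula above:
    attention weights times the norm of each token's value-output projection x W_V W_O."""
    d = W_Q.shape[1]                                            # head dimension in the scaling factor
    weights = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d))     # (n, n) attention weights
    norms = np.linalg.norm(x @ W_V @ W_O, axis=-1)              # (n,) per-token projected norms
    return weights * norms[None, :]                             # (n, n) attribution matrix

# toy sequence: 4 tokens with model dimension 8 and head dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
W_O = rng.normal(size=(4, 8))
A = norm_based_attribution(x, W_Q, W_K, W_V, W_O)
print(A.shape)  # (4, 4): entry (i, j) is how much input token j contributes to token i
```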
      </sec>
      <sec id="sec-4-2">
        <title>A.2. Prompts to Instruction-Tuned LLMs for the Italian Language</title>
        <p>Each model has been prompted with four different prompts for each Q&amp;A task (as described in Section 2.1). The complete list of the prompt templates used in our experiments is reported at the end of Section A.1 above: in each template, {Item} is replaced with the sentence to be analyzed and {Question} with the corresponding comprehension question.</p>
      </sec>
      <sec id="sec-4-3">
        <title>A.3. Responsible Attention Heads per Layer in each subtask</title>
        <sec id="sec-4-3-1">
          <p>In Figure 3, the responsible attention heads per layer are depicted. As described in Section 3.2, some layers tend to show a high number of attention heads responsible for the generation. In particular, layers around layer 20 seem to focus more on the relevant words for the correct generation of the answer than the others. Since correct generation implies that a model is able to understand the role of the different words, we claim that those layers encode some kind of syntactic information. It is worth noticing that similar layers are responsible for the different subtasks, in particular for the LLaMA-based models and for the Qwen-2-7b model.</p>
          <p>Figure 3 panels: (a) OC and SVO sentences; (b) MVP sentences.</p>
        </sec>
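          <p>
            As an illustrative sketch of this per-layer selection (the array layout and the criterion — a head counts as responsible when its maximum attribution falls on a token of the relevant word for every generated token — are our assumptions for the example, not the exact implementation):
          </p>

```python
import numpy as np

def responsible_heads_per_layer(attributions, relevant_token_ids):
    """attributions: array of shape (layers, heads, answer_len, input_len) holding
    norm-based attribution scores; relevant_token_ids: input positions of the relevant word.
    Returns, per layer, how many heads put their maximum attribution on a relevant
    token for every generated answer token."""
    argmax_tokens = attributions.argmax(axis=-1)             # (layers, heads, answer_len)
    hits = np.isin(argmax_tokens, list(relevant_token_ids))  # True where the max is on a relevant token
    responsible = hits.all(axis=-1)                          # (layers, heads): responsible head?
    return responsible.sum(axis=-1)                          # (layers,): count of responsible heads

# toy example: 2 layers, 3 heads, 2 answer tokens, 5 input tokens
rng = np.random.default_rng(1)
attr = rng.random((2, 3, 2, 5))
counts = responsible_heads_per_layer(attr, relevant_token_ids={2, 3})
print(counts.shape)  # (2,): one count of responsible heads per layer
```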
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Canal</surname>
          </string-name>
          ,
          <article-title>Person features and lexical restrictions in Italian clefts</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>2105</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <article-title>Individual differences in syntactic processing: The role of working memory</article-title>
          ,
          <source>Journal of memory and language 30</source>
          (
          <year>1991</year>
          )
          <fpage>580</fpage>
          -
          <lpage>602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lorusso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moro</surname>
          </string-name>
          , et al.,
          <article-title>Asymmetries in extraction from nominal copular sentences: a challenging case study for NLP tools</article-title>
          ,
          <source>in: Proceedings of the Sixth Italian Conference on Computational Linguistics CLiC-it 2019 (Bari, November 13-15</source>
          ,
          <year>2019</year>
          ), CEUR,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kovaleva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <article-title>A primer in BERTology: What we know about how BERT works</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>842</fpage>
          -
          <lpage>866</lpage>
          . URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferrando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bisazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          ,
          <article-title>A primer on the inner workings of transformer-based language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.00208. arXiv:2405.00208.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Attention is not Explanation</article-title>
          , in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>3543</fpage>
          -
          <lpage>3556</lpage>
          . URL: https://aclanthology.org/N19-1357. doi:10.18653/v1/N19-1357.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegrefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pinter</surname>
          </string-name>
          ,
          <article-title>Attention is not not explanation</article-title>
          , in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . URL: https://aclanthology.org/D19-1002. doi:10.18653/v1/D19-1002.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>What does BERT look at? an analysis of BERT's attention</article-title>
          , in: T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes (Eds.),
          <source>Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>286</lpage>
          . URL: https://aclanthology.org/W19-4828. doi:10.18653/v1/W19-4828.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Is attention interpretable?</article-title>
          , in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>2931</fpage>
          -
          <lpage>2951</lpage>
          . URL: https://aclanthology.org/P19-1282. doi:10.18653/v1/P19-1282.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuribayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yokoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>Attention is not only a weight: Analyzing transformers with vector norms</article-title>
          , in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I.</given-names>
            <surname>Tenney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <article-title>BERT rediscovers the classical NLP pipeline</article-title>
          , in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>4593</fpage>
          -
          <lpage>4601</lpage>
          . URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>