<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Analysis on How Pre-Trained Language Models Learn Different Aspects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ejdis Gjinika</string-name>
          <email>e.gjinika@studenti.unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Arici</string-name>
          <email>nicola.arici@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Putelli</string-name>
          <email>luca.putelli@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfonso E. Gerevini</string-name>
          <email>alfonso.gerevini@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
          <email>ivan.serina@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Brescia</institution>
          ,
          <addr-line>Via Branze 38, Brescia, IT</addr-line>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>By now, it is widely known that pre-trained Neural Language Models (NLM) and Large Language Models (LLM) possess remarkable capabilities and are able to solve many Natural Language Processing tasks. However, not as much is understood regarding how Transformer-based models acquire this ability during their complex training process. In this context, an interesting line of work has surfaced in the last few years: the study of the so-called learning trajectories. Several studies tested the knowledge acquired by a model not only when it was fully trained, but also in its checkpoints, i.e. intermediate versions of the model at different stages during its training. Nonetheless, most of these works focused on simple tasks, often analysing single grammatical aspects (such as part-of-speech tags, transitive verbs, etc.) without a proper comparison with more complex tasks and with semantics-based aspects. In this paper, we consider two additional tasks to study the learning trajectory of NLMs and to compare different aspects. The first one consists in classifying a sentence as correct or wrong, from the grammatical point of view, over a novel dataset which can contain several types of errors. The second one is a totally semantic-based task revolving around understanding whether a sentence is funny or not. In our experimental evaluation, we compare the learning trajectories on these two tasks with three simpler grammatical aspects. Thus, we highlight the most important similarities and divergences in terms of how these types of knowledge are learned by three GPT-NeoX models. Moreover, we analyse the behaviour of each layer of the models considered, verifying whether there are significant differences among them.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Explainability</kwd>
        <kwd>Interpretability</kwd>
        <kwd>Learning Trajectory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rise of the Transformer architecture [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the release of the first Large Language Models
(LLMs), the race to create the biggest, the most powerful and the most accurate LLM began. New
generative capabilities, such as few-shot learning, have been explored and new state-of-the-art results
have been obtained in many NLP tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Currently, we can use a countless variety of models, spanning from the smallest ones, which can
efficiently and swiftly tackle basic tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], to the larger models that can handle multiple intricate tasks
with excellent performance. However, our comprehension of the language understanding mechanisms
behind these models, assuming that they understand [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], is limited, as well as the mechanisms behind
their predictions. In recent years, different lines of research have started to focus on the interpretability of Neural
Language Models (NLMs) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] from different angles: by analyzing self-attention weights to find relations
among words [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], by determining whether NLMs have acquired specific world knowledge [8], or by
investigating their linguistic capabilities [9].
      </p>
      <p>A widespread approach for evaluating the capabilities of a NLM is through one or more
probing tasks, i.e. training a simple classifier to verify if a particular language property is contained in
the embedded representation of words and sentences calculated by the model [10, 11]. This technique
obtains good results and valuable insights, showing how NLMs possess knowledge related to syntax and
grammar [9], temporal relations [12], and more semantic related aspects like the presence of metaphors
[13].</p>
      <p>However, one flaw of these works is that they mostly take into account single aspects of the language
(such as syntactic correctness, parts of speech, etc.) or specific models, such as BERT [14], without
creating a proper comparison between different tasks or models. Moreover, most of these studies
evaluated the performance of probing tasks simply in terms of accuracy, a technique that has been
demonstrated to have some shortcomings [15, 16]. Furthermore, it would be quite interesting not only to
understand the capabilities of a fully-trained NLM (as has been done by the aforementioned works),
but also to investigate how these capabilities are achieved and how these concepts are acquired by a
NLM during its training procedure.</p>
      <p>In this work, we put to the test and compare NLMs of different sizes and their capabilities across five
different language properties, ranging from grammar (with classical tasks involving transitive verbs,
passive forms and concordance between verbs and nouns) to semantics, such as asking the model to
detect simple humorous sentences. Moreover, we introduce a task in which the probe consists in
classifying a sentence as correct or wrong, from the grammatical point of view, over a novel dataset
which “mixes” several types of grammatical errors.</p>
      <p>
        To evaluate the performance of NLMs in these tasks, we implement the state-of-the-art Minimum
Description Length (MDL) method [17]. Another important aspect of our work is that we are interested
in when and how a property is learned by a model. Following a relatively new way of analyzing NLMs
[
        <xref ref-type="bibr" rid="ref8 ref9">18, 19, 20</xref>
        ], we decided to evaluate the so-called learning trajectories of the models, by studying their
performance (using our probing tasks) across their training. In order to do that, we execute our probing
tasks at different checkpoints (i.e. points during training), acquiring information regarding how quickly
a property is learned, when it reaches its best performance, and how it evolves across time. Finally,
we are interested in “where” these language properties are encoded by NLMs. Therefore, we perform
an in-depth analysis of the learning trajectories of different layers, evaluating their similarities and
differences.
      </p>
      <p>The remainder of the paper is organized as follows. In Section 2, we describe the related work; in
Section 3 we present our probing tasks and the datasets we exploited; in Section 4 we explain the
methodology we followed; in Section 5 we show the experimental results obtained; finally, we draw
our conclusions and discuss possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The explanation of how Neural Language Models work and their knowledge has been the subject of
many works, following different points of view and methodologies [
        <xref ref-type="bibr" rid="ref10 ref11">21, 22</xref>
        ]. First, several white-box
approaches studied the self-attention weights of the model’s heads analysing whether they encode
meaningful relationships among words [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This line of work includes visualization techniques [
        <xref ref-type="bibr" rid="ref12">23</xref>
        ],
clustering [
        <xref ref-type="bibr" rid="ref13">24</xref>
        ] and categorization [
        <xref ref-type="bibr" rid="ref14 ref15">25, 26</xref>
        ]. Second, an interesting analysis has been conducted on
whether these models learned specific world knowledge in subjects like history, geography, etc. [8],
introducing benchmarks [
        <xref ref-type="bibr" rid="ref16">27</xref>
        ] and standard tests [
        <xref ref-type="bibr" rid="ref17">28</xref>
        ]. However, one of the most important lines of work
focused on syntactic and grammatical capabilities using probing tasks [11].
      </p>
      <p>
        A probe is a small feed-forward neural network that receives in input the embedded word (or
sentence) representations, generated by a NLM, and it is trained to solve a specific supervised task [
        <xref ref-type="bibr" rid="ref18">29</xref>
        ]
such as, for instance, whether a sentence contains a specific grammatical error or not. Probing has been
exploited to study different forms of syntactic properties [
        <xref ref-type="bibr" rid="ref19">9, 30</xref>
        ], whether the NLMs encode some forms
of dependency parsing [
        <xref ref-type="bibr" rid="ref20">31</xref>
        ] and temporal relations [12]. An interesting work more concerned with
semantic-based aspects is the one in [13], which focuses on how metaphors are recognised by several
pre-trained Neural Language Models across different datasets and languages. Another interesting
application is presented in [32, 33], in which the authors exploit probing tasks in order to visualize
gender bias in BERT word representations.
      </p>
      <p>
        Although in most studies probing tasks have been only applied to fully trained NLMs [
        <xref ref-type="bibr" rid="ref19">9, 12, 30</xref>
        ],
another line of work exploited them in order to understand how and when such models acquire these
capabilities in their training process. More specifically, Saphra and Lopez [18] analysed an LSTM-based
language model and discovered that several syntactic features (such as parts of speech) are learned in
the first stages of the training, whereas learning more complex aspects (such as topic-related knowledge)
needs more training steps. The work in [
        <xref ref-type="bibr" rid="ref8">19</xref>
        ] obtained similar results for the ALBERT model [34]. They also
performed a comparison among the learning processes of grammar and basic semantics (in particular,
coreference and semantic role labeling) and reported that they are very similar. In fact, these types of
knowledge are learned quite early in the training and they do not improve after the first steps. Focusing
on factual knowledge and common sense, the work in [
        <xref ref-type="bibr" rid="ref9">20</xref>
        ] analysed RoBERTa in terms of its learning
trajectories and found that this type of information is learned more in depth as the training progresses.
      </p>
      <p>In this paper, we perform a similar analysis. However, we compare simple grammar tasks with a
more refined grammar test, which combines several types of possible errors, and with a purely semantic
case study, namely understanding whether a sentence contains some humorous content.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Case Studies and Datasets</title>
      <p>The case studies approached in this work regard syntax, grammar and a purely semantic task. More
specifically, we created five different tasks, called Causative, Coordinate structures, Passive, Mix and
Humor, and we collected the respective datasets. Apart from the last one, our datasets were taken and
adapted from the BLiMP benchmark [35]. Please note that the Mix dataset includes four different types
of grammar concepts assembled in a single task. In Table 1 we show a positive and a negative sentence
for each task we considered.</p>
      <p>In the following, we explain more in-depth each task and the respective dataset.</p>
      <p>Causative dataset This dataset is made up of sentences that may contain syntactic errors in terms of
verb-object concordance. In particular, some verbs may be intransitive and therefore their use associated
with an object complement leads to a syntactic error. This dataset consists of 2000 sentences and the
labels are equally distributed (1000 correct and 1000 wrong).</p>
      <p>Coordinate structures dataset This dataset is made up of sentences that may contain structure errors,
undermining the coherence of the sentence. Although the sentence structure comprises several
components, all sentences in this dataset contain at most one error. This dataset contains 4000 sentences,
equally distributed between correct and wrong.</p>
      <p>Passive dataset This dataset is made up of sentences that may contain an error related to the use of
verbs in their (possibly non-existent) passive form, leading to compatibility errors between verb and
subject. Therefore, the goal of the Passive task is to identify whether or not a verb supports the passive
form. This dataset contains 4000 sentences with the labels equally distributed.</p>
      <p>Mix dataset This dataset has been made specifically for this paper and combines four different
types of grammatical errors, taken from the BLiMP benchmark [35]. Specifically, we consider
(i) sentences that may contain errors in the use of determiners paired with nouns (determiner-noun
agreement); (ii) sentences in which the verbs may not agree with nouns (verb-noun agreement); (iii)
sentences in which verbs may not agree with subjects with irregular plurals (irregular plural
subject-verb agreement); (iv) sentences which may contain errors in the agreement between subjects
and verbs with regular plurals (regular plural subject-verb agreement). The goal of this dataset is to
test the capability of a NLM to simply identify whether a sentence contains a grammatical error in
general, without a strong regularity among the positive and negative examples provided to the probe.
Therefore, the overall task should be more challenging. This dataset has been built with 4000 instances,
with equally distributed labels. The error types are sampled randomly.</p>
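      <p>To give a concrete idea of how such a mixed dataset can be assembled from BLiMP minimal pairs, we sketch below a possible construction using the Hugging Face datasets library; the phenomenon identifiers and helper names are illustrative assumptions and do not necessarily correspond to the exact procedure followed in this work.</p>
      <preformat>
# Sketch of building a Mix-style dataset from BLiMP [35] minimal pairs;
# the phenomenon names below are indicative, not necessarily the exact
# BLiMP configuration identifiers used for the Mix task.
import random
from datasets import load_dataset

PHENOMENA = [
    "determiner_noun_agreement_1",
    "distractor_agreement_relational_noun",
    "irregular_plural_subject_verb_agreement_1",
    "regular_plural_subject_verb_agreement_1",
]

def build_mix_dataset(num_instances=4000, seed=0):
    rng = random.Random(seed)
    examples = []
    for phenomenon in PHENOMENA:
        data = load_dataset("blimp", phenomenon, split="train")
        for pair in data:
            # each BLiMP pair provides a grammatical and an ungrammatical sentence
            examples.append((pair["sentence_good"], 1))
            examples.append((pair["sentence_bad"], 0))
    rng.shuffle(examples)          # error types are sampled randomly
    return examples[:num_instances]
      </preformat>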
      <p>Humor dataset This dataset consists of sentences that contain simple forms of humor (such as
puns and jokes), and sentences taken from titles of newspaper articles or extracted from
Wikipedia, and therefore with no humorous content. This task is purely semantic, and all sentences
are grammatically and syntactically correct. The dataset is taken from a Kaggle competition on
humor detection (https://www.kaggle.com/competitions/humor-detection/data) and contains 4000 sentences.
The labels are equally distributed between humorous and non-humorous sentences.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this section, we describe the overall procedure we employed to calculate and analyse the learning
trajectories of NLMs for the considered tasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Structuring and Evaluating the Probing Tasks</title>
        <p>Each of our probes consists of a feed-forward neural network whose goal is to learn, starting from
the embedded representations provided by a pre-trained NLM, one of the tasks described in Section
3. All tasks we considered are binary, i.e. they have only two classes. For the grammatical ones (Mix
dataset included), we assign label 1 to the correct sentences and 0 to the wrong ones. Similarly, for the
Humor task we assign label 1 to those sentences which contain some forms of humor and 0 to those
which do not.</p>
        <p>The probing is designed as follows. Considering a task, its dataset and a NLM, the probe receives
as input the d-dimensional embedded representation of a sentence s and has to correctly classify s.
The embedded representation of s is calculated by averaging all the embedding vectors of the tokens of
the sentence, following the procedure in [36]. For each task, we exploited the same neural network
structure. In particular, we used two hidden layers with 4 neurons (using ReLU as activation function)
and two output neurons with softmax activation function.</p>
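        <p>As an illustration, a minimal sketch of such a probe is the following (in PyTorch; the embedding_dim parameter and the helper for sentence averaging are assumptions for the sake of the example).</p>
        <preformat>
# Minimal sketch of the probe classifier described above (PyTorch);
# names such as embedding_dim are placeholders, not taken from the paper's code.
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        # two hidden layers with 4 neurons each (ReLU) and a 2-class softmax output;
        # the probe is trained with a cross-entropy objective on these probabilities
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 4),
            nn.ReLU(),
            nn.Linear(4, 4),
            nn.ReLU(),
            nn.Linear(4, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, sentence_embedding):
        return self.net(sentence_embedding)

def sentence_embedding(token_embeddings):
    # average all token embedding vectors of the sentence, following [36]
    return token_embeddings.mean(dim=0)
        </preformat>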
        <p>Given that a probe is basically a neural network classifier, its performance could be evaluated in terms
of accuracy. A high accuracy, considering that the input is just the embedding vectors provided by
the pre-trained NLM, should mean that those representations correctly encode information regarding
the task we analysed. However, studies such as [15, 16] demonstrated that accuracy is not a reliable
metric for these analyses. In fact, the authors of [15] show how probe classifiers achieve very high
performance, if trained with high quantities of examples, even using totally random data as input. This
is due to the strong capability of the neural network to find some patterns even in random data. For
the same reason, perturbing the labels to create meaningless control tasks leads to good results
with probes trained with an adequate number of examples [16]. In both cases, a significant decrease
in performance over these random control tasks can be seen only by training the probes with small
datasets.</p>
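        <p>A randomly perturbed control task of the kind mentioned above can be built very simply; a minimal sketch, assuming a list of (embedding, label) pairs, is the following.</p>
        <preformat>
# Minimal sketch of a perturbed-label control task: the probe is trained
# on randomly re-assigned labels; a high accuracy here signals that accuracy
# alone is not a reliable measure of what the NLM encodes.
import random

def make_control_task(examples, num_classes=2, seed=0):
    """examples: list of (sentence_embedding, label) pairs."""
    rng = random.Random(seed)
    return [(embedding, rng.randrange(num_classes)) for embedding, _ in examples]
        </preformat>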
        <p>In order to solve these evaluation issues, Voita and Titov [17] introduced MDL (Minimum Description
Length), a method based on information theory for measuring the knowledge and capabilities of NLMs
through probing tasks. The authors of [17] demonstrated that MDL is very robust with respect to
control tasks, random seeds, datasets and probe characteristics. The main idea behind MDL is measuring
not only the performance of the probe network but also its effort, in terms of the quantity of data necessary
to obtain such a performance.</p>
        <p>Inspired by the work by Aghazadeh et al. [13], we used the online coding version of MDL, which
works as follows. Instead of training a probe just once using the entire training set, the method first
divides the training set into K portions of increasing size. Next, K − 1 neural networks (all with the
same hyperparameters, and all starting from the same randomly initialized weights) are trained, each
one with a different portion. The first is trained with the first portion, the second with the second
portion (which also includes the instances of the first one), and so on. The evaluation is conducted in
terms of cross-entropy over a validation set, which basically consists of the “new instances” from the
next portion. Therefore, a neural network trained on the i-th portion is evaluated by calculating the
cross-entropy over the instances in the next portion, excluding the ones used for training. Following
[17], the portions consist of 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.25, 12.5, 25, 50, and 100% of the data.</p>
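        <p>A minimal sketch of this online-coding procedure is shown below; train_probe and block_cross_entropy are hypothetical helpers standing in for the actual probe training and evaluation code, and the fractions are those listed above.</p>
        <preformat>
# Sketch of the online-coding training loop [17]; train_probe and
# block_cross_entropy are hypothetical helpers standing in for the actual
# probe training and evaluation code.
FRACTIONS = [0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.0625, 0.125, 0.25, 0.5, 1.0]

def online_coding_cross_entropies(dataset, train_probe, block_cross_entropy):
    n = len(dataset)
    cuts = [max(1, int(round(f * n))) for f in FRACTIONS]
    cross_entropies = []
    for i in range(len(cuts) - 1):
        # every probe shares the same hyperparameters and the same random
        # initialisation; it is trained on the first cuts[i] instances only
        probe = train_probe(dataset[:cuts[i]])
        # and evaluated (total cross-entropy over the block, in bits) on the
        # new instances of the next portion, i.e. those not used for training
        cross_entropies.append(block_cross_entropy(probe, dataset[cuts[i]:cuts[i + 1]]))
    first_portion_size = cuts[0]
    return first_portion_size, cross_entropies
        </preformat>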
        <p>This process is evaluated in terms of a metric called codelength, which intuitively can be seen as the
sum of all the cross-entropies obtained during the process. More formally, it is defined as:
codelength = |D_0| · log2(C) + Σ_i H_i
where the sum runs over the K − 1 trained probes, |D_0| is the size of the first portion, C is the number of classes of the probing task, and H_i
is the cross-entropy calculated, for the probe trained up to the i-th portion, over the instances of the next portion.</p>
        <p>Given that the codelength metric depends on the size of the training set, in [17] the authors propose
another, more general, metric called compression, which is defined as:
compression = (|D| · log2(C)) / codelength
where |D| is the size of the training set. Since in our experiments C = 2 (and hence log2(C) = 1), we can finally define the
compression metric as:
compression = |D| / (|D_0| + Σ_i H_i)
In Section 5, almost all the results we obtained will be shown and explained in terms of this metric.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Trajectory and Layer Analysis</title>
        <p>
          The probing task technique described above is often used only on a fully trained NLM to test its
capabilities [
          <xref ref-type="bibr" rid="ref19">9, 12, 30</xref>
          ]. However, since we aim to gain insight into how each
language property is acquired by the NLMs, we extend this analysis by applying the probing approach at different stages
of the model's training [
          <xref ref-type="bibr" rid="ref8 ref9">19, 20</xref>
          ]. By examining different training steps of the model, it is possible to
trace its learning trajectory over a particular aspect, such as the tasks we described in Section 3.
        </p>
        <p>Therefore, in our analysis, we perform the probing task on all the models’ checkpoints. A checkpoint
is an intermediate version of the model saved at a particular time during the training process. Although
many strategies to save a checkpoint can be adopted (such as saving every time a fixed amount of time
has passed, every epoch, etc.) the checkpoints we consider are saved after the model has processed a
specific number of tokens. More information regarding this aspect is provided in Section 5. Probing over
the different checkpoints does not require particular expedients. In fact, it consists of training the probing
tasks repeatedly over each checkpoint and evaluating its performance in terms of compression. By
measuring how this metric changes over the checkpoints, it is possible to obtain the learning trajectory
for that particular task.</p>
        <p>Another fundamental aspect we consider in our study is analysing probing tasks in different parts
of the NLM architecture and, in particular, among its layers. Thus, we execute and evaluate the
probing tasks not only on all the model’s checkpoints, but also on the different layers of each checkpoint. In
order to do that, we simply repeat the procedure explained above changing the probe input: for testing
the last layer, we provide to the probe the embedded representation calculated by the last layer of the
architecture, for testing the penultimate layer we provide the one calculated by the penultimate, etc.
Therefore, considering a NLM with L layers from which C checkpoints were saved, we execute the
probing task L × C times. This way, we obtain the learning trajectory for a specific task for each layer.</p>
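        <p>Schematically, the whole trajectory-and-layer analysis can be summarised as in the following sketch, where load_checkpoint, layer_sentence_embeddings and run_probing_task are hypothetical helpers wrapping the steps described above.</p>
        <preformat>
# Schematic sketch of the trajectory-and-layer analysis; load_checkpoint,
# layer_sentence_embeddings and run_probing_task are hypothetical helpers.
def layer_trajectories(checkpoints, num_layers, dataset, labels,
                       load_checkpoint, layer_sentence_embeddings, run_probing_task):
    # trajectories[layer] is the list of compression values over the checkpoints
    trajectories = {layer: [] for layer in range(num_layers)}
    for step in checkpoints:               # e.g. 0, 1, 2, ..., every 1000 steps
        model = load_checkpoint(step)      # intermediate version of the NLM
        for layer in range(num_layers):
            # average the token embeddings produced by this layer for each sentence
            embeddings = layer_sentence_embeddings(model, layer, dataset)
            trajectories[layer].append(run_probing_task(embeddings, labels))
    return trajectories
        </preformat>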
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>The experimental evaluation we conducted considers three GPT-NeoX models belonging to the Pythia
benchmark suite [37]. Pythia is a collection of publicly released models of various sizes based on the
GPT-NeoX architecture [38]. All the models are trained over the same dataset (The Pile [39]) for almost
300B tokens. For each model, Pythia provides 154 checkpoints. The checkpoints are saved after 0 (i.e.
with the model weights initialized randomly), 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1000 training steps and,
after the 1000th step, every subsequent 1000 steps. For all three models, each step consists of training
the model on approximately 2M additional tokens. From this collection we consider the models with 70M, 160M
and 410M parameters. These models have a different number of layers: 6, 12 and 24, respectively.</p>
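      <p>As an example of how such checkpoints can be accessed, the Pythia models publish their intermediate training steps as model revisions on the Hugging Face Hub; the following sketch (with an arbitrary model size and step number) loads one checkpoint and extracts the per-layer hidden states from which the averaged sentence representations can be computed.</p>
      <preformat>
# Sketch of loading one Pythia checkpoint and extracting per-layer hidden states
# with the Hugging Face transformers library; model size and step are examples.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/pythia-160m"
revision = "step3000"   # checkpoints are published as revisions step0, step1, ...

tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
model = AutoModel.from_pretrained(model_name, revision=revision,
                                  output_hidden_states=True)

inputs = tokenizer("The cat chased the ball.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple with one tensor per layer (plus the embeddings);
# averaging over the token dimension gives the sentence representation for a layer
sentence_vectors = [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]
      </preformat>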
      <p>For our layer analysis, we did not consider the 70M model due to its low number of layers.
Considering 160M and 410M, in order to conduct a clearer and more meaningful comparison
between two models with a different number of layers, we decided to group the layers into zones based
on their order in the GPT stack. More specifically, starting from the layers closer to the input, we select
three groups (a minimal sketch of this grouping is shown after the list):
• the Bottom Layers, which consist of the first third of the layers (e.g., the first 4 layers for the 12-layer
model), excluding the embedding layer;
• the Middle Layers, which consist of the middle third (e.g., layers 5 to 7 included for the 12-layer
model);
• the Top Layers, which consist of the last third, excluding the last layer, which we treat and visualise
separately (e.g., layers 8 to 11 for the 12-layer model).</p>
      <p>As explained in Section 4.1, all the experiments reported in this section are in terms of the compression
metric calculated by the MDL method [17]. This choice is not only coherent with the literature [15, 16],
but it also helps in understanding learning trajectories and differences among layers. In fact, accuracy may
not show significant changes across different configurations. For instance, considering the 160M model
and the Coordinate structures task at the final checkpoint, all the layers have a very high accuracy
(97.55 on average, with a standard deviation of 0.02) but a compression that ranges from 2.1 to 10.
Similarly, accuracy can reach very high values even in the very early stages of the training. Considering
the same model for the Coordinate structures task, after 512 steps the probe trained on the last layer
obtains an accuracy of 95.25. However, at the same step the compression is quite low (3.77), indicating
that the concept is not totally contained in the representation.</p>
      <p>[Figure 1: Learning trajectories (compression over the number of training tokens, from 0B to 293B) of the considered models for the Humor, Coordinate structures, Passive, Causative and Mix tasks.]</p>
      <sec id="sec-5-1">
        <title>5.1. Learning Trajectories and Task Comparison</title>
        <p>The learning trajectories of the GPT-NeoX models, considering their last layer, for all the five probing
tasks we considered are available in Figure 1. As expected, 410M is the best performing model, with
higher compression with respect to 160M and 70M, which is the smallest and worst performing model.</p>
        <p>Comparing the different tasks we analysed, the highest values of compression are generally obtained
by the Humor task. For 70M, there is in fact a notable difference across the whole learning trajectory
(a maximum of 7.5 for Humor versus 4.8 for Coordinate structures, 4.4 for Passive, 2.9 for Causative
and 1.5 for Mix). Considering 160M, the performance obtained for Humor is very similar to that
obtained for Coordinate structures and Passive until the last part of the training process. The Causative
task has intermediate results in all the models we considered, reaching a maximum compression value
of 4.4 for 410M. This is probably due to its smaller dataset, which has only 2000 instances with
respect to the 4000 instances of the other tasks.</p>
        <p>The Mix task exhibits a completely different behaviour. In fact, it obtains very low values in all the
models, not even reaching a compression of 2 for the most powerful model (410M), whereas
Causative exceeds 4 and the other tasks can even exceed 7. Moreover, the learning trajectory is basically
flat, showing no visible improvement during the learning process. This is probably due to the complexity
of the dataset. In fact, the Mix task is composed of four different simple tasks, and all models struggle
to identify whether a sentence contains an error without focusing on a single aspect and without knowing
exactly which error it is. For all the learning trajectories, we can see that most of the knowledge
is acquired in the very early stage of the training (before the threshold of 41B tokens), with no
noteworthy improvement afterwards.</p>
        <p>
          In Figure 1, looking at the trajectories of the 70M and 160M models, we can see that, at some point
(about 150-180B tokens), the performance of all tasks except Mix decreases significantly. A possible
explanation of this phenomenon is that, after extensive training, the last layer mostly focuses on Masked
Language Modeling task which is typically used for training the NLM. Therefore, the layer probably
“forgets” relevant linguistic information. Although this phenomenon has previously been observed
in [
          <xref ref-type="bibr" rid="ref10">21, 40</xref>
          ], a more in-depth analysis of this aspect is required. In particular, it is important to note
that the 410M model does not show a performance decay in any of the tasks analyzed; instead, its
learning trajectories are mostly stable. We speculate that this may be due to its size (nearly 2.5 times the
parameters of the 160M model) and to a better management of the knowledge among its layers. However,
we cannot exclude that other tasks may show a performance decay, or that continuing the training would
lead to a similar decrease in terms of compression.
        </p>
        <p>[Figure 2: Median learning trajectories (compression over training tokens) of the Bottom, Middle and Top layer groups of the 160M model for the Humor, Coordinate structures, Passive, Causative and Mix tasks.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Layer Comparison</title>
        <p>The next experiment we conducted regards the study of the learning trajectory considering different
layers of the NLMs. As for the previous experiments, we considered the same five probing tasks.</p>
        <p>The results are available in Figure 2. For brevity’s sake, we report only the results obtained for the
160M model, but similar observations can be made for 70M and 410M. As explained in Section 4.2,
we divided the 12 layers of the model into three groups: Bottom (on the left), which refers to the lowest
4 layers of the architecture, Middle (in the center), which refers to the subsequent 3 layers, and Top (on
the right), which refers to the highest 4 layers, excluding the last one. In Figure 2 we report the median
compression of the three groups across all the considered tasks.</p>
        <p>Generally, the best results are obtained by the Middle layers, with just two exceptions: the Causative
task, which has a slightly higher compression considering the Top layers, and the Mix task which,
again, obtains very poor results with no notable differences among the three groups. The Top layers,
however, obtain clearly better compression than the Bottom ones, especially for Coordinate
structures and Passive (with values ranging from 7 to 9 for Top and mostly from 3 to 4 for
Bottom). Instead, Humor reaches very high values of compression in all three cases.</p>
        <p>A more detailed look at the Passive and Humor tasks is given in Figure 3 for both the 160M and
410M models. We did not include 70M in this analysis due to its low number of layers, which limits
the significance of the comparison. In particular, the plots show the performance of the Bottom (in blue),
Middle (in orange) and Top (in green) layers and of the last layer of the model. Besides representing the
median value of each group of layers (the line at the center of each area), we also represent the area
between the 1st and 3rd quartile of the compression distribution obtained by the layers. This way, we
show the variability of the performance among the different layers in each group.</p>
        <p>In Figure 3, we can see that the 160M model has a low variability for all the layer groups, and
especially for Top and Middle considering the Humor task. The highest variability can be seen for
the Bottom layers in the Passive task. Instead, 410M presents a higher variability, especially for the
Bottom layers. Specifically, the Passive area for the Bottom layers ranges approximately from 2.5 to 7.</p>
        <p>[Figure 3: Learning trajectories (compression over training tokens, from 0B to 293B) of the Bottom, Middle, Top and last layers of the 160M and 410M models for the Passive and Humor tasks.]</p>
        <p>The lowest results are obtained by the first layer and the highest are obtained by layer 7. Nonetheless,
it is important to point out that, despite a higher number of layers with respect to 160M, the Top
layers of 410M perform very well in the Humor task, with very high compression and very low
variability. This indicates that basically all layers encode this type of knowledge very efficiently and in
a very similar way. Analysing the trajectories, we can observe a slight decay of performance for the Passive
task, considering the Middle and Top layers, and for Humor considering only the Top layers. These decays
are not present in 160M. The last layer (in red in Figure 3) presents the same behaviour we observed in
Figure 1 and described in Section 5.1, with a very evident decrease of performance after about 167B
tokens for the 160M model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>We have investigated and compared the learning trajectories of three GPT-NeoX models of different
sizes considering different probing tasks: three specific grammar tasks (Causative, Coordinate structures
and Passive), a generic grammatical correctness task (Mix) and a purely semantic one (Humor).</p>
      <p>
        Our evaluation, conducted in terms of compression exploiting the MDL method, shows that the
considered models acquire knowledge both related to grammar and related to a specific aspect of
semantics, with very high performance for Coordinate structures, Passive and Humor. However, the Mix
task shows a very low compression which denotes a limited capability of a NLM to discern whether a
sentence is correct or not in general and without focusing on a single, specific aspect. The learning
trajectories we analysed showed that most of this knowledge is acquired early and, for the most part, the
compression is stable or (only for the last layer) even decreasing. The performance decay phenomenon
is probably due to the specialisation of the last layer in the Masked Language Modeling task for which
it is trained [
        <xref ref-type="bibr" rid="ref10">21, 40</xref>
        ].
      </p>
      <p>Moreover, we have analysed the different layers of the considered models, grouping them into three
groups: the Bottom layers (the third closest to the input), the Middle layers and the Top ones. Generally,
the best results are obtained by the Middle layers, whereas the Bottom layers provide the worst results,
especially for the grammar tasks.</p>
      <p>As future work, we want to explore the performance decrease of the last layer with more tasks and
to verify possible explanations. Moreover, this kind of analysis requires access to open-source models which
provide the embedded representations of words and sentences. An important development would be to
devise similar procedures to analyse closed-source models.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partly funded by Regione Lombardia through the initiative "Programma degli
interventi per la ripresa economica: sviluppo di nuovi accordi di collaborazione con le università per la
ricerca, l’innovazione e il trasferimento tecnologico" - DGR n. XI/4445/2021.</p>
      <p>[8] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language models as
knowledge bases?, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics,
Hong Kong, China, 2019, pp. 2463–2473. URL: https://aclanthology.org/D19-1250. doi:10.18653/
v1/D19-1250.
[9] A. Miaschi, D. Brunato, F. Dell’Orletta, G. Venturi, Linguistic profiling of a neural language
model, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on
Computational Linguistics, International Committee on Computational Linguistics, Barcelona,
Spain (Online), 2020, pp. 745–756. URL: https://aclanthology.org/2020.coling-main.65. doi:10.
18653/v1/2020.coling-main.65.
[10] A. Köhn, What’s in an embedding? analyzing word embeddings through multilingual evaluation,
in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon,
Portugal, 2015, pp. 2067–2073. URL: https://aclanthology.org/D15-1246. doi:10.18653/v1/D15-1246.
[11] Y. Belinkov, Probing classifiers: Promises, shortcomings, and advances, Computational Linguistics
48 (2022) 207–219. URL: https://aclanthology.org/2022.cl-1.7. doi:10.1162/coli_a_00422.
[12] T. Caselli, I. Dini, F. Dell’Orletta, How about time? probing a multilingual language model for
temporal relations, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi,
P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He,
T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference
on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju,
Republic of Korea, 2022, pp. 3197–3209. URL: https://aclanthology.org/2022.coling-1.283.
[13] E. Aghazadeh, M. Fayyaz, Y. Yaghoobzadeh, Metaphors in pre-trained language models: Probing
and generalization across datasets and languages, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 2037–2050.</p>
      <p>URL: https://aclanthology.org/2022.acl-long.144. doi:10.18653/v1/2022.acl-long.144.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.
doi:10.18653/v1/N19-1423.
[15] K. Zhang, S. Bowman, Language modeling teaches you more than translation does: Lessons
learned through auxiliary syntactic task analysis, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.),
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 359–361.</p>
      <p>URL: https://aclanthology.org/W18-5448. doi:10.18653/v1/W18-5448.
[16] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: K. Inui, J. Jiang, V. Ng,
X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLPIJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2733–2743. URL:
https://aclanthology.org/D19-1275. doi:10.18653/v1/D19-1275.
[17] E. Voita, I. Titov, Information-theoretic probing with minimum description length, in: B. Webber,
T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 183–
196. URL: https://aclanthology.org/2020.emnlp-main.14. doi:10.18653/v1/2020.emnlp-main.
14.
[18] N. Saphra, A. Lopez, Understanding learning dynamics of language models with SVCCA, in:
J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota,
2019, pp. 4129–4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[32] M. Dusi, N. Arici, A. E. Gerevini, L. Putelli, I. Serina, Graphical identification of gender bias in BERT
with a weakly supervised approach, in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings
of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located
with 21th International Conference of the Italian Association for Artificial Intelligence (AI*IA
2022), Udine, November 30th, 2022, volume 3287 of CEUR Workshop Proceedings, CEUR-WS.org,
2022, pp. 164–176.
[33] M. Dusi, N. Arici, A. E. Gerevini, L. Putelli, I. Serina, Discrimination bias detection through
categorical association in pre-trained language models, IEEE Access (2024) 1–1. doi:10.1109/
ACCESS.2024.3482010.
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for
selfsupervised learning of language representations, in: 8th International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL:
https://openreview.net/forum?id=H1eA7AEtvS.
[35] A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S. Wang, S. R. Bowman, Blimp: The
benchmark of linguistic minimal pairs for english, Trans. Assoc. Comput. Linguistics 8 (2020)
377–392. URL: https://doi.org/10.1162/tacl_a_00321. doi:10.1162/TACL\_A\_00321.
[36] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in:
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3982–3992. URL: https://aclanthology.org/D19-1410. doi:10.18653/v1/D19-1410.
[37] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan,
S. Purohit, U. S. Prashanth, E. Raf, A. Skowron, L. Sutawika, O. van der Wal, Pythia: A suite for
analyzing large language models across training and scaling, in: A. Krause, E. Brunskill, K. Cho,
B. Engelhardt, S. Sabato, J. Scarlett (Eds.), International Conference on Machine Learning, ICML
2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning
Research, PMLR, 2023, pp. 2397–2430.
[38] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell,
J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, S. Weinbach,
GPTNeoX-20B: An open-source autoregressive language model, in: Proceedings of the ACL Workshop
on Challenges &amp; Perspectives in Creating Large Language Models, 2022. URL: https://arxiv.org/
abs/2204.06745.
[39] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite,
N. Nabeshima, S. Presser, C. Leahy, The pile: An 800gb dataset of diverse text for language modeling,
CoRR abs/2101.00027 (2021). URL: https://arxiv.org/abs/2101.00027. arXiv:2101.00027.
[40] J. Wallat, J. Singh, A. Anand, BERTnesia: Investigating the capture and forgetting of knowledge in
BERT, in: A. Alishahi, Y. Belinkov, G. Chrupała, D. Hupkes, Y. Pinter, H. Sajjad (Eds.), Proceedings
of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP,
Association for Computational Linguistics, Online, 2020, pp. 174–183. URL: https://aclanthology.
org/2020.blackboxnlp-1.17. doi:10.18653/v1/2020.blackboxnlp-1.17.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pellungrini</surname>
          </string-name>
          , E. Purificato, G. Semeraro,
          <string-name>
            <given-names>M.</given-names>
            <surname>Setzu</surname>
          </string-name>
          , XAI.it
          <year>2024</year>
          :
          <article-title>An Overview on the Future of Explainable AI in the era of Large Language Models</article-title>
          ,
          <source>in: Proceedings of 5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence</source>
          , Bolzano, Italy,
          <source>November 25-28</source>
          ,
          <year>2024</year>
          , CEUR. org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9</source>
          ,
          <year>2017</year>
          , Long Beach, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-
          <issue>12</issue>
          ,
          <year>2020</year>
          , virtual,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Gerevini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Olivato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Putelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sigalini</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Serina</surname>
          </string-name>
          ,
          <article-title>Real-world implementation and integration of an automatic scoring system for workplace safety courses in italian</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <article-title>268</article-title>
          . URL: https://doi.org/10.3390/fi15080268. doi:
          <volume>10</volume>
          .3390/FI15080268.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . doi:
          <volume>10</volume>
          .1145/3442188.3445922.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (</article-title>
          <year>2019</year>
          )
          <fpage>49</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/Q19-1004. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00254</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>K.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Khandelwal</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name>,
          <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>,
          <article-title>What does BERT look at? An analysis of BERT's attention</article-title>,
          in: T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes (Eds.),
          <source>Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>, Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>, pp. <fpage>276</fpage>-<lpage>286</lpage>. URL: https://aclanthology.org/W19-4828. doi:10.18653/v1/W19-4828.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [19]
          <string-name>
            <surname>C.-H. Chiang</surname>
            ,
            <given-names>S.-F.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
          </string-name>
          , H.-y. Lee,
          <article-title>Pretrained language model embryology: The birth of ALBERT, in: B</article-title>
          .
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>6813</fpage>
          -
          <lpage>6828</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .emnlp-main.
          <volume>553</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .emnlp-main.
          <volume>553</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [20]
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kasai</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <string-name><given-names>N. A.</given-names> <surname>Smith</surname></string-name>,
          <article-title>Probing across time: What does RoBERTa know and when?</article-title>,
          in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2021</source>, Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>, pp. <fpage>820</fpage>-<lpage>842</lpage>. URL: https://aclanthology.org/2021.findings-emnlp.71. doi:10.18653/v1/2021.findings-emnlp.71.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [21]
          <string-name><given-names>A.</given-names> <surname>Rogers</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Kovaleva</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rumshisky</surname></string-name>,
          <article-title>A primer in BERTology: What we know about how BERT works</article-title>,
          <source>Transactions of the Association for Computational Linguistics</source> <volume>8</volume> (<year>2020</year>) <fpage>842</fpage>-<lpage>866</lpage>. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [22]
          <string-name><given-names>H.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Deng</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Cai</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Yin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Du</surname></string-name>,
          <article-title>Explainability for large language models: A survey</article-title>,
          <source>ACM Trans. Intell. Syst. Technol.</source> <volume>15</volume> (<year>2024</year>). URL: https://doi.org/10.1145/3639372. doi:10.1145/3639372.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [23]
          <string-name><given-names>J.</given-names> <surname>Vig</surname></string-name>,
          <article-title>A multiscale visualization of attention in the transformer model</article-title>,
          in: <source>Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations</source>, Association for Computational Linguistics,
          <year>2019</year>, pp. <fpage>37</fpage>-<lpage>42</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [24]
          <string-name><given-names>Y.</given-names> <surname>Guan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Leng</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Guo</surname></string-name>,
          <article-title>How far does BERT look at: Distance-based clustering and analysis of BERT's attention</article-title>,
          in: <source>Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020</source>, International Committee on Computational Linguistics,
          <year>2020</year>, pp. <fpage>3853</fpage>-<lpage>3860</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [25]
          <string-name><given-names>L.</given-names> <surname>Serina</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Putelli</surname></string-name>,
          <string-name><given-names>A. E.</given-names> <surname>Gerevini</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Serina</surname></string-name>,
          <article-title>Synonyms, antonyms and factual knowledge in BERT heads</article-title>,
          <source>Future Internet</source> <volume>15</volume> (<year>2023</year>) <fpage>230</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [26]
          <string-name><given-names>L.</given-names> <surname>Putelli</surname></string-name>,
          <string-name><given-names>A. E.</given-names> <surname>Gerevini</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lavelli</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mehmood</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Serina</surname></string-name>,
          <article-title>On the behaviour of BERT's attention for the classification of medical reports</article-title>,
          in: C. Musto, R. Guidotti, A. Monreale, G. Semeraro (Eds.),
          <source>Proceedings of the 3rd Italian Workshop on Explainable Artificial Intelligence co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AIxIA 2022), Udine, Italy, November 28 - December 3, 2022</source>, volume <volume>3277</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>, pp. <fpage>16</fpage>-<lpage>30</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [27]
          <string-name><given-names>H.</given-names> <surname>ElSahar</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Vougiouklis</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Remaci</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Gravier</surname></string-name>,
          <string-name><given-names>J. S.</given-names> <surname>Hare</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Laforest</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Simperl</surname></string-name>,
          <article-title>T-REx: A large scale alignment of natural language with knowledge base triples</article-title>,
          in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.),
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018</source>, European Language Resources Association (ELRA),
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [28]
          <string-name><given-names>Z.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>F. F.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Araki</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Neubig</surname></string-name>,
          <article-title>How can we know what language models know?</article-title>,
          <source>Trans. Assoc. Comput. Linguistics</source> <volume>8</volume> (<year>2020</year>) <fpage>423</fpage>-<lpage>438</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [29]
          <string-name><given-names>A.</given-names> <surname>Gupta</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Boleda</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Baroni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Padó</surname></string-name>,
          <article-title>Distributional vectors encode referential attributes</article-title>,
          in: L. Màrquez, C. Callison-Burch, J. Su (Eds.),
          <source>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>, Association for Computational Linguistics, Lisbon, Portugal,
          <year>2015</year>, pp. <fpage>12</fpage>-<lpage>21</lpage>. URL: https://aclanthology.org/D15-1002. doi:10.18653/v1/D15-1002.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [30]
          <string-name><given-names>G.</given-names> <surname>Jawahar</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Sagot</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Seddah</surname></string-name>,
          <article-title>What does BERT learn about the structure of language?</article-title>,
          in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>, Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>, pp. <fpage>3651</fpage>-<lpage>3657</lpage>. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [31]
          <string-name><given-names>J.</given-names> <surname>Hewitt</surname></string-name>,
          <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>,
          <article-title>A structural probe for finding syntax in word representations</article-title>,
          in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>, Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>, pp. <fpage>3257</fpage>-<lpage>3267</lpage>. URL: https://aclanthology.org/N19-1329. doi:10.18653/v1/N19-1329.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>