<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.</p>
      </abstract>
      <kwd-group>
        <kwd>syntactic information</kwd>
        <kwd>synthetic structured data</kwd>
        <kwd>multi-lingual</kwd>
        <kwd>cross-lingual</kwd>
        <kwd>diagnostic studies of deep learning models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Large language models, trained on huge amount of texts,</title>
        <p>have reached a level of performance that rivals human
capabilities on a range of established benchmarks [1].</p>
      </sec>
      <sec id="sec-1-2">
        <title>Despite high performance on high-level language pro</title>
        <p>cessing tasks, it is not yet clear what kind of information
these language models encode, and how. For example,
transformer-based pretrained models have shown
excellent performance in tasks that seem to require that the
model encodes syntactic information [2].</p>
      </sec>
      <sec id="sec-1-3">
        <title>All the knowledge that the LLMs encode comes from</title>
        <p>unstructured texts and the shallow regularities they are
very good at detecting, and which they are able to
leverage into information that correlates to higher structures
in language. Most notably, [3] have shown that from the
unstructured textual input, BERT [4] is able to infer POS,
structural, entity-related, syntactic and semantic
information at successively higher layers of the architecture,
mirroring the classical NLP pipeline [5]. We ask: How is
this information encoded in the output layer of the model,
i.e. the embeddings? Does it rely on surface information
– such as inflections, function words – and is assembled
on the demands of the task/probes [6], or does it indeed
reflect something deeper that the language model has
assembled through the progressive transformation of the
input through its many layers?</p>
        <p>To investigate this question, we use a seemingly simple
task – subject-verb agreement. Subject-verb agreement
is often used to test the syntactic abilities of deep neural
networks [7, 8, 9, 10], because, while apparently simple
and linear, it is in fact structurally, and theoretically,
complex, and requires connecting the subject and the verb
across arbitrarily long or complex structural distance.</p>
      </sec>
      <sec id="sec-1-4">
        <title>It has an added useful dimension – it relies on syntactic structure and grammatical number information that many languages share.</title>
      </sec>
      <sec id="sec-1-5">
        <title>In previous work we have shown that simple struc</title>
      </sec>
      <sec id="sec-1-6">
        <title>Context</title>
        <p>PP1-sg
PP1-sg
PP1-pl
PP1-pl
PP1-sg
PP1-sg
PP1-pl
tural information – the chunk structure of a sentence –
which can be leveraged to determine subject-verb agree- 1 NP-sg VP-sg
ment, or to contribute towards more semantic tasks, can 23 NNPP--spgl VVPP--spgl
be detected in the sentence embeddings obtained from 4 NP-pl VP-pl
a pre-trained model [11]. This result, though, does not 5 NP-sg PP2-sg VP-sg
cast light on whether the discovered structure is deeper 6 NP-pl PP2-sg VP-pl
and more abstract, or it is rather just a reflection of sur- 7 NP-sg PP2-sg VP-sg
face indicators, such as function words or morphological 8 ???
markers. Answers</p>
        <p>To tease apart these two options, we set up an experi- 1 NP-pl PP1-pl PP2-sg VP-pl Correct
ment covering four languages: English, French, Italian 2 NP-pl PP1-pl et PP2-sg VP-pl Coord
and Romanian. These languages, while diferent, have 3 NP-pl PP1-pl VP-pl WNA
shared properties that make sharing of syntactic structure 4 NP-pl PP1-sg PP1-sg VP-pl WN1
a reasonable expectation, if the pretrained multilingual 5 NP-pl PP1-pl PP2-pl VP-pl WN2
model does indeed discover and encode syntactic struc- 6 NP-pl PP1-pl PP2-pl VP-sg AEV
ture. We use parallel datasets in the four languages, built 78 NNPP--ppll PPPP11--spgl PPPP22--spgl VVPP--ssgg AAEENN12
by (approximately) translating the BLM-AgrF dataset
[12], a multiple-choice linguistic test inspired from the
Raven Progressive Matrices visual intelligence test, previ- Figure 1: BLM instances for verb-subject agreement, with
ously used to explore subject-verb agreement in French. two attractors. The errors can be grouped in two types:</p>
        <p>Our work ofers two contributions: (i) four parallel (i) sequence errors: WNA= wrong nr. of attractors; WN1=
datasets – on English, French, Italian and Romanian, fo- wrong gram. nr. for 1 attractor noun (N1); WN2= wrong
cmuusletdilionngusualbtjeesctti-nvgerobfaagmreuelmtielinntg;u(iail) pcrreotsrsa-ilninegdumaloadnedl, AgNrE1aV;mA=.aEngNrr.2efe=omarge2rnetemearertotnrrtaocetnrorotrhrneoonvuenNrb(2N;.A2E);N(i1i)=gargarmeemmaetnictaelrerorrroorns:
to explore the degree to which syntactic structure
information is shared across diferent languages. Our
crosslingual and multilingual experiments show poor transfer
across languages, even those most related, like Italian
and French. This result indicates that pretrained
models encode syntactic information based on shallow and
language-specific clues, from which they are not yet able
to take the step towards abstracting grammatical
structure. The datasets are available at https://www.idiap.ch
/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the
code at https://github.com/CLCL-Geneva/BLM-SNFDise
ntangling.</p>
      </sec>
      <sec id="sec-1-7">
        <title>Such an approach can be very useful for probing lan</title>
        <p>guage models, as it allows to test whether they indeed
detect the relevant linguistic objects and their properties,
and whether (or to what degree) they use this
information to find larger patterns. We have developed BLMs
as a linguistic test. Figure 1 illustrates the template of a</p>
      </sec>
      <sec id="sec-1-8">
        <title>BLM subject-verb agreement matrix, with the diferent</title>
        <p>linguistic objects – chunks/phrases – and their relevant
properties, in this case grammatical number. Examples
in all languages under investigation are provided in
Appendix B.</p>
      </sec>
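The BLM template described above can be made concrete with a small sketch. This is an illustration only: the pattern notation (np-sg, pp1-sg, vp-sg), the progression rule, and the two error generators below are simplified stand-ins for the dataset's actual generation rules.

```python
# Illustrative sketch of a BLM subject-verb agreement instance, using
# chunk-pattern strings as stand-ins for real sentences. The progression
# rule and error generators are simplified assumptions, not the dataset's.

def make_context():
    """Seven context patterns: subject number alternates, attractors
    (pp1, pp2) are added progressively, and the verb always agrees
    with the subject np."""
    context = []
    for i in range(7):
        n_attractors = 1 if i < 4 else 2
        subj = "sg" if i % 2 == 0 else "pl"
        chunks = [f"np-{subj}"]
        for a in range(n_attractors):
            chunks.append(f"pp{a + 1}-{'sg' if (i >> a) % 2 else 'pl'}")
        chunks.append(f"vp-{subj}")  # agreement: verb copies subject number
        context.append(" ".join(chunks))
    return context

def agrees(pattern):
    """Grammatical check: the verb's number must match the subject's."""
    chunks = pattern.split()
    return chunks[0].split("-")[1] == chunks[-1].split("-")[1]

def answer_set(correct):
    """Two of the contrastive answer types: AEV (agreement error on the
    verb, grammatically wrong) and WNA (wrong number of attractors,
    grammatical but not the right continuation of the sequence)."""
    chunks = correct.split()
    flipped = "vp-pl" if chunks[-1] == "vp-sg" else "vp-sg"
    return {"correct": correct,
            "AEV": " ".join(chunks[:-1] + [flipped]),
            "WNA": " ".join([chunks[0], chunks[-1]])}
```

A full instance would pair the seven context patterns with the complete answer set of Figure 1; only two of the error types are sketched here.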
    </sec>
    <sec id="sec-2">
      <title>2. BLM task and BLM-Agr datasets</title>
      <sec id="sec-2-1">
        <title>BLM-Agr datasets A BLM problem for subject-verb</title>
        <p>agreement consists of a context set of seven sentences
Inspired by existing IQ tests —Raven’s progressive ma- that share the subject-verb agreement phenomenon, but
trices (RPMs)— we have developed a framework, called difer in other aspects – e.g. number of linearly
intervenBlackbird Language Matrices (BLMs) [13] and several ing noun phrases between the subject and the verb (called
datasets [12, 14]. RPMs consist of a sequence of images, attractors because they can interfere with the agreement),
called the context, connected in a logical sequence by diferent grammatical numbers for these attractors, and
underlying generative rules [15]. The task is to deter- diferent clause structures. The sequence is generated
mine the missing element in this visual sequence, the by a rule of progression of number of attractors, and
answer. The candidate answers are constructed to be alternation in the grammatical number of the diferent
similar enough that the solution can be found only if the phrases. Each context is paired with a set of candidate
rules are identified correctly. answers generated from the correct answer by altering</p>
        <p>Solving an RPM problem is usually done in two steps: it to produce minimally contrastive error types. We have
(i) identify the relevant objects and their attributes; (ii) two types of errors (see Figure 1: (i) sequence errors –
decompose the main problem into subproblems, based on these candidate answers are grammatically correct, but
object and attribute identification, in a way that allows they are not the correct continuation of the sequence; (ii)
detecting the global pattern or underlying rules [16]. agreement errors – these candidate answers are
grammatically erroneous, because the verb is in agreement sibilities (e.g.,  = "np-s pp1-s vp-s"), all corresponding
with one of the intervening attractors. By constructing sentences are collected into a set .
candidate answers with such specific error types, we can The dataset consists of triples (, +, − ),
investigate the kind of information and structure learned. where  is an input sentence, + is the correct output –</p>
        <p>The seed data for French was created by manually a sentence diferent from  but with the same chunk
patcompleting data previously published data [17]. From this tern. − are  = 7 incorrect outputs, randomly
initial data, we generated a dataset that comprises three chosen from the sentences that have a chunk pattern
difsubsets of increasing lexical complexity (details in [12]): ferent from . For each language, we sample uniformly
Types I, II, III, corresponding to diferent amounts of approx. 4000 instances from the generated data based on
lexical variation within a problem instance. Each subset the pattern of the input sentence, randomly split 80:20
contains three clause structures uniformly distributed into train:test. The train part is split 80:20 into train:dev,
within the data. The dataset used here is a variation of the resulting in a 2576:630:798 split for train:dev:test.</p>
      </sec>
      <sec id="sec-2-2">
        <title>BLM-AgrF [12] that separates sequence-based from other</title>
        <p>types of errors, to be able to perform deeper analyses
into the behaviour of pretrained language models. 3. Probing the encoding of syntax</p>
        <p>The datasets in English, Italian and Romanian were
created by manually translating the seed French sentences We aim to test whether the syntactic information detected
into the other languages by native (Italian and Romanian) in multilingual pretrained sentence embeddings is based
and near-native (English) speakers. The internal struc- on shallow, language-specific clues, or whether it is more
ture in these languages is very similar, so translations are abstract structural information. Using the subject-verb
approximately parallel. The diferences lie in the treat- agreement task and the parallel datasets in four languages
ment of preposition and determiner sequences that must provides clues to the answer.
be conflated into one word in some cases in Italian and The datasets all share sentences with the same
syntacFrench, but not in English. French and Italian use number- tic structures, as illustrated in Figure 1. However, there
specific determiners and inflections, while Romanian and are language specific diferences, as in the structure of
English encode grammatical number exclusively through the chunks (noun or verb or prepositional phrases) and
inflections. In English most plural forms are marked by each language has diferent ways to encode grammatical
a sufix. Romanian has more variation, and noun inflec- number (see section 2).
tions also encode case. Determiners are separate tokens, If the grammatical information in the sentences in
which are overt indicators of grammatical number and our dataset – i.e. the sequences of chunks with specific
of phrase boundaries, whereas inflections may or may properties relevant to the subject-verb agreement task
not be tokenized separately. (Figure 1) – is an abstract form of knowledge within the</p>
        <p>Table 1 shows the datasets statistics for the four BLM pretrained model, it will be shared across languages. We
problems. After splitting each subset 90:10 into train:test would then see a high level of performance for a model
subsets, we randomly sample 2000 instances as train data. trained on one of these languages, and tested on any
20% of the train data is used for development. of the other. Additionally, when training on a dataset
consisting of data in the four languages, the model should</p>
        <p>English French Italian Romanian detect a shared parameter space that would lead to high
Type I 230 252 230 230 results when testing on data for each language.
Type II 4052 4927 4121 4571 If however the grammatical information is a reflection
Type III 4052 4810 4121 4571 of shallow language indicators, we expect to see higher
performance on languages that have overt grammatical
number and chunk indicators, such as French and Italian,
and a low rate of cross-language transfer.</p>
        <p>A sentence dataset From the seed files for each
language we build a dataset to study sentence structure
independently of a task. The seed files contain noun,
verb and prepositional phrases, with singular and plural
variations. From these chunks, we build sentences with
all (grammatically correct) combinations of np [pp1
[pp2]] vp1. For each chunk pattern  of the 14
pos</p>
      </sec>
      <sec id="sec-2-3">
        <title>1pp1 and pp2 may be included or not, pp2 may be included only if</title>
        <p>pp1 is included</p>
        <sec id="sec-2-3-1">
          <title>3.1. System architectures</title>
        </sec>
      </sec>
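The triple construction for the sentence dataset can be sketched as follows; the toy sentences and the helper name are illustrative assumptions, not the actual seed data or code.

```python
import random

# Sketch of the sentence-dataset construction: sentences are grouped by
# chunk pattern p into sets S_p; an instance (s, s_plus, negatives) pairs
# an input sentence with a same-pattern positive and N different-pattern
# negatives. Data and function names are assumptions for illustration.

def build_instances(sentences_by_pattern, n_negatives=7, seed=0):
    rng = random.Random(seed)
    instances = []
    for pattern, sents in sentences_by_pattern.items():
        # candidates whose chunk pattern differs from the input's
        others = [s for p, ss in sentences_by_pattern.items()
                  if p != pattern for s in ss]
        for s in sents:
            s_plus = rng.choice([x for x in sents if x != s])
            negatives = rng.sample(others, min(n_negatives, len(others)))
            instances.append((s, s_plus, negatives))
    return instances
```

In the setup described above, the sampled instances are then split 80:20 into train:test, and the train part again 80:20 into train:dev.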
      <sec id="sec-2-4">
        <title>A sentence-level VAE To test whether chunk struc</title>
        <p>ture can be detected in sentence embeddings we use a</p>
      </sec>
      <sec id="sec-2-5">
        <title>VAE-like system, which encodes a sentence, and decodes</title>
        <p>a diferent sentence with the same chunk structure,
using a set of contrastive negative examples – sentences
that have diferent chunk structures from the input – to
encourage the latent to encode the chunk structure.</p>
        <p>The architecture of the sentence-level VAE is similar to and several incorrect answers . Every sentence is
a previously proposed system [18]: the encoder consists embedded using the pretrained model. To simplify the
of a CNN layer with a 15x15 kernel, which is applied to a discussion, in the sections that follows, when we say
32x24-shaped sentence embedding, followed by a linear sentence we actually mean its embedding.
layer that compresses the output of the CNN into a latent The two-level VAE system takes a BLM instance as
layer of size 5. The decoder mirrors the encoder. input, decomposes its context sequence  into sentences</p>
        <p>An instance consists of a triple (, +, − ), and passes them individually as input to the
sentencewhere  is an input sentence with embedding  level VAE. For each sentence  ∈ , the system builds
and chunk structure , + is a sentence with embed- on-the-fly the candidate answers for the sentence level:
ding + with same chunk structure , and − = the same sentence  from input is used as the correct
{| = 1, } is a set of  = 7 sentences output, and a random selection of sentences from  are
with embeddings  , each with chunk pattern diferent the negative answers. After an instance is processed by
from  (and diferent from each other). The input  the sentence level, for each sentence  ∈ , we obtain its
is encoded into latent representation , from which we representation from the latent layer  , and reassemble
sample a vector ˜, which is decoded into the output ˆ. the input sequence as  = [ ], and pass it as
To encourage the latent to encode the structure of the in- input to the task-level VAE. The loss function combines
put sentence we use a max-margin loss function, to push the losses on the two levels – a max-margin loss on the
for a higher similarity score for ˆ with the sentence sentence level that contrasts the sentence reconstructed
that has the same chunk pattern as the input (+ ) than on the sentence level with the correct answer and the
the ones that do not. At prediction time, the sentence erroneous ones, and a max-margin loss on the task level
from the {+} ∪ − options that has the highest that contrasts the answer constructed by the decoder
score relative to the decoded answer is taken as correct. with the answer set of the BLM instance (details in [11]).
Two-level VAE for BLMs We use a two-level system 3.2. Experiments
illustrated in Figure 2, which separates the solving of
the BLM task on subject-verb agreement into two steps: To explore how syntactic information – in particular
(i) compress sentence embeddings into a representation chunk structure – is encoded, we perform cross-language
that captures the sentence chunk structure and the rele- and multi-language experiments, using rfist the sentences
vant chunk properties (on the sentence level) (ii) use the dataset, and then the BLM agreement task. We report F1
compressed sentence representations to solve the BLM averages over three runs.
agreement problems, by detecting the pattern across the Cross-lingual experiments – train on data from one
lansequence of structures (on the task level). This archi- guage, test on all the others – show whether patterns
detecture will allow us to test whether sentence structure tected in sentence embeddings that encode chunk
struc– in terms of chunks – is shared across languages in a ture are transferable across languages. The results on
pretrained multilingual model. testing on the same language as the training provide
support for the experimental set-up – the high results show
that the pretrained language model used does encode the
necessary information, and the system architecture is
adequate to distill it.</p>
      </sec>
      <sec id="sec-2-6">
        <title>The multilingual experiments, where we learn a model</title>
        <p>from data in all the languages, will provide additional
clues – if the performance on testing on individual
languages is comparable to when training on each language
alone, it means some information is shared across
languages and can be beneficial.</p>
        <p>All reported experiments use Electra [19]2, with the
sentence representations the embedding of the [CLS]
token (details in [11]).</p>
      </sec>
      <sec id="sec-2-7">
        <title>An instance for a BLM problem consists of an ordered</title>
        <p>context sequence  of sentences,  = {| = 1, 7} as
input, and an answer set  with one correct answer ,</p>
      </sec>
      <sec id="sec-2-8">
        <title>2Electra pretrained model: google/electra-base-discriminator</title>
        <p>3.2.1. Syntactic structure in sentences
We use only the sentence level of the system illustrated
in Figure 2 to explore chunk structure in sentences, using
the data described in Section 2. For the cross-lingual
experiments, the training dataset for each language is
used to train a model that is then tested on each test
set. For the multilingual setup, we assemble a common
training data from the training data for all languages.
3.2.2. Solving the BLM agreement task
We solve the BLM agreement task using the two-level
system, where a compacted sentence representation learned
on the sentence level should help detect patterns in the
input sequence of a BLM instance. Because the datasets
are parallel, with shared sentence and sequence patterns,
we test whether the added learning signal from the task
level can help push the system to learn to map an input
sentence into a representation that captures structure
shared across languages. We perform cross-lingual
experiments, where a model is trained on data from one
language, and tested on all the test sets, and a
multilingual experiment, where for each type I/II/III data, we
assemble a training dataset from the training sets of the
same type from the other languages. The model is then
tested on the separate test sets.</p>
        <sec id="sec-2-8-1">
          <title>3.3. Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-2-9">
        <title>For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plot, but it is reported in the results Tables in Appendix C.</title>
      </sec>
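The max-margin objective and the prediction rule used on the sentence level can be sketched numerically; the margin value, the vector sizes, and the function names below are assumptions for illustration, not the system's actual code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def max_margin_loss(decoded, positive, negatives, margin=0.5):
    """Hinge loss pushing the decoded embedding to score higher with the
    same-pattern sentence than with each different-pattern negative."""
    return sum(max(0.0, margin - cosine(decoded, positive) + cosine(decoded, n))
               for n in negatives)

def predict(decoded, candidates):
    """Prediction: the candidate most similar to the decoded output wins."""
    return int(np.argmax([cosine(decoded, c) for c in candidates]))
```

The same contrastive shape is reused on the task level, with the decoder's output compared against the BLM answer set instead of the sentence candidates.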
    </sec>
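The cross-lingual and multilingual protocols described above reduce to a simple train/test grid, sketched below with placeholder train and evaluate functions (illustrative stand-ins, not the actual training code).

```python
# Sketch of the experimental grid: one model per training language is
# evaluated on every language's test set; a "multi" model is trained on
# the concatenation of all training sets. train()/evaluate() are
# illustrative stand-ins for the VAE training and F1 evaluation.

LANGS = ["en", "fr", "it", "ro"]

def run_grid(train, evaluate, data):
    scores = {}
    for src in LANGS:                      # cross-lingual: train on one...
        model = train(data[src]["train"])
        for tgt in LANGS:                  # ...test on all languages
            scores[(src, tgt)] = evaluate(model, data[tgt]["test"])
    multi = train([x for lang in LANGS for x in data[lang]["train"]])
    for tgt in LANGS:                      # multilingual training setup
        scores[("multi", tgt)] = evaluate(multi, data[tgt]["test"])
    return scores
```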
    <sec id="sec-3">
      <title>4. Results</title>
      <p>Structure in sentences: Figure 3 shows the results of the experiments on detecting chunk structure in sentence embeddings, in the cross-lingual and multilingual training setups, for comparison (detailed results in Table 3). Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance – indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations – there is very little transfer effect; a slight effect is detected for the model learned on Italian and tested on French; (ii) learning using multilingual training data leads to a deterioration of the performance compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus that this information is encoded differently in the languages under study.</p>
      <p>An additional interesting insight comes from the analysis of the latent layer representations. Figure 4 shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages. Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that each language seems to have its own quite separate pattern clusters.</p>
      <sec id="sec-3-2">
        <p>Discussion and related work: Pretrained language models are learned from shallow cooccurrences through
a lexical prediction task. The input information is
transformed through several transformer layers, various parts
boosting each other through self-attention. Analyses of
the architecture of transformer models, like BERT [4],
have localised and followed the flow of specific types
of linguistic information through the system [20, 3], to
the degree that the classical NLP pipeline seems to be
reflected in the succession of the model’s layers. Analysis
of contextualized token embeddings shows that they can
encode specific linguistic information, such as sentence
structure [21] (including in a multilingual set-up [22]),
predicate argument structure [23], subjecthood and
objecthood [24], among others. Sentence embeddings have
also been probed using classifiers, and determined to
encode specific types of linguistic information, such as
subject-verb agreement [9], word order, tree depth,
constituent information [25], auxiliaries [26] and argument
structure [27].</p>
        <p>Generative models like LLAMA seem to use English as
the latent language in the middle layers [28], while other
analyses of internal model parameters have led to
uncovering language-agnostic and language-specific networks
of parameters [29], or neurons encoding cross-language
number agreement information across several internal
layers [30]. It has also been shown that subject-verb
agreement information is not shared by BiLSTM
models [31] or multilingual BERT [32]. Testing the degree
to which word/sentence embeddings are multilingual
has usually been done using a classification probe, for
tasks like NER, POS tagging [33], language identification
[34], or more complex tasks like question answering and
sentence retrieval [35]. There are contradictory results
on various cross-lingual model transfers, some of which
can be explained by factors such as domain and size of
training data, typological closeness of languages [36], or
by the power of the classification probes. Generative or
classification probes do not provide insights into whether
the pretrained model finds deeper regularities and
encodes abstract structures, or the predictions are based on
shallower features that the probe used assembles for the
specific test it is used for [37, 6].</p>
        <p>We aimed to answer this question by using a
multilingual setup, and a simple syntactic structure detection
task in an indirectly supervised setting. The datasets
used – in English, French, Italian and Romanian – are
(approximately) lexically parallel, and are parallel in
syntactic structure. The property of interest is grammatical
number, and the task is subject-verb agreement. The
languages chosen share commonalities – French, Italian
and Romanian are all Romance languages, English and
French share much lexical material – but there are also
differences: French and Italian use a similar manner to
encode grammatical number, mainly through articles that
can also signal phrase boundaries. English has a very
limited form of nominal plural morphology, but determiners
are useful for signaling phrase boundaries. In Romanian,
number is expressed through inflection, suffixation and
case, and articles are also often expressed through specific
suffixes, thus overt phrase boundaries are less common
than in French, Italian and English. These
commonalities and diferences help us interpret the results, and
provide clues on how the targeted syntactic information
is encoded.</p>
        <p>Previous experiments have shown that syntactic
information – chunk sequences and their properties – can be
accessed in transformer-based pretrained sentence
embeddings [11]. In this multilingual setup, we test whether
this information has been identified based on
language-specific shallow features, or whether the system has
uncovered and encoded more abstract structures.</p>
        <p>The low rate of transfer for the monolingual training
setup and the decreased performance for the multilingual
training setup for both our experimental configurations
indicate that the chunk sequence information is language
specific and is assembled by the system based on shallow
features. Further clues come from the fact that the only
transfer happens between French and Italian, which
encode phrases and grammatical number in a very similar
manner. Embedding the sentence structure detection into
a larger system, where it receives an additional learning
signal (shared across languages) does not help to push
towards finding a shared sentence representation space
that encodes in a uniform manner the sentence structure
shared across languages.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <sec id="sec-4-1">
        <title>We have aimed to add some evidence to the question</title>
        <p>How do state-of-the-art systems ≪ know≫ what they
≪ know≫ ? [37] by projecting the subject-verb
agreement problem in a multilingual space. We chose
languages that share syntactic structures, and have
particular diferences that can provide clues about whether
the models learned rely on shallower indicators, or the
pretrained models encode deeper knowledge. Our
experiments show that pretrained language models do not
encode abstract syntactic structures, but rather this
information is assembled "upon request" – by the probe or task
– based on language-specific indicators. Understanding
how information is encoded in large language models can
help determine the next necessary step towards making
language models truly deep.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Acknowledgments</title>
        <p>We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through the SNF Advanced Grant TMAG1_209426 to PM.</p>
        <p>test., Psychological review 97 (1990) 404.</p>
        <p>[17] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in French and English: The role of syntactic hierarchy, Language and Cognitive Processes 17 (2002) 371–404.</p>
        <p>[18] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: B. Can, M. Mozes, S. Cahyawijaya, N. Saphra, N. Kassner, S. Ravfogel, A. Ravichander, C. Zhao, I. Augenstein, A. Rogers, K. Cho, E. Grefenstette, L. Voita (Eds.), Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 22–39. URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.</p>
        <p>[19] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.</p>
        <p>[20] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman, D. Das, et al., What do you learn from context? Probing for sentence structure in contextualized word representations, in: The Seventh International Conference on Learning Representations (ICLR), 2019.</p>
        <p>[24] I. Papadimitriou, E. A. Chi, R. Futrell, K. Mahowald, Deep subjecthood: Higher-order grammatical features in multilingual BERT, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 2522–2532. URL: https://aclanthology.org/2021.eacl-main.215. doi:10.18653/v1/2021.eacl-main.215.</p>
        <p>[25] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single $&amp;!#* vector: Probing sentence embeddings for linguistic properties, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2126–2136. URL: https://aclanthology.org/P18-1198. doi:10.18653/v1/P18-1198.</p>
        <p>[26] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg, Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conpp. 235–249. ference Track Proceedings, OpenReview.net, 2017.
[21] J. Hewitt, C. D. Manning, A structural probe for URL: https://openreview.net/f orum?id=BJh6Ztuxl.
ifnding syntax in word representations, in: Proceed- [27] M. Wilson, J. Petty, R. Frank, How abstract is
linings of the 2019 Conference of the North American guistic generalization in large language models?
exChapter of the Association for Computational Lin- periments with argument structure, Transactions
guistics: Human Language Technologies, Volume of the Association for Computational Linguistics
1 (Long and Short Papers), Association for Compu- 11 (2023) 1377–1395. URL: https://aclanthology.org
tational Linguistics, Minneapolis, Minnesota, 2019, /2023.tacl-1.78. doi:10.1162/tacl_a_00608.
pp. 4129–4138. URL: https://aclanthology.org/N19 [28] C. Wendler, V. Veselovsky, G. Monea, R. West,
-1419. doi:10.18653/v1/N19-1419. Do llamas work in English? on the latent
lan[22] E. A. Chi, J. Hewitt, C. D. Manning, Finding univer- guage of multilingual transformers, in: L.-W.
sal grammatical relations in multilingual BERT, in: Ku, A. Martins, V. Srikumar (Eds.), Proceedings
D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), of the 62nd Annual Meeting of the Association
Proceedings of the 58th Annual Meeting of the As- for Computational Linguistics (Volume 1: Long
Pasociation for Computational Linguistics, Associa- pers), Association for Computational Linguistics,
tion for Computational Linguistics, Online, 2020, Bangkok, Thailand, 2024, pp. 15366–15394. URL:
pp. 5564–5577. URL: https://aclanthology.org/2020. https://aclanthology.org/2024.acl- long.820.
acl-main.493. doi:10.18653/v1/2020.acl-mai doi:10.18653/v1/2024.acl-long.820.
n.493. [29] T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang,
[23] S. Conia, E. Barba, A. Scirè, R. Navigli, Semantic role X. Zhao, F. Wei, J.-R. Wen, Language-specific
neulabeling meets definition modeling: Using natural rons: The key to multilingual capabilities in large
language to describe predicate-argument structures, language models, in: L.-W. Ku, A. Martins, V.
Srikuin: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Find- mar (Eds.), Proceedings of the 62nd Annual Meeting
ings of the Association for Computational Linguis- of the Association for Computational Linguistics
tics: EMNLP 2022, Association for Computational (Volume 1: Long Papers), Association for
CompuLinguistics, Abu Dhabi, United Arab Emirates, 2022, tational Linguistics, Bangkok, Thailand, 2024, pp.
pp. 4253–4270. URL: https://aclanthology.org/202 5701–5715. URL: https://aclanthology.org/2024.ac
2.findings-emnlp.313. doi: 10.18653/v1/2022.f l-long.309. doi:10.18653/v1/2024.acl-long.
indings-emnlp.313. 309.
[30] A. G. de Varda, M. Marelli, Data-driven cross- guistics (Volume 1: Long Papers), Association for
lingual syntax: An agreement study with massively Computational Linguistics, Toronto, Canada, 2023,
multilingual models, Computational Linguistics 49 pp. 5877–5891. URL: https://aclanthology.org/2023.
(2023) 261–299. URL: https://aclanthology.org/2023. acl-long.323. doi:10.18653/v1/2023.acl-lon
cl-2.1. doi:10.1162/coli_a_00472. g.323.
[31] P. Dhar, A. Bisazza, Understanding cross-lingual [37] A. Lenci, Understanding natural language
unsyntactic transfer in multilingual recurrent neural derstanding systems, Sistemi intelligenti, Rivista
networks, in: S. Dobnik, L. Øvrelid (Eds.), Proceed- quadrimestrale di scienze cognitive e di intelligenza
ings of the 23rd Nordic Conference on Computa- artificiale (2023) 277–302. URL: https://www.rivi
tional Linguistics (NoDaLiDa), Linköping Univer- steweb.it/doi/10.1422/107438. doi:10.1422/1074
sity Electronic Press, Sweden, Reykjavik, Iceland 38.
(Online), 2021, pp. 74–85. URL: https://aclantholo
gy.org/2021.nodalida-main.8.
[32] A. Mueller, G. Nicolai, P. Petrou-Zeniou, N. Talmina,</p>
      </sec>
      <sec id="sec-4-3">
        <title>T. Linzen, Cross-linguistic syntactic evaluation of</title>
        <p>word prediction models, in: D. Jurafsky, J. Chai,
N. Schluter, J. Tetreault (Eds.), Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, Association for
Computational Linguistics, Online, 2020, pp. 5523–5539. URL:
https://aclanthology.org/2020.acl- main.490.
doi:10.18653/v1/2020.acl-main.490.
[33] T. Pires, E. Schlinger, D. Garrette, How
multilingual is multilingual BERT?, in: A. Korhonen,
D. Traum, L. Màrquez (Eds.), Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, Association for
Computational Linguistics, Florence, Italy, 2019, pp. 4996–
5001. URL: https://aclanthology.org/P19- 1493.
doi:10.18653/v1/P19-1493.
[34] G. I. Winata, A. Madotto, Z. Lin, R. Liu, J. Yosinski,</p>
      </sec>
      <sec id="sec-4-4">
        <title>P. Fung, Language models are few-shot multilin</title>
        <p>gual learners, in: D. Ataman, A. Birch, A. Conneau,</p>
      </sec>
      <sec id="sec-4-5">
        <title>O. Firat, S. Ruder, G. G. Sahin (Eds.), Proceedings of</title>
        <p>the 1st Workshop on Multilingual Representation</p>
      </sec>
      <sec id="sec-4-6">
        <title>Learning, Association for Computational Linguis</title>
        <p>tics, Punta Cana, Dominican Republic, 2021, pp.</p>
      </sec>
      <sec id="sec-4-7">
        <title>1–15. URL: https://aclanthology.org/2021.mrl-1.1.</title>
        <p>doi:10.18653/v1/2021.mrl-1.1.
[35] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat,</p>
      </sec>
      <sec id="sec-4-8">
        <title>M. Johnson, XTREME: A massively multilingual</title>
        <p>multi-task benchmark for evaluating cross-lingual
generalisation, in: H. D. III, A. Singh (Eds.),
Proceedings of the 37th International Conference on</p>
      </sec>
      <sec id="sec-4-9">
        <title>Machine Learning, volume 119 of Proceedings of</title>
        <p>Machine Learning Research, PMLR, 2020, pp. 4411–
4421. URL: https://proceedings.mlr.press/v119/hu2
0b.html.
[36] F. Philippy, S. Guo, S. Haddadan, Towards a
common understanding of contributing factors for
cross-lingual transfer in multilingual language
models: A review, in: A. Rogers, J. Boyd-Graber,</p>
      </sec>
      <sec id="sec-4-10">
        <title>N. Okazaki (Eds.), Proceedings of the 61st Annual</title>
      </sec>
      <sec id="sec-4-11">
        <title>Meeting of the Association for Computational Lin</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Generating data from a seed file</title>
      <p>To build the sentence data, we use the seed file that was used to generate the subject-verb agreement data. A seed
consists of noun, prepositional and verb phrases with different grammatical numbers, which can be combined to build
sentences consisting of different sequences of such chunks. Table 2 shows a partial line from the seed file. To
produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then
constructed.</p>
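<p>The combination step can be sketched as follows. This is a minimal illustration: the seed values, chunk names and pattern format below are assumptions for the sake of the example, not the contents of the actual seed file.</p>

```python
# Sketch of seed-based generation: a seed provides noun, prepositional
# and verb phrases in singular and plural; combining them according to
# chunk patterns yields sentences with different chunk sequences.
from itertools import product

# Assumed seed format (illustrative values only):
seed = {
    "NP": {"sg": "the cat", "pl": "the cats"},
    "PP1": {"sg": "from the garden", "pl": "from the gardens"},
    "VP": {"sg": "sleeps", "pl": "sleep"},
}

def generate(seed, patterns):
    """Build sentences for every chunk pattern and number assignment.

    Each pattern is a list of chunk names. The subject NP and the VP
    must agree in grammatical number, while intervening prepositional
    phrases (potential agreement attractors) vary freely.
    """
    sentences = []
    for pattern in patterns:
        attractors = [c for c in pattern if c.startswith("PP")]
        for subj_num in ("sg", "pl"):
            for nums in product(("sg", "pl"), repeat=len(attractors)):
                assignment = dict(zip(attractors, nums))
                assignment["NP"] = subj_num
                assignment["VP"] = subj_num  # enforce subject-verb agreement
                chunks = [seed[c][assignment[c]] for c in pattern]
                sentences.append(" ".join(chunks))
    return sentences

sents = generate(seed, [["NP", "VP"], ["NP", "PP1", "VP"]])
# The two patterns yield 2 + 4 = 6 sentences, e.g.
# "the cat from the gardens sleeps" (singular subject, plural attractor).
```

<p>Because the attractor phrases vary independently of the subject, the generated paradigm contains exactly the contrasts needed to test whether a model tracks the subject or a closer, misleading noun.</p>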
    </sec>
    <sec id="sec-6">
      <title>B. Example of data for the agreement BLM</title>
      <p>B.1. Example of BLM instances (type I) in different languages</p>
      <sec id="sec-6-1">
        <title>C.2. Results on the BLM Agr* data</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Results</title>
      <sec id="sec-7-1">
        <title>C.1. Chunk sequence detection in sentences</title>
        <p>[Table: average F1 scores (standard deviations in parentheses) for chunk sequence detection, per BLM type (I, II, III) and language (EN, FR, IT, RO).]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>