Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement

Vivi Nastase1,*, Chunyang Jiang1,2, Giuseppe Samo1 and Paola Merlo1,2
1 Idiap Research Institute, Martigny, Switzerland
2 University of Geneva, Geneva, Switzerland

Abstract
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.

Keywords
syntactic information, synthetic structured data, multi-lingual, cross-lingual, diagnostic studies of deep learning models

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author: vivi.a.nastase@gmail.com (V. Nastase); chunyang.jiang42@gmail.com (C. Jiang); giuseppe.samo@idiap.ch (G. Samo); Paola.Merlo@unige.ch (P. Merlo)

1. Introduction

Large language models, trained on huge amounts of text, have reached a level of performance that rivals human capabilities on a range of established benchmarks [1]. Despite high performance on high-level language processing tasks, it is not yet clear what kind of information these language models encode, and how. For example, transformer-based pretrained models have shown excellent performance in tasks that seem to require that the model encode syntactic information [2].

All the knowledge that the LLMs encode comes from unstructured texts and the shallow regularities they are very good at detecting, which they are able to leverage into information that correlates with higher structures in language. Most notably, [3] have shown that from the unstructured textual input, BERT [4] is able to infer POS, structural, entity-related, syntactic and semantic information at successively higher layers of the architecture, mirroring the classical NLP pipeline [5]. We ask: How is this information encoded in the output layer of the model, i.e. the embeddings? Does it rely on surface information – such as inflections and function words – and is it assembled on the demands of the task/probes [6], or does it indeed reflect something deeper that the language model has assembled through the progressive transformation of the input through its many layers?

To investigate this question, we use a seemingly simple task – subject-verb agreement. Subject-verb agreement is often used to test the syntactic abilities of deep neural networks [7, 8, 9, 10] because, while apparently simple and linear, it is in fact structurally and theoretically complex, and requires connecting the subject and the verb across arbitrarily long or complex structural distance. It has an added useful dimension – it relies on syntactic structure and grammatical number information that many languages share.

In previous work we have shown that simple structural information – the chunk structure of a sentence – which can be leveraged to determine subject-verb agreement, or to contribute towards more semantic tasks, can be detected in the sentence embeddings obtained from a pretrained model [11]. This result, though, does not cast light on whether the discovered structure is deeper and more abstract, or rather just a reflection of surface indicators, such as function words or morphological markers.

To tease apart these two options, we set up an experiment covering four languages: English, French, Italian and Romanian. These languages, while different, have shared properties that make sharing of syntactic structure a reasonable expectation, if the pretrained multilingual model does indeed discover and encode syntactic structure. We use parallel datasets in the four languages, built by (approximately) translating the BLM-AgrF dataset [12], a multiple-choice linguistic test inspired by the Raven Progressive Matrices visual intelligence test, previously used to explore subject-verb agreement in French.

Our work offers two contributions: (i) four parallel datasets – in English, French, Italian and Romanian – focused on subject-verb agreement; (ii) cross-lingual and multilingual testing of a multilingual pretrained model, to explore the degree to which syntactic structure information is shared across different languages. Our cross-lingual and multilingual experiments show poor transfer across languages, even those most related, like Italian and French. This result indicates that pretrained models encode syntactic information based on shallow and language-specific clues, from which they are not yet able to take the step towards abstracting grammatical structure. The datasets are available at https://www.idiap.ch/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.

2. BLM task and BLM-Agr datasets

Inspired by existing IQ tests, Raven's progressive matrices (RPMs), we have developed a framework called Blackbird Language Matrices (BLMs) [13] and several datasets [12, 14]. RPMs consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [15]. The task is to determine the missing element in this visual sequence, the answer. The candidate answers are constructed to be similar enough that the solution can be found only if the rules are identified correctly.

Solving an RPM problem is usually done in two steps: (i) identify the relevant objects and their attributes; (ii) decompose the main problem into subproblems, based on object and attribute identification, in a way that allows detecting the global pattern or underlying rules [16].

Such an approach can be very useful for probing language models, as it allows us to test whether they indeed detect the relevant linguistic objects and their properties, and whether (or to what degree) they use this information to find larger patterns. We have developed BLMs as a linguistic test. Figure 1 illustrates the template of a BLM subject-verb agreement matrix, with the different linguistic objects – chunks/phrases – and their relevant properties, in this case grammatical number. Examples in all languages under investigation are provided in Appendix B.

Context                          Answers
1 NP-sg PP1-sg VP-sg             1 NP-pl PP1-pl PP2-sg VP-pl     Correct
2 NP-pl PP1-sg VP-pl             2 NP-pl PP1-pl et PP2-sg VP-pl  Coord
3 NP-sg PP1-pl VP-sg             3 NP-pl PP1-pl VP-pl            WNA
4 NP-pl PP1-pl VP-pl             4 NP-pl PP1-sg PP1-sg VP-pl     WN1
5 NP-sg PP1-sg PP2-sg VP-sg      5 NP-pl PP1-pl PP2-pl VP-pl     WN2
6 NP-pl PP1-sg PP2-sg VP-pl      6 NP-pl PP1-pl PP2-pl VP-sg     AEV
7 NP-sg PP1-pl PP2-sg VP-sg      7 NP-pl PP1-sg PP2-pl VP-sg     AEN1
8 ???                            8 NP-pl PP1-pl PP2-sg VP-sg     AEN2

Figure 1: BLM instances for verb-subject agreement, with two attractors. The errors can be grouped in two types: (i) sequence errors: WNA = wrong nr. of attractors; WN1 = wrong gram. nr. for 1st attractor noun (N1); WN2 = wrong gram. nr. for 2nd attractor noun (N2); (ii) grammatical errors: AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.

BLM-Agr datasets. A BLM problem for subject-verb agreement consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects – e.g. the number of linearly intervening noun phrases between the subject and the verb (called attractors because they can interfere with the agreement), the grammatical numbers of these attractors, and the clause structures. The sequence is generated by a rule of progression in the number of attractors, and alternation in the grammatical number of the different phrases. Each context is paired with a set of candidate answers generated from the correct answer by altering it to produce minimally contrastive error types. We have two types of errors (see Figure 1): (i) sequence errors – these candidate answers are grammatically correct, but they are not the correct continuation of the sequence; (ii) agreement errors – these candidate answers are grammatically erroneous, because the verb is in agreement with one of the intervening attractors.
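To make the template concrete, the sketch below writes out the Figure 1 matrix as data. This is an illustration of the template only, not the authors' generation code; the pattern strings and error labels follow Figure 1 directly.

```python
# The BLM subject-verb agreement template of Figure 1, written out as data.
# A context is a sequence of seven chunk patterns; the answer set pairs each
# candidate pattern with its error type.

CONTEXT = [
    "NP-sg PP1-sg VP-sg",
    "NP-pl PP1-sg VP-pl",
    "NP-sg PP1-pl VP-sg",
    "NP-pl PP1-pl VP-pl",
    "NP-sg PP1-sg PP2-sg VP-sg",
    "NP-pl PP1-sg PP2-sg VP-pl",
    "NP-sg PP1-pl PP2-sg VP-sg",
]  # the eighth element, "???", is the one the system must choose

ANSWERS = {
    "NP-pl PP1-pl PP2-sg VP-pl":    "Correct",
    "NP-pl PP1-pl et PP2-sg VP-pl": "Coord",  # coordination instead of embedding
    "NP-pl PP1-pl VP-pl":           "WNA",    # wrong number of attractors
    "NP-pl PP1-sg PP1-sg VP-pl":    "WN1",    # wrong grammatical number for N1
    "NP-pl PP1-pl PP2-pl VP-pl":    "WN2",    # wrong grammatical number for N2
    "NP-pl PP1-pl PP2-pl VP-sg":    "AEV",    # agreement error on the verb
    "NP-pl PP1-sg PP2-pl VP-sg":    "AEN1",   # agreement error on N1
    "NP-pl PP1-pl PP2-sg VP-sg":    "AEN2",   # agreement error on N2
}
```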
By constructing candidate answers with such specific error types, we can investigate the kind of information and structure learned.

The seed data for French was created by manually completing previously published data [17]. From this initial data, we generated a dataset that comprises three subsets of increasing lexical complexity (details in [12]): Types I, II, III, corresponding to different amounts of lexical variation within a problem instance. Each subset contains three clause structures uniformly distributed within the data. The dataset used here is a variation of BLM-AgrF [12] that separates sequence-based from other types of errors, to enable deeper analyses of the behaviour of pretrained language models.

The datasets in English, Italian and Romanian were created by manually translating the seed French sentences into the other languages, by native (Italian and Romanian) and near-native (English) speakers. The internal structure in these languages is very similar, so the translations are approximately parallel. The differences lie in the treatment of preposition and determiner sequences, which must be conflated into one word in some cases in Italian and French, but not in English. French and Italian use number-specific determiners and inflections, while Romanian and English encode grammatical number exclusively through inflections. In English most plural forms are marked by a suffix. Romanian has more variation, and noun inflections also encode case. Determiners are separate tokens, which are overt indicators of grammatical number and of phrase boundaries, whereas inflections may or may not be tokenized separately.

Table 1 shows the dataset statistics for the four BLM problems. After splitting each subset 90:10 into train:test subsets, we randomly sample 2000 instances as train data. 20% of the train data is used for development.

          English  French  Italian  Romanian
Type I        230     252      230       230
Type II      4052    4927     4121      4571
Type III     4052    4810     4121      4571

Table 1: Test data statistics. The amount of training data is always 2000 instances.

A sentence dataset. From the seed files for each language we build a dataset to study sentence structure independently of a task. The seed files contain noun, verb and prepositional phrases, with singular and plural variations. From these chunks, we build sentences with all (grammatically correct) combinations of np [pp1 [pp2]] vp (pp1 and pp2 may be included or not; pp2 may be included only if pp1 is included). For each chunk pattern p of the 14 possibilities (e.g., p = "np-s pp1-s vp-s"), all corresponding sentences are collected into a set S_p.

The dataset consists of triples (in, out+, Out-), where in is an input sentence, and out+ is the correct output – a sentence different from in but with the same chunk pattern. Out- are N_negs = 7 incorrect outputs, randomly chosen from the sentences that have a chunk pattern different from that of in. For each language, we sample uniformly approx. 4000 instances from the generated data based on the pattern of the input sentence, randomly split 80:20 into train:test. The train part is split 80:20 into train:dev, resulting in a 2576:630:798 split for train:dev:test.
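The construction of these triples can be sketched as follows. This is a minimal sketch under stated assumptions: the `patterns` layout (a mapping from each chunk pattern p to its sentence set S_p) and the helper name are illustrative, not the released code.

```python
import random

def build_triples(patterns: dict[str, list[str]], n_negs: int = 7, seed: int = 0):
    """Build (in, out+, Out-) triples from the pattern sets S_p.

    `patterns` maps a chunk pattern such as "np-s pp1-s vp-s" to the list of
    sentences with that structure (hypothetical layout, for illustration).
    """
    rng = random.Random(seed)
    triples = []
    for p, sentences in patterns.items():
        for s_in in sentences:
            # correct output: a *different* sentence with the same chunk pattern
            out_pos = rng.choice([s for s in sentences if s != s_in])
            # negatives: one sentence from each of n_negs other chunk patterns,
            # so their patterns differ from p and from each other
            neg_patterns = rng.sample([q for q in patterns if q != p], n_negs)
            out_negs = [rng.choice(patterns[q]) for q in neg_patterns]
            triples.append((s_in, out_pos, out_negs))
    return triples
```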
3. Probing the encoding of syntax

We aim to test whether the syntactic information detected in multilingual pretrained sentence embeddings is based on shallow, language-specific clues, or whether it is more abstract structural information. Using the subject-verb agreement task and the parallel datasets in four languages provides clues to the answer.

The datasets all share sentences with the same syntactic structures, as illustrated in Figure 1. However, there are language-specific differences, in the structure of the chunks (noun, verb or prepositional phrases) and in the way each language encodes grammatical number (see Section 2).

If the grammatical information in the sentences in our dataset – i.e. the sequences of chunks with specific properties relevant to the subject-verb agreement task (Figure 1) – is an abstract form of knowledge within the pretrained model, it will be shared across languages. We would then see a high level of performance for a model trained on one of these languages and tested on any of the others. Additionally, when training on a dataset consisting of data in the four languages, the model should detect a shared parameter space that would lead to high results when testing on data for each language. If, however, the grammatical information is a reflection of shallow language indicators, we expect to see higher performance on languages that have overt grammatical number and chunk indicators, such as French and Italian, and a low rate of cross-language transfer.

3.1. System architectures

A sentence-level VAE. To test whether chunk structure can be detected in sentence embeddings we use a VAE-like system, which encodes a sentence and decodes a different sentence with the same chunk structure, using a set of contrastive negative examples – sentences that have chunk structures different from the input – to encourage the latent to encode the chunk structure.

The architecture of the sentence-level VAE is similar to a previously proposed system [18]: the encoder consists of a CNN layer with a 15x15 kernel, which is applied to a 32x24-shaped sentence embedding, followed by a linear layer that compresses the output of the CNN into a latent layer of size 5. The decoder mirrors the encoder.

An instance consists of a triple (in, out+, Out-), where in is an input sentence with embedding e_in and chunk structure p, out+ is a sentence with embedding e_out+ with the same chunk structure p, and Out- = {s_k | k = 1, N_negs} is a set of N_negs = 7 sentences with embeddings e_sk, each with a chunk pattern different from p (and different from each other). The input e_in is encoded into a latent representation z, from which we sample a vector z̃, which is decoded into the output ê_in. To encourage the latent to encode the structure of the input sentence we use a max-margin loss function, which pushes for a higher similarity score of ê_in with the sentence that has the same chunk pattern as the input (e_out+) than with the ones that do not. At prediction time, the sentence from the {out+} ∪ Out- options that has the highest score relative to the decoded answer is taken as correct.
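A compact sketch of this sentence-level component follows. The kernel size, input shape and latent size are taken from the description above; the channel count, activations, margin value and the use of cosine similarity are assumptions rather than the authors' exact hyperparameters, and the KL term of the VAE objective is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 5  # latent layer size, as described above

class SentenceVAE(nn.Module):
    """Encode a 768-dim [CLS] embedding, reshaped to 32x24, into a 5-dim latent,
    and decode it back to an embedding-shaped output (a sketch, not the released code)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=15)           # (B,1,32,24) -> (B,C,18,10)
        self.to_latent = nn.Linear(channels * 18 * 10, 2 * LATENT)   # mu and logvar
        self.from_latent = nn.Linear(LATENT, channels * 18 * 10)
        self.deconv = nn.ConvTranspose2d(channels, 1, kernel_size=15)  # back to (B,1,32,24)
        self.channels = channels

    def forward(self, e_in: torch.Tensor):
        h = F.relu(self.conv(e_in.view(-1, 1, 32, 24)))
        mu, logvar = self.to_latent(h.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization
        h = F.relu(self.from_latent(z)).view(-1, self.channels, 18, 10)
        e_hat = self.deconv(h).flatten(1)                            # reconstructed embedding
        return e_hat, mu, logvar

def max_margin_loss(e_hat, e_pos, e_negs, margin: float = 1.0):
    """Push the decoded sentence to be more similar to the sentence with the
    same chunk pattern (e_pos) than to the negative examples (e_negs)."""
    pos = F.cosine_similarity(e_hat, e_pos)                          # (B,)
    neg = F.cosine_similarity(e_hat.unsqueeze(1), e_negs, dim=-1)    # (B, N_negs)
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()
```

At prediction time, the candidate with the highest similarity to e_hat would be selected, mirroring the scoring described above.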
Two-level VAE for BLMs. We use a two-level system, illustrated in Figure 2, which separates the solving of the BLM task on subject-verb agreement into two steps: (i) compress sentence embeddings into a representation that captures the sentence chunk structure and the relevant chunk properties (on the sentence level); (ii) use the compressed sentence representations to solve the BLM agreement problems, by detecting the pattern across the sequence of structures (on the task level). This architecture will allow us to test whether sentence structure – in terms of chunks – is shared across languages in a pretrained multilingual model.

Figure 2: A two-level VAE: the sentence level learns to compress a sentence into a representation useful to solve the BLM problem on the task level.

All reported experiments use Electra [19] (pretrained model: google/electra-base-discriminator), with the sentence representation being the embedding of the [CLS] token (details in [11]).

An instance for a BLM problem consists of an ordered context sequence S of sentences, S = {s_i | i = 1, 7}, as input, and an answer set A with one correct answer a_c and several incorrect answers a_err. Every sentence is embedded using the pretrained model. To simplify the discussion, in the sections that follow, when we say sentence we actually mean its embedding.

The two-level VAE system takes a BLM instance as input, decomposes its context sequence S into sentences, and passes them individually as input to the sentence-level VAE. For each sentence s_i ∈ S, the system builds on the fly the candidate answers for the sentence level: the same sentence s_i from the input is used as the correct output, and a random selection of sentences from S are the negative answers. After an instance is processed by the sentence level, for each sentence s_i ∈ S we obtain its representation from the latent layer, l_si, reassemble the input sequence as S_l = stack[l_si], and pass it as input to the task-level VAE. The loss function combines the losses on the two levels – a max-margin loss on the sentence level that contrasts the sentence reconstructed on the sentence level with the correct answer and the erroneous ones, and a max-margin loss on the task level that contrasts the answer constructed by the decoder with the answer set of the BLM instance (details in [11]).
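The composition of the two levels can be sketched as below, reusing the SentenceVAE sketched earlier. The task level is reduced here to a deterministic encoder/decoder pair with guessed sizes; the actual task-level VAE and the combination of the two max-margin losses follow [11], so this is only a schematic of the data flow in Figure 2.

```python
import torch
import torch.nn as nn

class TwoLevelBLM(nn.Module):
    """Sentence level compresses each context sentence to its latent l_si;
    the stacked latents S_l feed the task level, whose decoder produces an
    answer representation scored against the answer set with a max-margin loss."""
    def __init__(self, sent_vae: SentenceVAE, task_latent: int = 5):
        super().__init__()
        self.sent_vae = sent_vae
        self.task_enc = nn.Linear(7 * LATENT, task_latent)  # stacked latents -> task latent
        self.task_dec = nn.Linear(task_latent, 32 * 24)     # decode an answer embedding

    def forward(self, context: torch.Tensor):               # context: (batch, 7, 768)
        latents = []
        for i in range(context.shape[1]):                   # one sentence at a time
            _, mu, _ = self.sent_vae(context[:, i])
            latents.append(mu)                              # l_si: the sentence latent
        s_l = torch.cat(latents, dim=-1)                    # S_l = stack[l_si], (batch, 35)
        answer_hat = self.task_dec(self.task_enc(s_l))
        return answer_hat                                    # compared to the answer set A
```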
3.2. Experiments

To explore how syntactic information – in particular chunk structure – is encoded, we perform cross-language and multi-language experiments, using first the sentences dataset, and then the BLM agreement task. We report F1 averages over three runs.

Cross-lingual experiments – train on data from one language, test on all the others – show whether patterns detected in sentence embeddings that encode chunk structure are transferable across languages. The results when testing on the same language as the training provide support for the experimental set-up: the high results show that the pretrained language model used does encode the necessary information, and that the system architecture is adequate to distill it. The multilingual experiments, where we learn a model from data in all the languages, provide additional clues: if the performance when testing on individual languages is comparable to when training on each language alone, it means some information is shared across languages and can be beneficial.

3.2.1. Syntactic structure in sentences

We use only the sentence level of the system illustrated in Figure 2 to explore chunk structure in sentences, using the data described in Section 2. For the cross-lingual experiments, the training dataset for each language is used to train a model that is then tested on each test set. For the multilingual setup, we assemble a common training set from the training data for all languages.

3.2.2. Solving the BLM agreement task

We solve the BLM agreement task using the two-level system, where a compacted sentence representation learned on the sentence level should help detect patterns in the input sequence of a BLM instance. Because the datasets are parallel, with shared sentence and sequence patterns, we test whether the added learning signal from the task level can help push the system to learn to map an input sentence into a representation that captures structure shared across languages. We perform cross-lingual experiments, where a model is trained on data from one language and tested on all the test sets, and a multilingual experiment, where for each type I/II/III dataset we assemble a training set from the training sets of the same type from the different languages. The model is then tested on the separate test sets.

3.3. Evaluation

For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plots, but it is reported in the results tables in Appendix C.
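The experimental grid can be summarized as in the sketch below. `train_model` and `evaluate_f1` are hypothetical caller-supplied callables standing in for the actual training and evaluation routines; only the loop structure reflects the protocol described above.

```python
LANGS = ["EN", "FR", "IT", "RO"]

def run_grid(train_model, evaluate_f1, train_data: dict, test_data: dict, n_runs: int = 3):
    """Cross-lingual and multilingual runs; returns per-(train, test) F1 lists.
    train_model and evaluate_f1 are hypothetical callables, not released code."""
    scores = {}
    # cross-lingual: train on one language, test on all four
    for src in LANGS:
        for run in range(n_runs):
            model = train_model(train_data[src], seed=run)
            for tgt in LANGS:
                scores.setdefault((src, tgt), []).append(evaluate_f1(model, test_data[tgt]))
    # multilingual: train on the union of the training sets, test per language
    pooled = [x for lang in LANGS for x in train_data[lang]]
    for run in range(n_runs):
        model = train_model(pooled, seed=run)
        for tgt in LANGS:
            scores.setdefault(("MultiLang", tgt), []).append(evaluate_f1(model, test_data[tgt]))
    return scores  # Tables 3-5 report the mean (and sd) over the three runs
```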
4. Results

Structure in sentences. Figure 3 shows the results of the experiments on detecting chunk structure in sentence embeddings, in cross-lingual and multilingual training setups, for comparison (detailed results in Table 3).

Figure 3: Cross-language testing for detecting chunk structure in sentence embeddings.

Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance – indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations – there is very little transfer effect; a slight effect is detected for the model learned on Italian and tested on French; (ii) learning using multilingual training data leads to a deterioration of the performance, compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus this information is encoded differently in the languages under study.

An additional interesting insight comes from the analysis of the latent layer representations. Figure 4 shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages. Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that each language seems to have its own quite separate pattern clusters.

Figure 4: tSNE projection of the latent representations of sentences from the training data, coloured by their chunk pattern. Different markers indicate the languages: "o" for English, "x" for French, "+" for Italian, "*" for Romanian. We note that while representations cluster by pattern, the clusters for different languages are disjoint.
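A visualization in the style of Figure 4 can be produced with a few lines of sklearn/matplotlib. The sketch below assumes the sentence latents and their language/pattern labels have already been collected; the marker assignment follows the figure caption, while everything else is illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latents(latents: np.ndarray, patterns: list[str], langs: list[str]):
    """latents: (N, 5) sentence-level latent vectors; patterns/langs: per-row labels."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(latents)
    markers = {"EN": "o", "FR": "x", "IT": "+", "RO": "*"}   # as in Figure 4
    pattern_ids = {p: i for i, p in enumerate(sorted(set(patterns)))}
    for lang, marker in markers.items():
        idx = [i for i, l in enumerate(langs) if l == lang]
        plt.scatter(xy[idx, 0], xy[idx, 1],
                    c=[pattern_ids[patterns[i]] for i in idx],
                    marker=marker, cmap="tab20", s=12)
    plt.title("tSNE of sentence latents, coloured by chunk pattern")
    plt.show()
```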
Structure in sentences for the BLM agreement task. When the sentence structure detection is embedded in the system for solving the BLM agreement task, where an additional supervision signal comes from the task, we note a similar result as when processing the sentences individually. Figure 5 shows the results for the multilingual and monolingual training setups for the type I data. Complete results are in Tables 4-5 in the appendix.

Figure 5: Average F1 performance when training on type I data, over three runs – cross-language and multi-language.

Discussion and related work. Pretrained language models are learned from shallow cooccurrences through a lexical prediction task. The input information is transformed through several transformer layers, various parts boosting each other through self-attention. Analyses of the architecture of transformer models, like BERT [4], have localised and followed the flow of specific types of linguistic information through the system [20, 3], to the degree that the classical NLP pipeline seems to be reflected in the succession of the model's layers. Analyses of contextualized token embeddings show that they can encode specific linguistic information, such as sentence structure [21] (including in a multilingual set-up [22]), predicate-argument structure [23], and subjecthood and objecthood [24], among others. Sentence embeddings have also been probed using classifiers, and determined to encode specific types of linguistic information, such as subject-verb agreement [9], word order, tree depth, constituent information [25], auxiliaries [26] and argument structure [27].

Generative models like LLAMA seem to use English as the latent language in the middle layers [28], while other analyses of internal model parameters have led to uncovering language-agnostic and language-specific networks of parameters [29], or neurons encoding cross-language number agreement information across several internal layers [30]. It has also been shown that subject-verb agreement information is not shared by BiLSTM models [31] or multilingual BERT [32]. Testing the degree to which word/sentence embeddings are multilingual has usually been done using a classification probe, for tasks like NER and POS tagging [33], language identification [34], or more complex tasks like question answering and sentence retrieval [35]. There are contradictory results on various cross-lingual model transfers, some of which can be explained by factors such as the domain and size of the training data or the typological closeness of the languages [36], or by the power of the classification probes. Generative or classification probes do not provide insights into whether the pretrained model finds deeper regularities and encodes abstract structures, or whether the predictions are based on shallower features that the probe assembles for the specific test it is used for [37, 6].

We aimed to answer this question by using a multilingual setup and a simple syntactic structure detection task in an indirectly supervised setting. The datasets used – in English, French, Italian and Romanian – are (approximately) lexically parallel, and are parallel in syntactic structure. The property of interest is grammatical number, and the task is subject-verb agreement. The languages chosen share commonalities – French, Italian and Romanian are all Romance languages, and English and French share much lexical material – but there are also differences: French and Italian encode grammatical number in a similar manner, mainly through articles that can also signal phrase boundaries. English has a very limited form of nominal plural morphology, but determiners are useful for signaling phrase boundaries. In Romanian, number is expressed through inflection, suffixation and case, and articles are also often expressed through specific suffixes, so overt phrase boundaries are less common than in French, Italian and English. These commonalities and differences help us interpret the results, and provide clues on how the targeted syntactic information is encoded.

Previous experiments have shown that syntactic information – chunk sequences and their properties – can be accessed in transformer-based pretrained sentence embeddings [11]. In this multilingual setup, we test whether this information has been identified based on language-specific shallow features, or whether the system has uncovered and encoded more abstract structures. The low rate of transfer in the monolingual training setup and the decreased performance in the multilingual training setup, for both our experimental configurations, indicate that the chunk sequence information is language-specific and is assembled by the system based on shallow features. Further clues come from the fact that the only transfer happens between French and Italian, which encode phrases and grammatical number in a very similar manner. Embedding the sentence structure detection into a larger system, where it receives an additional learning signal (shared across languages), does not help push it towards finding a shared sentence representation space that encodes the sentence structure shared across the languages in a uniform manner.

5. Conclusions

We have aimed to add some evidence to the question How do state-of-the-art systems «know» what they «know»? [37] by projecting the subject-verb agreement problem into a multilingual space. We chose languages that share syntactic structures, and that have particular differences that can provide clues about whether the models learned rely on shallower indicators, or whether the pretrained models encode deeper knowledge. Our experiments show that pretrained language models do not encode abstract syntactic structures; rather, this information is assembled "upon request" – by the probe or task – based on language-specific indicators. Understanding how information is encoded in large language models can help determine the next necessary step towards making language models truly deep.

Acknowledgments

We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.

References

[1] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
[2] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, O. Levy, Emergent linguistic structure in artificial neural networks trained by self-supervision, Proceedings of the National Academy of Sciences 117 (2020) 30046-30054.
[3] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842-866. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[5] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593-4601. URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.
[6] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 2733-2743. URL: https://aclanthology.org/D19-1275. doi:10.18653/v1/D19-1275.
[7] T. Linzen, E. Dupoux, Y. Goldberg, Assessing the ability of LSTMs to learn syntax-sensitive dependencies, Transactions of the Association of Computational Linguistics 4 (2016) 521-535. URL: https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00115.
[8] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, Colorless green recurrent networks dream hierarchically, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 1195-1205. URL: http://aclweb.org/anthology/N18-1108. doi:10.18653/v1/N18-1108.
[9] Y. Goldberg, Assessing BERT's syntactic abilities, arXiv preprint arXiv:1901.05287 (2019).
[10] T. Linzen, M. Baroni, Syntactic structure from deep learning, Annual Review of Linguistics 7 (2021) 195-212. doi:10.1146/annurev-linguistics-032020-051035.
[11] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, in: Proceedings of the Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024.
[12] A. An, C. Jiang, M. A. Rodriguez, V. Nastase, P. Merlo, BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1363-1374. URL: https://aclanthology.org/2023.eacl-main.99.
[13] P. Merlo, Blackbird Language Matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
[14] G. Samo, V. Nastase, C. Jiang, P. Merlo, BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs, in: Findings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[15] J. C. Raven, Standardization of progressive matrices, British Journal of Medical Psychology 19 (1938) 137-150.
[16] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test, Psychological Review 97 (1990) 404.
[17] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in French and English: The role of syntactic hierarchy, Language and Cognitive Processes 17 (2002) 371-404.
[18] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Toronto, Canada, 2023, pp. 22-39. URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.
[19] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1-18.
[20] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman, D. Das, et al., What do you learn from context? Probing for sentence structure in contextualized word representations, in: The Seventh International Conference on Learning Representations (ICLR), 2019, pp. 235-249.
[21] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4129-4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[22] E. A. Chi, J. Hewitt, C. D. Manning, Finding universal grammatical relations in multilingual BERT, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 5564-5577. URL: https://aclanthology.org/2020.acl-main.493. doi:10.18653/v1/2020.acl-main.493.
[23] S. Conia, E. Barba, A. Scirè, R. Navigli, Semantic role labeling meets definition modeling: Using natural language to describe predicate-argument structures, in: Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 2022, pp. 4253-4270. URL: https://aclanthology.org/2022.findings-emnlp.313. doi:10.18653/v1/2022.findings-emnlp.313.
[24] I. Papadimitriou, E. A. Chi, R. Futrell, K. Mahowald, Deep subjecthood: Higher-order grammatical features in multilingual BERT, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 2021, pp. 2522-2532. URL: https://aclanthology.org/2021.eacl-main.215. doi:10.18653/v1/2021.eacl-main.215.
[25] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 2018, pp. 2126-2136. URL: https://aclanthology.org/P18-1198. doi:10.18653/v1/P18-1198.
[26] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg, Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 2017. URL: https://openreview.net/forum?id=BJh6Ztuxl.
[27] M. Wilson, J. Petty, R. Frank, How abstract is linguistic generalization in large language models? Experiments with argument structure, Transactions of the Association for Computational Linguistics 11 (2023) 1377-1395. URL: https://aclanthology.org/2023.tacl-1.78. doi:10.1162/tacl_a_00608.
[28] C. Wendler, V. Veselovsky, G. Monea, R. West, Do llamas work in English? On the latent language of multilingual transformers, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 15366-15394. URL: https://aclanthology.org/2024.acl-long.820. doi:10.18653/v1/2024.acl-long.820.
[29] T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang, X. Zhao, F. Wei, J.-R. Wen, Language-specific neurons: The key to multilingual capabilities in large language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 5701-5715. URL: https://aclanthology.org/2024.acl-long.309. doi:10.18653/v1/2024.acl-long.309.
[30] A. G. de Varda, M. Marelli, Data-driven cross-lingual syntax: An agreement study with massively multilingual models, Computational Linguistics 49 (2023) 261-299. URL: https://aclanthology.org/2023.cl-2.1. doi:10.1162/coli_a_00472.
[31] P. Dhar, A. Bisazza, Understanding cross-lingual syntactic transfer in multilingual recurrent neural networks, in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (Online), 2021, pp. 74-85. URL: https://aclanthology.org/2021.nodalida-main.8.
[32] A. Mueller, G. Nicolai, P. Petrou-Zeniou, N. Talmina, T. Linzen, Cross-linguistic syntactic evaluation of word prediction models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 5523-5539. URL: https://aclanthology.org/2020.acl-main.490. doi:10.18653/v1/2020.acl-main.490.
[33] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996-5001. URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493.
[34] G. I. Winata, A. Madotto, Z. Lin, R. Liu, J. Yosinski, P. Fung, Language models are few-shot multilingual learners, in: Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic, 2021, pp. 1-15. URL: https://aclanthology.org/2021.mrl-1.1. doi:10.18653/v1/2021.mrl-1.1.
[35] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, M. Johnson, XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation, in: Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 4411-4421. URL: https://proceedings.mlr.press/v119/hu20b.html.
[36] F. Philippy, S. Guo, S. Haddadan, Towards a common understanding of contributing factors for cross-lingual transfer in multilingual language models: A review, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, 2023, pp. 5877-5891. URL: https://aclanthology.org/2023.acl-long.323. doi:10.18653/v1/2023.acl-long.323.
[37] A. Lenci, Understanding natural language understanding systems, Sistemi intelligenti, Rivista quadrimestrale di scienze cognitive e di intelligenza artificiale (2023) 277-302. URL: https://www.rivisteweb.it/doi/10.1422/107438. doi:10.1422/107438.

A. Generating data from a seed file

To build the sentence data, we use the seed file that was used to generate the subject-verb agreement data. A seed, consisting of noun, prepositional and verb phrases with different grammatical numbers, can be combined to build sentences consisting of different sequences of such chunks. Table 2 includes a partial line from the seed file. To produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then constructed.

Seed line:
  Subj_sg: The computer       Subj_pl: The computers
  P1_sg: with the program     P1_pl: with the programs
  P2_sg: of the experiment    P2_pl: of the experiments
  V_sg: is broken             V_pl: are broken

Sentences with different chunk patterns:
  The computer is broken.                                          np-s vp-s
  The computers are broken.                                        np-p vp-p
  The computer with the program is broken.                         np-s pp1-s vp-s
  ...
  The computers with the programs of the experiments are broken.   np-p pp1-p pp2-p vp-p

A BLM instance:
  Context:
    The computer with the program is broken.
    The computers with the program are broken.
    The computer with the programs is broken.
    The computers with the programs are broken.
    The computer with the program of the experiment is broken.
    The computers with the program of the experiment are broken.
    The computer with the programs of the experiment is broken.
  Answer set:
    The computers with the programs of the experiment are broken.   (correct)
    The computers with the programs of the experiments are broken.
    The computers with the program of the experiment are broken.
    The computers with the program of the experiment is broken.
    ...

Table 2: A line from the seed file on top, and a set of individual sentences built from it, as well as one BLM instance.
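The expansion of one seed line into the 14 chunk patterns np [pp1 [pp2]] vp described in Section 2 can be sketched as below. The field names follow the seed-file columns of Table 2; the function is illustrative, and the actual seed files also vary the lexical material.

```python
from itertools import product

SEED = {
    "np":  {"s": "The computer",      "p": "The computers"},
    "pp1": {"s": "with the program",  "p": "with the programs"},
    "pp2": {"s": "of the experiment", "p": "of the experiments"},
    "vp":  {"s": "is broken",         "p": "are broken"},
}

def expand(seed: dict) -> dict[str, str]:
    """Map each chunk pattern (e.g. 'np-s pp1-s vp-s') to one sentence.
    pp2 may appear only if pp1 does; the verb agrees with the subject."""
    sentences = {}
    for chunks in (["np", "vp"], ["np", "pp1", "vp"], ["np", "pp1", "pp2", "vp"]):
        nominal = [c for c in chunks if c != "vp"]
        for numbers in product("sp", repeat=len(nominal)):
            nums = dict(zip(nominal, numbers))
            nums["vp"] = nums["np"]   # subject-verb agreement
            pattern = " ".join(f"{c}-{nums[c]}" for c in chunks)
            sentences[pattern] = " ".join(seed[c][nums[c]] for c in chunks) + "."
    return sentences                   # 2 + 4 + 8 = 14 patterns
```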
B. Example of data for the agreement BLM

B.1. Example of BLM instances (type I) in different languages

English - Context:
1 The owner of the parrot is coming.
2 The owners of the parrot are coming.
3 The owner of the parrots is coming.
4 The owners of the parrots are coming.
5 The owner of the parrot in the tree is coming.
6 The owners of the parrot in the tree are coming.
7 The owner of the parrots in the tree is coming.
? ???

English - Answers:
1 The owners of the parrots in the tree are coming.
2 The owners of the parrots in the trees are coming.
3 The owner of the parrots in the tree is coming.
4 The owners of the parrots in the tree are coming.
5 The owners of the parrot in the tree are coming.
6 The owners of the parrots in the trees are coming.
7 The owners of the parrots and the trees are coming.
8 The owners of the parrots in the tree in the gardens are coming.

French - Context:
1 Le proprietaire du perroquet viendra.
2 Les proprietaires du perroquet viendront.
3 Le proprietaire des perroquets viendra.
4 Les proprietaires des perroquets viendront.
5 Le proprietaire du perroquet dans l’arbre viendra.
6 Les proprietaires du perroquet dans l’arbre viendront.
7 Le proprietaire des perroquets dans l’arbre viendra.
? ???

French - Answers:
1 Les proprietaires des perroquets dans l’arbre viendront.
2 Les proprietaires des perroquets dans les arbres viendront.
3 Le proprietaire des perroquets dans l’arbre viendra.
4 Les proprietaires des perroquets dans l’arbre viendront.
5 Les proprietaires du perroquet dans l’arbre viendront.
6 Les proprietaires des perroquets dans les arbres viendront.
7 Les proprietaires des perroquets et les arbres viendront.
8 Les proprietaires des perroquets dans l’arbre des jardins viendront.

Italian - Context:
1 Il padrone del pappagallo arriverà.
2 I padroni del pappagallo arriveranno.
3 Il padrone dei pappagalli arriverà.
4 I padroni dei pappagalli arriveranno.
5 Il padrone del pappagallo sull’albero arriverà.
6 I padroni del pappagallo sull’albero arriveranno.
7 Il padrone dei pappagalli sull’albero arriverà.
? ???

Italian - Answers:
1 I padroni dei pappagalli sull’albero arriveranno.
2 I padroni dei pappagalli sugli alberi arriveranno.
3 Il padrone dei pappagalli sull’albero arriverà.
4 I padroni dei pappagalli sull’albero arriveranno.
5 I padroni del pappagallo sull’albero arriveranno.
6 I padroni dei pappagalli sugli alberi arriveranno.
7 I padroni dei pappagalli e gli alberi arriveranno.
8 I padroni dei pappagalli sull’albero dei giardini arriveranno.

Romanian - Context:
1 Posesorul papagalului va veni.
2 Posesorii papagalului vor veni.
3 Posesorul papagalilor va veni.
4 Posesorii papagalilor vor veni.
5 Posesorul papagalului din copac va veni.
6 Posesorii papagalului din copac vor veni.
7 Posesorul papagalilor din copac va veni.
? ???

Romanian - Answers:
1 Posesorii papagalilor din copac vor veni.
2 Posesorii papagalilor din copaci vor veni.
3 Posesorul papagalilor din copac va veni.
4 Posesorii papagalilor din copac vor veni.
5 Posesorii papagalului din copac vor veni.
6 Posesorii papagalilor din copaci vor veni.
7 Posesorii papagalilor și copacii vor veni.
8 Posesorii papagalilor din copac din grădini vor veni.

Figure 6: Parallel examples of a type I data instance in English, French, Italian and Romanian.

C. Results

C.1. Chunk sequence detection in sentences

train on     test on EN       FR              IT              RO
MultiLang    0.780 (0.039)    0.865 (0.036)   0.811 (0.012)   0.432 (0.025)
EN           0.975 (0.008)    0.160 (0.005)   0.141 (0.011)   0.144 (0.006)
FR           0.207 (0.018)    0.978 (0.008)   0.206 (0.016)   0.150 (0.010)
IT           0.179 (0.029)    0.372 (0.016)   0.982 (0.008)   0.161 (0.007)
RO           0.164 (0.004)    0.197 (0.021)   0.192 (0.011)   0.673 (0.038)

Table 3: Average F1 scores (standard deviation) for chunk sequence detection in sentences.
C.2. Results on the BLM Agr* data

train on     test on type_I_EN   type_I_FR       type_I_IT       type_I_RO
type_I       0.839 (0.007)       0.938 (0.011)   0.868 (0.021)   0.462 (0.023)
type_II      0.696 (0.006)       0.944 (0.003)   0.759 (0.004)   0.409 (0.031)
type_III     0.558 (0.013)       0.791 (0.026)   0.641 (0.023)   0.290 (0.027)

             type_II_EN          type_II_FR      type_II_IT      type_II_RO
type_I       0.748 (0.001)       0.873 (0.006)   0.851 (0.015)   0.448 (0.015)
type_II      0.642 (0.002)       0.871 (0.012)   0.802 (0.002)   0.394 (0.012)
type_III     0.484 (0.023)       0.760 (0.027)   0.691 (0.023)   0.299 (0.010)

             type_III_EN         type_III_FR     type_III_IT     type_III_RO
type_I       0.643 (0.003)       0.768 (0.004)   0.696 (0.022)   0.236 (0.004)
type_II      0.585 (0.010)       0.797 (0.008)   0.693 (0.009)   0.240 (0.006)
type_III     0.480 (0.026)       0.739 (0.027)   0.691 (0.017)   0.262 (0.002)

Table 4: Multilingual learning results for the BLM agreement task, as average F1 over three runs (standard deviation).

test on      train on type_I_EN  type_I_FR       type_I_IT       type_I_RO
type_I_EN    0.884 (0.002)       0.123 (0.032)   0.125 (0.046)   0.106 (0.034)
type_I_FR    0.103 (0.032)       0.948 (0.009)   0.466 (0.010)   0.164 (0.029)
type_I_IT    0.113 (0.033)       0.341 (0.018)   0.845 (0.010)   0.183 (0.021)
type_I_RO    0.113 (0.026)       0.186 (0.014)   0.188 (0.015)   0.733 (0.027)
type_II_EN   0.757 (0.015)       0.119 (0.009)   0.129 (0.029)   0.103 (0.019)
type_II_FR   0.132 (0.024)       0.868 (0.010)   0.433 (0.008)   0.187 (0.011)
type_II_IT   0.100 (0.020)       0.386 (0.016)   0.875 (0.004)   0.196 (0.009)
type_II_RO   0.088 (0.007)       0.174 (0.005)   0.173 (0.006)   0.726 (0.009)
type_III_EN  0.638 (0.025)       0.117 (0.007)   0.129 (0.028)   0.108 (0.013)
type_III_FR  0.114 (0.007)       0.820 (0.013)   0.406 (0.013)   0.169 (0.017)
type_III_IT  0.091 (0.009)       0.337 (0.016)   0.806 (0.009)   0.170 (0.013)
type_III_RO  0.086 (0.008)       0.170 (0.007)   0.174 (0.003)   0.314 (0.010)

             type_II_EN          type_II_FR      type_II_IT      type_II_RO
type_I_EN    0.772 (0.030)       0.154 (0.023)   0.103 (0.014)   0.090 (0.007)
type_I_FR    0.151 (0.006)       0.972 (0.006)   0.484 (0.015)   0.143 (0.018)
type_I_IT    0.106 (0.014)       0.417 (0.018)   0.791 (0.004)   0.151 (0.034)
type_I_RO    0.107 (0.002)       0.177 (0.020)   0.170 (0.009)   0.625 (0.014)
type_II_EN   0.670 (0.002)       0.158 (0.015)   0.106 (0.006)   0.100 (0.010)
type_II_FR   0.188 (0.009)       0.903 (0.007)   0.434 (0.010)   0.146 (0.013)
type_II_IT   0.100 (0.010)       0.448 (0.011)   0.840 (0.003)   0.152 (0.020)
type_II_RO   0.093 (0.013)       0.182 (0.008)   0.159 (0.011)   0.636 (0.006)
type_III_EN  0.620 (0.005)       0.150 (0.012)   0.116 (0.007)   0.092 (0.009)
type_III_FR  0.168 (0.007)       0.870 (0.005)   0.386 (0.008)   0.127 (0.012)
type_III_IT  0.091 (0.005)       0.387 (0.002)   0.770 (0.008)   0.132 (0.016)
type_III_RO  0.082 (0.014)       0.175 (0.007)   0.172 (0.003)   0.311 (0.017)

             type_III_EN         type_III_FR     type_III_IT     type_III_RO
type_I_EN    0.739 (0.012)       0.174 (0.023)   0.154 (0.013)   0.059 (0.009)
type_I_FR    0.160 (0.007)       0.923 (0.013)   0.434 (0.005)   0.196 (0.029)
type_I_IT    0.132 (0.011)       0.384 (0.016)   0.797 (0.009)   0.197 (0.005)
type_I_RO    0.091 (0.011)       0.164 (0.023)   0.170 (0.022)   0.280 (0.010)
type_II_EN   0.662 (0.008)       0.164 (0.009)   0.142 (0.015)   0.076 (0.010)
type_II_FR   0.202 (0.013)       0.883 (0.001)   0.454 (0.010)   0.203 (0.010)
type_II_IT   0.111 (0.004)       0.425 (0.005)   0.840 (0.002)   0.203 (0.006)
type_II_RO   0.086 (0.007)       0.158 (0.006)   0.158 (0.012)   0.379 (0.013)
type_III_EN  0.654 (0.010)       0.155 (0.006)   0.140 (0.016)   0.082 (0.007)
type_III_FR  0.183 (0.003)       0.860 (0.004)   0.431 (0.004)   0.191 (0.003)
type_III_IT  0.106 (0.003)       0.373 (0.003)   0.836 (0.005)   0.182 (0.004)
type_III_RO  0.082 (0.001)       0.156 (0.007)   0.155 (0.007)   0.353 (0.006)

Table 5: Results as average F1 (sd) over three runs, for the BLM subject-verb agreement task, in the monolingual training setting.