=Paper=
{{Paper
|id=Vol-3878/70_main_long
|storemode=property
|title=Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement
|pdfUrl=https://ceur-ws.org/Vol-3878/70_main_long.pdf
|volume=Vol-3878
|authors=Vivi Nastase,Giuseppe Samo,Chunyang Jiang,Paola Merlo
|dblpUrl=https://dblp.org/rec/conf/clic-it/NastaseSJM24a
}}
==Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement==
Vivi Nastase1,*, Chunyang Jiang1,2, Giuseppe Samo1 and Paola Merlo1,2
1 Idiap Research Institute, Martigny, Switzerland
2 University of Geneva, Geneva, Switzerland
Abstract
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically
valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with
specific properties, and using them to study sentence representations built using pretrained language models. We use a
new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural
phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to
this task requires a system that detects complex linguistic patterns and paradigms in text representations. Using a two-level
architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and
find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a
consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not
shared, even across closely related languages.
Keywords
syntactic information, synthetic structured data, multi-lingual, cross-lingual, diagnostic studies of deep learning models
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
vivi.a.nastase@gmail.com (V. Nastase); chunyang.jiang42@gmail.com (C. Jiang); giuseppe.samo@idiap.ch (G. Samo); Paola.Merlo@unige.ch (P. Merlo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Large language models, trained on huge amounts of text, have reached a level of performance that rivals human capabilities on a range of established benchmarks [1]. Despite high performance on high-level language processing tasks, it is not yet clear what kind of information these language models encode, and how. For example, transformer-based pretrained models have shown excellent performance on tasks that seem to require that the model encode syntactic information [2].

All the knowledge that LLMs encode comes from unstructured texts and the shallow regularities they are very good at detecting, which they are able to leverage into information that correlates with higher structures in language. Most notably, [3] have shown that from the unstructured textual input, BERT [4] is able to infer POS, structural, entity-related, syntactic and semantic information at successively higher layers of the architecture, mirroring the classical NLP pipeline [5]. We ask: how is this information encoded in the output layer of the model, i.e. the embeddings? Does it rely on surface information – such as inflections and function words – assembled on the demands of the task or probe [6], or does it reflect something deeper that the language model has built up through the progressive transformation of the input through its many layers?

To investigate this question, we use a seemingly simple task: subject-verb agreement. Subject-verb agreement is often used to test the syntactic abilities of deep neural networks [7, 8, 9, 10] because, while apparently simple and linear, it is in fact structurally and theoretically complex, and requires connecting the subject and the verb across arbitrarily long or complex structural distance. It has an added useful dimension: it relies on syntactic structure and grammatical number information that many languages share.

In previous work we have shown that simple structural information – the chunk structure of a sentence – which can be leveraged to determine subject-verb agreement, or to contribute towards more semantic tasks, can be detected in the sentence embeddings obtained from a pretrained model [11]. This result, though, does not cast light on whether the discovered structure is deeper and more abstract, or rather just a reflection of surface indicators, such as function words or morphological markers.

To tease apart these two options, we set up an experiment covering four languages: English, French, Italian and Romanian. These languages, while different, have shared properties that make sharing of syntactic structure a reasonable expectation, if the pretrained multilingual model does indeed discover and encode syntactic structure. We use parallel datasets in the four languages, built by (approximately) translating the BLM-AgrF dataset [12], a multiple-choice linguistic test inspired by the Raven Progressive Matrices visual intelligence test, previously used to explore subject-verb agreement in French.

Our work offers two contributions: (i) four parallel datasets – in English, French, Italian and Romanian – focused on subject-verb agreement; (ii) cross-lingual and multilingual testing of a multilingual pretrained model, to explore the degree to which syntactic structure information is shared across different languages. Our cross-lingual and multilingual experiments show poor transfer across languages, even those most closely related, like Italian and French. This result indicates that pretrained models encode syntactic information based on shallow, language-specific clues, from which they are not yet able to take the step towards abstracting grammatical structure. The datasets are available at https://www.idiap.ch/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.

2. BLM task and BLM-Agr datasets

Inspired by existing IQ tests – Raven's progressive matrices (RPMs) – we have developed a framework, called Blackbird Language Matrices (BLMs) [13], and several datasets [12, 14]. RPMs consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [15]. The task is to determine the missing element in this visual sequence, the answer. The candidate answers are constructed to be similar enough that the solution can be found only if the rules are identified correctly.

Solving an RPM problem is usually done in two steps: (i) identify the relevant objects and their attributes; (ii) decompose the main problem into subproblems, based on object and attribute identification, in a way that allows detecting the global pattern or underlying rules [16].

Such an approach can be very useful for probing language models, as it allows testing whether they indeed detect the relevant linguistic objects and their properties, and whether (or to what degree) they use this information to find larger patterns. We have developed BLMs as such a linguistic test. Figure 1 illustrates the template of a BLM subject-verb agreement matrix, with the different linguistic objects – chunks/phrases – and their relevant properties, in this case grammatical number. Examples in all languages under investigation are provided in Appendix B.

Context:
1 NP-sg PP1-sg VP-sg
2 NP-pl PP1-sg VP-pl
3 NP-sg PP1-pl VP-sg
4 NP-pl PP1-pl VP-pl
5 NP-sg PP1-sg PP2-sg VP-sg
6 NP-pl PP1-sg PP2-sg VP-pl
7 NP-sg PP1-pl PP2-sg VP-sg
8 ???

Answers:
1 NP-pl PP1-pl PP2-sg VP-pl      Correct
2 NP-pl PP1-pl et PP2-sg VP-pl   Coord
3 NP-pl PP1-pl VP-pl             WNA
4 NP-pl PP1-sg PP1-sg VP-pl      WN1
5 NP-pl PP1-pl PP2-pl VP-pl      WN2
6 NP-pl PP1-pl PP2-pl VP-sg      AEV
7 NP-pl PP1-sg PP2-pl VP-sg      AEN1
8 NP-pl PP1-pl PP2-sg VP-sg      AEN2

Figure 1: A BLM instance for subject-verb agreement, with two attractors. The errors can be grouped into two types: (i) sequence errors: WNA = wrong number of attractors; WN1 = wrong grammatical number for the 1st attractor noun (N1); WN2 = wrong grammatical number for the 2nd attractor noun (N2); (ii) grammatical errors: AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.

BLM-Agr datasets
A BLM problem for subject-verb agreement consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects – e.g. the number of linearly intervening noun phrases between the subject and the verb (called attractors because they can interfere with the agreement), the grammatical numbers of these attractors, and the clause structures. The sequence is generated by a rule of progression in the number of attractors, and alternation in the grammatical number of the different phrases. Each context is paired with a set of candidate answers generated from the correct answer by altering it to produce minimally contrastive error types. We have two types of errors (see Figure 1): (i) sequence errors – these candidate answers are grammatically correct, but they are not the correct continuation of the sequence; (ii) agreement errors – these candidate answers are grammatically erroneous, because the verb is in agreement with one of the intervening attractors. By constructing candidate answers with such specific error types, we can investigate the kind of information and structure learned.

The seed data for French was created by manually completing previously published data [17]. From this initial data, we generated a dataset comprising three subsets of increasing lexical complexity (details in [12]): Types I, II and III, corresponding to different amounts of lexical variation within a problem instance. Each subset contains three clause structures uniformly distributed within the data. The dataset used here is a variation of BLM-AgrF [12] that separates sequence-based from other types of errors, to enable deeper analyses of the behaviour of pretrained language models.

The datasets in English, Italian and Romanian were created by manually translating the seed French sentences into the other languages, by native (Italian and Romanian) and near-native (English) speakers. The internal structure in these languages is very similar, so the translations are approximately parallel. The differences lie in the treatment of preposition and determiner sequences, which in some cases must be conflated into one word in Italian and French, but not in English. French and Italian use number-specific determiners and inflections, while Romanian and English encode grammatical number exclusively through inflections. In English most plural forms are marked by a suffix. Romanian has more variation, and noun inflections also encode case. Determiners are separate tokens, which are overt indicators of grammatical number and of phrase boundaries, whereas inflections may or may not be tokenized separately.

Table 1 shows the dataset statistics for the four BLM problems. After splitting each subset 90:10 into train:test subsets, we randomly sample 2000 instances as train data. 20% of the train data is used for development.

          English  French  Italian  Romanian
Type I        230     252      230       230
Type II      4052    4927     4121      4571
Type III     4052    4810     4121      4571

Table 1: Test data statistics. The amount of training data is always 2000 instances.

A sentence dataset
From the seed files for each language we build a dataset to study sentence structure independently of a task. The seed files contain noun, verb and prepositional phrases, with singular and plural variations. From these chunks, we build sentences with all (grammatically correct) combinations of np [pp1 [pp2]] vp (pp1 and pp2 may be included or not; pp2 may be included only if pp1 is included). For each chunk pattern p of the 14 possibilities (e.g., p = "np-s pp1-s vp-s"), all corresponding sentences are collected into a set S_p.

The dataset consists of triples (in, out+, Out−), where in is an input sentence and out+ is the correct output – a sentence different from in but with the same chunk pattern. Out− is a set of N_negs = 7 incorrect outputs, randomly chosen from the sentences that have a chunk pattern different from that of in. For each language, we sample approximately 4000 instances uniformly from the generated data, based on the pattern of the input sentence, and randomly split them 80:20 into train:test. The train part is split 80:20 into train:dev, resulting in a 2576:630:798 split for train:dev:test.

3. Probing the encoding of syntax

We aim to test whether the syntactic information detected in multilingual pretrained sentence embeddings is based on shallow, language-specific clues, or whether it is more abstract structural information. Using the subject-verb agreement task and the parallel datasets in four languages provides clues to the answer.

The datasets all share sentences with the same syntactic structures, as illustrated in Figure 1. However, there are language-specific differences, as in the structure of the chunks (noun, verb or prepositional phrases), and each language has different ways to encode grammatical number (see Section 2).

If the grammatical information in the sentences in our dataset – i.e. the sequences of chunks with specific properties relevant to the subject-verb agreement task (Figure 1) – is an abstract form of knowledge within the pretrained model, it will be shared across languages. We would then see a high level of performance for a model trained on one of these languages and tested on any of the others. Additionally, when training on a dataset consisting of data in the four languages, the model should detect a shared parameter space that would lead to high results when testing on data for each language.

If, however, the grammatical information is a reflection of shallow language indicators, we expect to see higher performance on languages that have overt grammatical number and chunk indicators, such as French and Italian, and a low rate of cross-language transfer.

3.1. System architectures

A sentence-level VAE
To test whether chunk structure can be detected in sentence embeddings we use a VAE-like system, which encodes a sentence and decodes a different sentence with the same chunk structure, using a set of contrastive negative examples – sentences that have different chunk structures from the input – to encourage the latent to encode the chunk structure.
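The pattern inventory and the (in, out+, Out−) triples described above can be sketched as follows. This is a minimal illustration, not the released generation code: the function names and the grouping of seed sentences by pattern are assumptions made for the sketch.

```python
import random

# Chunk patterns np [pp1 [pp2]] vp: each chunk is singular (s) or plural (p),
# the verb agrees with the subject np, and pp2 may occur only if pp1 does.
def chunk_patterns():
    pats = []
    for n in ("s", "p"):
        np_, vp = f"np-{n}", f"vp-{n}"
        pats.append(f"{np_} {vp}")
        for p1 in ("s", "p"):
            pats.append(f"{np_} pp1-{p1} {vp}")
            for p2 in ("s", "p"):
                pats.append(f"{np_} pp1-{p1} pp2-{p2} {vp}")
    return pats  # 2 * (1 + 2 + 4) = 14 patterns

def make_triple(pattern, groups, rng, n_negs=7):
    """One (in, out+, Out-) instance: out+ shares the chunk pattern of in;
    the n_negs negative outputs are drawn from other patterns."""
    s_in, s_out = rng.sample(groups[pattern], 2)
    others = [p for p in groups if p != pattern]
    negs = [rng.choice(groups[p]) for p in rng.sample(others, n_negs)]
    return s_in, s_out, negs
```

Triples of this shape are the training instances consumed by the sentence-level VAE described above.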
The architecture of the sentence-level VAE is similar to a previously proposed system [18]: the encoder consists of a CNN layer with a 15x15 kernel, applied to a 32x24-shaped sentence embedding, followed by a linear layer that compresses the output of the CNN into a latent layer of size 5. The decoder mirrors the encoder.

An instance consists of a triple (in, out+, Out−), where in is an input sentence with embedding e_in and chunk structure p, out+ is a sentence with embedding e_out+ with the same chunk structure p, and Out− = {s_k | k = 1, ..., N_negs} is a set of N_negs = 7 sentences with embeddings e_sk, each with a chunk pattern different from p (and different from each other). The input e_in is encoded into a latent representation z_i, from which we sample a vector z̃_i, which is decoded into the output ê_in. To encourage the latent to encode the structure of the input sentence we use a max-margin loss function, which pushes for a higher similarity score of ê_in with the sentence that has the same chunk pattern as the input (e_out+) than with the sentences that do not. At prediction time, the sentence from the {out+} ∪ Out− options that has the highest score relative to the decoded answer is taken as correct.

Two-level VAE for BLMs
We use a two-level system, illustrated in Figure 2, which separates the solving of the BLM task on subject-verb agreement into two steps: (i) compress sentence embeddings into a representation that captures the sentence chunk structure and the relevant chunk properties (the sentence level); (ii) use the compressed sentence representations to solve the BLM agreement problems, by detecting the pattern across the sequence of structures (the task level). This architecture allows us to test whether sentence structure – in terms of chunks – is shared across languages in a pretrained multilingual model.

Figure 2: A two-level VAE: the sentence level learns to compress a sentence into a representation useful to solve the BLM problem on the task level.

All reported experiments use Electra [19] (pretrained model: google/electra-base-discriminator), with the sentence representation being the embedding of the [CLS] token (details in [11]).

An instance for a BLM problem consists of an ordered context sequence S of sentences, S = {s_i | i = 1, ..., 7}, as input, and an answer set A with one correct answer a_c and several incorrect answers a_err. Every sentence is embedded using the pretrained model. To simplify the discussion, in the sections that follow, when we say sentence we actually mean its embedding.

The two-level VAE system takes a BLM instance as input, decomposes its context sequence S into sentences, and passes them individually as input to the sentence-level VAE. For each sentence s_i ∈ S, the system builds on the fly the candidate answers for the sentence level: the same sentence s_i from the input is used as the correct output, and a random selection of sentences from S are the negative answers. After an instance is processed by the sentence level, for each sentence s_i ∈ S we obtain its representation from the latent layer l_si, reassemble the input sequence as S_l = stack[l_si], and pass it as input to the task-level VAE. The loss function combines the losses on the two levels: a max-margin loss on the sentence level that contrasts the sentence reconstructed on the sentence level with the correct answer and the erroneous ones, and a max-margin loss on the task level that contrasts the answer constructed by the decoder with the answer set of the BLM instance (details in [11]).

3.2. Experiments

To explore how syntactic information – in particular chunk structure – is encoded, we perform cross-language and multi-language experiments, using first the sentence dataset, and then the BLM agreement task. We report F1 averages over three runs.

Cross-lingual experiments – train on data from one language, test on all the others – show whether the patterns detected in sentence embeddings that encode chunk structure are transferable across languages. The results when testing on the same language as the training provide support for the experimental set-up: high results show that the pretrained language model used does encode the necessary information, and that the system architecture is adequate to distill it.

The multilingual experiments, where we learn a model from data in all the languages, provide additional clues: if the performance when testing on individual languages is comparable to training on each language alone, some information is shared across languages and can be beneficial.

3.2.1. Syntactic structure in sentences

We use only the sentence level of the system illustrated in Figure 2 to explore chunk structure in sentences, using the data described in Section 2. For the cross-lingual experiments, the training dataset for each language is used to train a model that is then tested on each test set. For the multilingual setup, we assemble common training data from the training data for all languages.
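The contrastive objective and prediction rule of the sentence-level VAE (Section 3.1) can be sketched as follows. Cosine similarity is used here as a stand-in for the similarity score, and the margin formulation is generic; the exact loss used in [11] may differ.

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def max_margin_loss(e_hat, e_pos, e_negs, margin=1.0):
    # Push the decoded embedding e_hat to be more similar to the candidate
    # with the input's chunk pattern (e_pos) than to any negative candidate.
    pos = cosine(e_hat, e_pos)
    return sum(max(0.0, margin - pos + cosine(e_hat, n)) for n in e_negs) / len(e_negs)

def predict(e_hat, candidates):
    # At prediction time, the candidate most similar to the decoded
    # embedding is taken as the correct answer.
    return max(range(len(candidates)), key=lambda i: cosine(e_hat, candidates[i]))
```

The same kind of scoring applies on the task level, where the decoder output is contrasted with the embeddings of the BLM answer set.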
3.2.2. Solving the BLM agreement task

We solve the BLM agreement task using the two-level system, where a compacted sentence representation learned on the sentence level should help detect patterns in the input sequence of a BLM instance. Because the datasets are parallel, with shared sentence and sequence patterns, we test whether the added learning signal from the task level can push the system to learn to map an input sentence into a representation that captures structure shared across languages. We perform cross-lingual experiments, where a model is trained on data from one language and tested on all the test sets, and a multilingual experiment, where for each of the type I/II/III data we assemble a training dataset from the training sets of the same type from the other languages. The model is then tested on the separate test sets.

3.3. Evaluation

For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plots, but it is reported in the results tables in Appendix C.

4. Results

Structure in sentences
Figure 3 shows the results of the experiments on detecting chunk structure in sentence embeddings, in cross-lingual and multilingual training setups, for comparison (detailed results in Table 3).

Figure 3: Cross-language testing for detecting chunk structure in sentence embeddings.

Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance – indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations – there is very little transfer effect; a slight effect is detected for the model learned on Italian and tested on French; (ii) learning from multilingual training data leads to a deterioration of the performance, compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus that this information is encoded differently in the languages under study.

An additional interesting insight comes from the analysis of the latent layer representations. Figure 4 shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages. Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that each language seems to have its own quite separate pattern clusters.

Figure 4: tSNE projection of the latent representations of sentences from the training data, coloured by their chunk pattern. Different markers indicate the languages: "o" for English, "x" for French, "+" for Italian, "*" for Romanian. We note that while representations cluster by pattern, the clusters for different languages are disjoint.

Structure in sentences for the BLM agreement task
When the sentence structure detection is embedded in the system for solving the BLM agreement task, where an additional supervision signal comes from the task, we note a similar result as when processing the sentences individually. Figure 5 shows the results for the multilingual and monolingual training setups for the type I data. Complete results are in Tables 4-5 in the appendix.

Figure 5: Average F1 performance when training on type I data over three runs – cross-language and multi-language.

Discussion and related work
Pretrained language models are learned from shallow cooccurrences through a lexical prediction task. The input information is transformed through several transformer layers, various parts boosting each other through self-attention. Analyses of the architecture of transformer models, like BERT [4], have localised and followed the flow of specific types of linguistic information through the system [20, 3], to the degree that the classical NLP pipeline seems to be reflected in the succession of the model's layers. Analysis of contextualized token embeddings shows that they can encode specific linguistic information, such as sentence structure [21] (including in a multilingual set-up [22]), predicate-argument structure [23], and subjecthood and objecthood [24], among others. Sentence embeddings have also been probed using classifiers, and determined to encode specific types of linguistic information, such as subject-verb agreement [9], word order, tree depth, constituent information [25], auxiliaries [26] and argument structure [27].

Generative models like LLAMA seem to use English as the latent language in the middle layers [28], while other analyses of internal model parameters have led to uncovering language-agnostic and language-specific networks of parameters [29], or neurons encoding cross-language number agreement information across several internal layers [30]. It has also been shown that subject-verb agreement information is not shared by BiLSTM models [31] or multilingual BERT [32]. Testing the degree to which word/sentence embeddings are multilingual has usually been done using a classification probe, for tasks like NER and POS tagging [33], language identification [34], or more complex tasks like question answering and sentence retrieval [35]. There are contradictory results on various cross-lingual model transfers, some of which can be explained by factors such as the domain and size of the training data, the typological closeness of the languages [36], or the power of the classification probes. Generative or classification probes do not provide insights into whether the pretrained model finds deeper regularities and encodes abstract structures, or whether the predictions are based on shallower features that the probe assembles for the specific test it is used for [37, 6].

We aimed to answer this question by using a multilingual setup, and a simple syntactic structure detection task in an indirectly supervised setting. The datasets used – in English, French, Italian and Romanian – are (approximately) lexically parallel, and are parallel in syntactic structure. The property of interest is grammatical number, and the task is subject-verb agreement. The languages chosen share commonalities – French, Italian and Romanian are all Romance languages, and English and French share much lexical material – but there are also differences: French and Italian encode grammatical number in a similar manner, mainly through articles that can also signal phrase boundaries. English has a very limited form of nominal plural morphology, but determiners are useful for signaling phrase boundaries. In Romanian, number is expressed through inflection, suffixation and case, and articles are also often expressed through specific suffixes, so overt phrase boundaries are less common than in French, Italian and English. These commonalities and differences help us interpret the results, and provide clues on how the targeted syntactic information is encoded.

Previous experiments have shown that syntactic information – chunk sequences and their properties – can be accessed in transformer-based pretrained sentence embeddings [11]. In this multilingual setup, we test whether this information has been identified based on language-specific shallow features, or whether the system has uncovered and encoded more abstract structures.

The low rate of transfer in the monolingual training setup and the decreased performance in the multilingual training setup, for both our experimental configurations, indicate that the chunk sequence information is language-specific and is assembled by the system based on shallow features. Further clues come from the fact that the only transfer happens between French and Italian, which encode phrases and grammatical number in a very similar manner. Embedding the sentence structure detection into a larger system, where it receives an additional learning signal (shared across languages), does not help push towards finding a shared sentence representation space that encodes in a uniform manner the sentence structure shared across languages.

5. Conclusions

We have aimed to add some evidence to the question How do state-of-the-art systems "know" what they "know"? [37] by projecting the subject-verb agreement problem into a multilingual space. We chose languages that share syntactic structures, and have particular differences that can provide clues about whether the models learned rely on shallower indicators, or whether the pretrained models encode deeper knowledge. Our experiments show that pretrained language models do not encode abstract syntactic structures; rather, this information is assembled "upon request" – by the probe or task – based on language-specific indicators. Understanding how information is encoded in large language models can help determine the next necessary steps towards making language models truly deep.
Acknowledgments We gratefully acknowledge the https://aclanthology.org/D19-1275. doi:10.18653
partial support of this work by the Swiss National Science /v1/D19-1275.
Foundation, through grant SNF Advanced grant TMAG- [7] T. Linzen, E. Dupoux, Y. Goldberg, Assessing
1_209426 to PM. the ability of LSTMs to learn syntax-sensitive de-
A. Generating data from a seed file
To build the sentence data, we use the seed file from which the subject-verb agreement data were generated. A seed consists of noun, prepositional and verb phrases with different grammatical numbers; these chunks can be combined into sentences with different chunk sequences. Table 2 shows a partial line from the seed file. To produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then constructed.
Subj_sg: The computer        Subj_pl: The computers
P1_sg:   with the program    P1_pl:   with the programs
P2_sg:   of the experiment   P2_pl:   of the experiments
V_sg:    is broken           V_pl:    are broken

Sentences with different chunk sequences:
  The computer is broken.                                        np-s vp-s
  The computers are broken.                                      np-p vp-p
  The computer with the program is broken.                       np-s pp1-s vp-s
  ...
  The computers with the programs of the experiments are broken. np-p pp1-p pp2-p vp-p

A BLM instance:
  Context:
    The computer with the program is broken.
    The computers with the program are broken.
    The computer with the programs is broken.
    The computers with the programs are broken.
    The computer with the program of the experiment is broken.
    The computers with the program of the experiment are broken.
    The computer with the programs of the experiment is broken.
  Answer set:
    The computers with the programs of the experiment are broken.
    The computers with the programs of the experiments are broken.
    The computers with the program of the experiment are broken.
    The computers with the program of the experiment is broken.
    ...

Table 2
A line from the seed file (top), a set of individual sentences built from it, and one BLM instance.
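The combinatorial step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' generation code; the chunk labels and the subject-verb agreement constraint follow Table 2:

```python
from itertools import product

# One (partial) seed line: each chunk type in singular and plural form (Table 2).
seed = {
    "Subj": {"sg": "The computer", "pl": "The computers"},
    "P1":   {"sg": "with the program", "pl": "with the programs"},
    "P2":   {"sg": "of the experiment", "pl": "of the experiments"},
    "V":    {"sg": "is broken", "pl": "are broken"},
}

# Chunk sequences of increasing length; the verb must agree with the subject.
patterns = [("Subj", "V"), ("Subj", "P1", "V"), ("Subj", "P1", "P2", "V")]

def generate(seed, patterns):
    sentences = []
    for pattern in patterns:
        # choose a grammatical number independently for every chunk ...
        for numbers in product(("sg", "pl"), repeat=len(pattern)):
            # ... but enforce subject-verb agreement (first and last chunk)
            if numbers[0] != numbers[-1]:
                continue
            chunks = [seed[c][n] for c, n in zip(pattern, numbers)]
            sentences.append(" ".join(chunks) + ".")
    return sentences

sentences = generate(seed, patterns)
print(sentences[0])    # The computer is broken.
print(len(sentences))  # 14 agreeing sentences from one seed line
```

Translating the seed file and rerunning the same combination step yields the parallel French, Italian and Romanian data.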
B. Example of data for the agreement BLM
B.1. Example of BLM instances (type I) in different languages
English
Context:
1 The owner of the parrot is coming.
2 The owners of the parrot are coming.
3 The owner of the parrots is coming.
4 The owners of the parrots are coming.
5 The owner of the parrot in the tree is coming.
6 The owners of the parrot in the tree are coming.
7 The owner of the parrots in the tree is coming.
? ???
Answers:
1 The owners of the parrots in the tree are coming.
2 The owners of the parrots in the trees are coming.
3 The owner of the parrots in the tree is coming.
4 The owners of the parrots in the tree are coming.
5 The owners of the parrot in the tree are coming.
6 The owners of the parrots in the trees are coming.
7 The owners of the parrots and the trees are coming.
8 The owners of the parrots in the tree in the gardens are coming.

French
Context:
1 Le proprietaire du perroquet viendra.
2 Les proprietaires du perroquet viendront.
3 Le proprietaire des perroquets viendra.
4 Les proprietaires des perroquets viendront.
5 Le proprietaire du perroquet dans l’arbre viendra.
6 Les proprietaires du perroquet dans l’arbre viendront.
7 Le proprietaire des perroquets dans l’arbre viendra.
? ???
Answers:
1 Les proprietaires des perroquets dans l’arbre viendront.
2 Les proprietaires des perroquets dans les arbres viendront.
3 Le proprietaire des perroquets dans l’arbre viendra.
4 Les proprietaires des perroquets dans l’arbre viendront.
5 Les proprietaires du perroquet dans l’arbre viendront.
6 Les proprietaires des perroquets dans les arbres viendront.
7 Les proprietaires des perroquets et les arbres viendront.
8 Les proprietaires des perroquets dans l’arbre des jardins viendront.

Italian
Context:
1 Il padrone del pappagallo arriverà.
2 I padroni del pappagallo arriveranno.
3 Il padrone dei pappagalli arriverà.
4 I padroni dei pappagalli arriveranno.
5 Il padrone del pappagallo sull’albero arriverà.
6 I padroni del pappagallo sull’albero arriveranno.
7 Il padrone dei pappagalli sull’albero arriverà.
? ???
Answers:
1 I padroni dei pappagalli sull’albero arriveranno.
2 I padroni dei pappagalli sugli alberi arriveranno.
3 Il padrone dei pappagalli sull’albero arriverà.
4 I padroni dei pappagalli sull’albero arriveranno.
5 I padroni del pappagallo sull’albero arriveranno.
6 I padroni dei pappagalli sugli alberi arriveranno.
7 I padroni dei pappagalli e gli alberi arriveranno.
8 I padroni dei pappagalli sull’albero dei giardini arriveranno.

Romanian
Context:
1 Posesorul papagalului va veni.
2 Posesorii papagalului vor veni.
3 Posesorul papagalilor va veni.
4 Posesorii papagalilor vor veni.
5 Posesorul papagalului din copac va veni.
6 Posesorii papagalului din copac vor veni.
7 Posesorul papagalilor din copac va veni.
? ???
Answers:
1 Posesorii papagalilor din copac vor veni.
2 Posesorii papagalilor din copaci vor veni.
3 Posesorul papagalilor din copac va veni.
4 Posesorii papagalilor din copac vor veni.
5 Posesorii papagalului din copac vor veni.
6 Posesorii papagalilor din copaci vor veni.
7 Posesorii papagalilor și copacii vor veni.
8 Posesorii papagalilor din copac din grădini vor veni.
Figure 6: Parallel examples of a type I data instance in English, French, Italian and Romanian
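The regularity behind Figure 6 can be made explicit: each context sentence reduces to the grammatical numbers of its chunks, and the missing eighth sentence continues the per-chunk periodic pattern. The sketch below is only an illustration of the paradigm, not the neural solver used in the paper; the role labels follow the chunk notation of Appendix A, and the encoded instance is the English one in Figure 6:

```python
# The seven English context sentences of Figure 6, encoded as the grammatical
# number of each chunk: np = subject noun, pp1/pp2 = embedded nouns, vp = verb.
context = [
    {"np": "sg", "pp1": "sg", "vp": "sg"},
    {"np": "pl", "pp1": "sg", "vp": "pl"},
    {"np": "sg", "pp1": "pl", "vp": "sg"},
    {"np": "pl", "pp1": "pl", "vp": "pl"},
    {"np": "sg", "pp1": "sg", "pp2": "sg", "vp": "sg"},
    {"np": "pl", "pp1": "sg", "pp2": "sg", "vp": "pl"},
    {"np": "sg", "pp1": "pl", "pp2": "sg", "vp": "sg"},
]

def predict_next(context, roles=("np", "pp1", "pp2", "vp")):
    """For each chunk role, find the shortest period of its number sequence
    and extend the sequence by one step."""
    prediction = {}
    for role in roles:
        seq = [sent[role] for sent in context if role in sent]
        if not seq:
            continue
        for k in range(1, len(seq) + 1):
            if all(seq[i] == seq[i - k] for i in range(k, len(seq))):
                prediction[role] = seq[len(seq) - k]  # next element of the cycle
                break
    return prediction

print(predict_next(context))
# {'np': 'pl', 'pp1': 'pl', 'pp2': 'sg', 'vp': 'pl'}
# i.e. "The owners of the parrots in the tree are coming."
```

Selecting the answer candidate whose chunk signature matches this prediction identifies the correct continuation among the contrastive alternatives.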
C. Results
C.1. Chunk sequence detection in sentences
train on \ test on   EN              FR              IT              RO
MultiLang 0.780 (0.039) 0.865 (0.036) 0.811 (0.012) 0.432 (0.025)
EN 0.975 (0.008) 0.160 (0.005) 0.141 (0.011) 0.144 (0.006)
FR 0.207 (0.018) 0.978 (0.008) 0.206 (0.016) 0.150 (0.010)
IT 0.179 (0.029) 0.372 (0.016) 0.982 (0.008) 0.161 (0.007)
RO 0.164 (0.004) 0.197 (0.021) 0.192 (0.011) 0.673 (0.038)
Table 3
Average F1 (standard deviation) for chunk sequence detection in individual sentences.
C.2. Results on the BLM Agr* data
train on \ test on   type_I_EN       type_I_FR       type_I_IT       type_I_RO
type_I 0.839 (0.007) 0.938 (0.011) 0.868 (0.021) 0.462 (0.023)
type_II 0.696 (0.006) 0.944 (0.003) 0.759 (0.004) 0.409 (0.031)
type_III 0.558 (0.013) 0.791 (0.026) 0.641 (0.023) 0.290 (0.027)
type_II_EN type_II_FR type_II_IT type_II_RO
type_I 0.748 (0.001) 0.873 (0.006) 0.851 (0.015) 0.448 (0.015)
type_II 0.642 (0.002) 0.871 (0.012) 0.802 (0.002) 0.394 (0.012)
type_III 0.484 (0.023) 0.760 (0.027) 0.691 (0.023) 0.299 (0.010)
type_III_EN type_III_FR type_III_IT type_III_RO
type_I 0.643 (0.003) 0.768 (0.004) 0.696 (0.022) 0.236 (0.004)
type_II 0.585 (0.010) 0.797 (0.008) 0.693 (0.009) 0.240 (0.006)
type_III 0.480 (0.026) 0.739 (0.027) 0.691 (0.017) 0.262 (0.002)
Table 4
Average F1 (standard deviation) over three runs for the BLM agreement task in the multilingual training setting.
test on \ train on   type_I_EN       type_I_FR       type_I_IT       type_I_RO
type_I_EN 0.884 (0.002) 0.123 (0.032) 0.125 (0.046) 0.106 (0.034)
type_I_FR 0.103 (0.032) 0.948 (0.009) 0.466 (0.010) 0.164 (0.029)
type_I_IT 0.113 (0.033) 0.341 (0.018) 0.845 (0.010) 0.183 (0.021)
type_I_RO 0.113 (0.026) 0.186 (0.014) 0.188 (0.015) 0.733 (0.027)
type_II_EN 0.757 (0.015) 0.119 (0.009) 0.129 (0.029) 0.103 (0.019)
type_II_FR 0.132 (0.024) 0.868 (0.010) 0.433 (0.008) 0.187 (0.011)
type_II_IT 0.100 (0.020) 0.386 (0.016) 0.875 (0.004) 0.196 (0.009)
type_II_RO 0.088 (0.007) 0.174 (0.005) 0.173 (0.006) 0.726 (0.009)
type_III_EN 0.638 (0.025) 0.117 (0.007) 0.129 (0.028) 0.108 (0.013)
type_III_FR 0.114 (0.007) 0.820 (0.013) 0.406 (0.013) 0.169 (0.017)
type_III_IT 0.091 (0.009) 0.337 (0.016) 0.806 (0.009) 0.170 (0.013)
type_III_RO 0.086 (0.008) 0.170 (0.007) 0.174 (0.003) 0.314 (0.010)
type_II_EN type_II_FR type_II_IT type_II_RO
type_I_EN 0.772 (0.030) 0.154 (0.023) 0.103 (0.014) 0.090 (0.007)
type_I_FR 0.151 (0.006) 0.972 (0.006) 0.484 (0.015) 0.143 (0.018)
type_I_IT 0.106 (0.014) 0.417 (0.018) 0.791 (0.004) 0.151 (0.034)
type_I_RO 0.107 (0.002) 0.177 (0.020) 0.170 (0.009) 0.625 (0.014)
type_II_EN 0.670 (0.002) 0.158 (0.015) 0.106 (0.006) 0.100 (0.010)
type_II_FR 0.188 (0.009) 0.903 (0.007) 0.434 (0.010) 0.146 (0.013)
type_II_IT 0.100 (0.010) 0.448 (0.011) 0.840 (0.003) 0.152 (0.020)
type_II_RO 0.093 (0.013) 0.182 (0.008) 0.159 (0.011) 0.636 (0.006)
type_III_EN 0.620 (0.005) 0.150 (0.012) 0.116 (0.007) 0.092 (0.009)
type_III_FR 0.168 (0.007) 0.870 (0.005) 0.386 (0.008) 0.127 (0.012)
type_III_IT 0.091 (0.005) 0.387 (0.002) 0.770 (0.008) 0.132 (0.016)
type_III_RO 0.082 (0.014) 0.175 (0.007) 0.172 (0.003) 0.311 (0.017)
type_III_EN type_III_FR type_III_IT type_III_RO
type_I_EN 0.739 (0.012) 0.174 (0.023) 0.154 (0.013) 0.059 (0.009)
type_I_FR 0.160 (0.007) 0.923 (0.013) 0.434 (0.005) 0.196 (0.029)
type_I_IT 0.132 (0.011) 0.384 (0.016) 0.797 (0.009) 0.197 (0.005)
type_I_RO 0.091 (0.011) 0.164 (0.023) 0.170 (0.022) 0.280 (0.010)
type_II_EN 0.662 (0.008) 0.164 (0.009) 0.142 (0.015) 0.076 (0.010)
type_II_FR 0.202 (0.013) 0.883 (0.001) 0.454 (0.010) 0.203 (0.010)
type_II_IT 0.111 (0.004) 0.425 (0.005) 0.840 (0.002) 0.203 (0.006)
type_II_RO 0.086 (0.007) 0.158 (0.006) 0.158 (0.012) 0.379 (0.013)
type_III_EN 0.654 (0.010) 0.155 (0.006) 0.140 (0.016) 0.082 (0.007)
type_III_FR 0.183 (0.003) 0.860 (0.004) 0.431 (0.004) 0.191 (0.003)
type_III_IT 0.106 (0.003) 0.373 (0.003) 0.836 (0.005) 0.182 (0.004)
type_III_RO 0.082 (0.001) 0.156 (0.007) 0.155 (0.007) 0.353 (0.006)
Table 5
Average F1 (standard deviation) over three runs for the BLM subject-verb agreement task in the monolingual training setting.