Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement

Vivi Nastase1,*, Chunyang Jiang1,2, Giuseppe Samo1 and Paola Merlo1,2
1 Idiap Research Institute, Martigny, Switzerland
2 University of Geneva, Geneva, Switzerland

Abstract
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.

Keywords
syntactic information, synthetic structured data, multi-lingual, cross-lingual, diagnostic studies of deep learning models

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author: vivi.a.nastase@gmail.com (V. Nastase); chunyang.jiang42@gmail.com (C. Jiang); giuseppe.samo@idiap.ch (G. Samo); Paola.Merlo@unige.ch (P. Merlo)

1. Introduction

Large language models, trained on huge amounts of text, have reached a level of performance that rivals human capabilities on a range of established benchmarks [1]. Despite high performance on high-level language processing tasks, it is not yet clear what kind of information these language models encode, and how. For example, transformer-based pretrained models have shown excellent performance in tasks that seem to require that the model encode syntactic information [2].

All the knowledge that the LLMs encode comes from unstructured texts and the shallow regularities they are very good at detecting, which they are able to leverage into information that correlates with higher structures in language. Most notably, [3] have shown that from the unstructured textual input, BERT [4] is able to infer POS, structural, entity-related, syntactic and semantic information at successively higher layers of the architecture, mirroring the classical NLP pipeline [5]. We ask: How is this information encoded in the output layer of the model, i.e. the embeddings? Does it rely on surface information – such as inflections and function words – and is it assembled on the demands of the task/probes [6], or does it indeed reflect something deeper that the language model has assembled through the progressive transformation of the input through its many layers?

To investigate this question, we use a seemingly simple task – subject-verb agreement. Subject-verb agreement is often used to test the syntactic abilities of deep neural networks [7, 8, 9, 10] because, while apparently simple and linear, it is in fact structurally and theoretically complex, and requires connecting the subject and the verb across arbitrarily long or complex structural distance. It has an added useful dimension – it relies on syntactic structure and grammatical number information that many languages share.

In previous work we have shown that simple structural information – the chunk structure of a sentence – which can be leveraged to determine subject-verb agreement, or to contribute towards more semantic tasks, can be detected in the sentence embeddings obtained from a pretrained model [11]. This result, though, does not cast light on whether the discovered structure is deeper and more abstract, or rather just a reflection of surface indicators, such as function words or morphological markers.

To tease apart these two options, we set up an experiment covering four languages: English, French, Italian and Romanian. These languages, while different, have shared properties that make sharing of syntactic structure a reasonable expectation, if the pretrained multilingual model does indeed discover and encode syntactic structure. We use parallel datasets in the four languages, built by (approximately) translating the BLM-AgrF dataset [12], a multiple-choice linguistic test inspired by the Raven Progressive Matrices visual intelligence test, previously used to explore subject-verb agreement in French.

Our work offers two contributions: (i) four parallel datasets – in English, French, Italian and Romanian – focused on subject-verb agreement; (ii) cross-lingual and multilingual testing of a multilingual pretrained model, to explore the degree to which syntactic structure information is shared across different languages. Our cross-lingual and multilingual experiments show poor transfer across languages, even those most related, like Italian and French. This result indicates that pretrained models encode syntactic information based on shallow and language-specific clues, from which they are not yet able to take the step towards abstracting grammatical structure. The datasets are available at https://www.idiap.ch/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.

2. BLM task and BLM-Agr datasets

Inspired by existing IQ tests, Raven's progressive matrices (RPMs), we have developed a framework called Blackbird Language Matrices (BLMs) [13] and several datasets [12, 14]. RPMs consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [15]. The task is to determine the missing element in this visual sequence, the answer. The candidate answers are constructed to be similar enough that the solution can be found only if the rules are identified correctly.

Solving an RPM problem is usually done in two steps: (i) identify the relevant objects and their attributes; (ii) decompose the main problem into subproblems, based on object and attribute identification, in a way that allows detecting the global pattern or underlying rules [16].

Such an approach can be very useful for probing language models, as it allows us to test whether they indeed detect the relevant linguistic objects and their properties, and whether (or to what degree) they use this information to find larger patterns. We have developed BLMs as a linguistic test. Figure 1 illustrates the template of a BLM subject-verb agreement matrix, with the different linguistic objects – chunks/phrases – and their relevant properties, in this case grammatical number. Examples in all languages under investigation are provided in Appendix B.

Context                          Answers
1 NP-sg PP1-sg VP-sg             1 NP-pl PP1-pl PP2-sg VP-pl     Correct
2 NP-pl PP1-sg VP-pl             2 NP-pl PP1-pl et PP2-sg VP-pl  Coord
3 NP-sg PP1-pl VP-sg             3 NP-pl PP1-pl VP-pl            WNA
4 NP-pl PP1-pl VP-pl             4 NP-pl PP1-sg PP1-sg VP-pl     WN1
5 NP-sg PP1-sg PP2-sg VP-sg      5 NP-pl PP1-pl PP2-pl VP-pl     WN2
6 NP-pl PP1-sg PP2-sg VP-pl      6 NP-pl PP1-pl PP2-pl VP-sg     AEV
7 NP-sg PP1-pl PP2-sg VP-sg      7 NP-pl PP1-sg PP2-pl VP-sg     AEN1
8 ???                            8 NP-pl PP1-pl PP2-sg VP-sg     AEN2

Figure 1: BLM instances for verb-subject agreement, with two attractors. The errors can be grouped in two types: (i) sequence errors: WNA = wrong nr. of attractors; WN1 = wrong gram. nr. for 1st attractor noun (N1); WN2 = wrong gram. nr. for 2nd attractor noun (N2); (ii) grammatical errors: AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.

BLM-Agr datasets. A BLM problem for subject-verb agreement consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects – e.g. the number of linearly intervening noun phrases between the subject and the verb (called attractors because they can interfere with the agreement), the grammatical numbers of these attractors, and the clause structures. The sequence is generated by a rule of progression in the number of attractors, and alternation in the grammatical number of the different phrases. Each context is paired with a set of candidate answers generated from the correct answer by altering it to produce minimally contrastive error types. We have two types of errors (see Figure 1): (i) sequence errors – these candidate answers are grammatically correct, but they are not the correct continuation of the sequence; (ii) agreement errors – these candidate answers are grammatically erroneous, because the verb is in agreement with one of the intervening attractors.
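To make the template concrete, the sketch below writes out the Figure 1 matrix as data. This is an illustration of the template only, not the authors' generation code; the pattern strings and error labels follow Figure 1 directly.

```python
# The BLM subject-verb agreement template of Figure 1, written out as data.
# A context is a sequence of seven chunk patterns; the answer set pairs each
# candidate pattern with its error type.

CONTEXT = [
    "NP-sg PP1-sg VP-sg",
    "NP-pl PP1-sg VP-pl",
    "NP-sg PP1-pl VP-sg",
    "NP-pl PP1-pl VP-pl",
    "NP-sg PP1-sg PP2-sg VP-sg",
    "NP-pl PP1-sg PP2-sg VP-pl",
    "NP-sg PP1-pl PP2-sg VP-sg",
]  # the eighth element, "???", is the one the system must choose

ANSWERS = {
    "NP-pl PP1-pl PP2-sg VP-pl":    "Correct",
    "NP-pl PP1-pl et PP2-sg VP-pl": "Coord",  # coordination instead of embedding
    "NP-pl PP1-pl VP-pl":           "WNA",    # wrong number of attractors
    "NP-pl PP1-sg PP1-sg VP-pl":    "WN1",    # wrong grammatical number for N1
    "NP-pl PP1-pl PP2-pl VP-pl":    "WN2",    # wrong grammatical number for N2
    "NP-pl PP1-pl PP2-pl VP-sg":    "AEV",    # agreement error on the verb
    "NP-pl PP1-sg PP2-pl VP-sg":    "AEN1",   # agreement error on N1
    "NP-pl PP1-pl PP2-sg VP-sg":    "AEN2",   # agreement error on N2
}
```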
By constructing candidate answers with such specific error types, we can investigate the kind of information and structure learned.

The seed data for French was created by manually completing previously published data [17]. From this initial data, we generated a dataset that comprises three subsets of increasing lexical complexity (details in [12]): Types I, II, III, corresponding to different amounts of lexical variation within a problem instance. Each subset contains three clause structures uniformly distributed within the data. The dataset used here is a variation of BLM-AgrF [12] that separates sequence-based from other types of errors, to enable deeper analyses of the behaviour of pretrained language models.

The datasets in English, Italian and Romanian were created by manually translating the seed French sentences into the other languages, by native (Italian and Romanian) and near-native (English) speakers. The internal structure in these languages is very similar, so the translations are approximately parallel. The differences lie in the treatment of preposition and determiner sequences, which must be conflated into one word in some cases in Italian and French, but not in English. French and Italian use number-specific determiners and inflections, while Romanian and English encode grammatical number exclusively through inflections. In English most plural forms are marked by a suffix. Romanian has more variation, and noun inflections also encode case. Determiners are separate tokens, which are overt indicators of grammatical number and of phrase boundaries, whereas inflections may or may not be tokenized separately.

Table 1 shows the dataset statistics for the four BLM problems. After splitting each subset 90:10 into train:test subsets, we randomly sample 2000 instances as train data. 20% of the train data is used for development.

          English  French  Italian  Romanian
Type I        230     252      230       230
Type II      4052    4927     4121      4571
Type III     4052    4810     4121      4571

Table 1: Test data statistics. The amount of training data is always 2000 instances.

A sentence dataset. From the seed files for each language we build a dataset to study sentence structure independently of a task. The seed files contain noun, verb and prepositional phrases, with singular and plural variations. From these chunks, we build sentences with all (grammatically correct) combinations of np [pp1 [pp2]] vp (pp1 and pp2 may be included or not; pp2 may be included only if pp1 is included). For each chunk pattern p of the 14 possibilities (e.g., p = "np-s pp1-s vp-s"), all corresponding sentences are collected into a set S_p.

The dataset consists of triples (in, out+, Out-), where in is an input sentence, and out+ is the correct output – a sentence different from in but with the same chunk pattern. Out- are N_negs = 7 incorrect outputs, randomly chosen from the sentences that have a chunk pattern different from that of in. For each language, we sample uniformly approx. 4000 instances from the generated data based on the pattern of the input sentence, randomly split 80:20 into train:test. The train part is split 80:20 into train:dev, resulting in a 2576:630:798 split for train:dev:test.
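The construction of these triples can be sketched as follows. This is a minimal sketch under stated assumptions: the `patterns` layout (a mapping from each chunk pattern p to its sentence set S_p) and the helper name are illustrative, not the released code.

```python
import random

def build_triples(patterns: dict[str, list[str]], n_negs: int = 7, seed: int = 0):
    """Build (in, out+, Out-) triples from the pattern sets S_p.

    `patterns` maps a chunk pattern such as "np-s pp1-s vp-s" to the list of
    sentences with that structure (hypothetical layout, for illustration).
    """
    rng = random.Random(seed)
    triples = []
    for p, sentences in patterns.items():
        for s_in in sentences:
            # correct output: a *different* sentence with the same chunk pattern
            out_pos = rng.choice([s for s in sentences if s != s_in])
            # negatives: one sentence from each of n_negs other chunk patterns,
            # so their patterns differ from p and from each other
            neg_patterns = rng.sample([q for q in patterns if q != p], n_negs)
            out_negs = [rng.choice(patterns[q]) for q in neg_patterns]
            triples.append((s_in, out_pos, out_negs))
    return triples
```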
3. Probing the encoding of syntax

We aim to test whether the syntactic information detected in multilingual pretrained sentence embeddings is based on shallow, language-specific clues, or whether it is more abstract structural information. Using the subject-verb agreement task and the parallel datasets in four languages provides clues to the answer.

The datasets all share sentences with the same syntactic structures, as illustrated in Figure 1. However, there are language-specific differences, in the structure of the chunks (noun, verb or prepositional phrases) and in the way each language encodes grammatical number (see Section 2).

If the grammatical information in the sentences in our dataset – i.e. the sequences of chunks with specific properties relevant to the subject-verb agreement task (Figure 1) – is an abstract form of knowledge within the pretrained model, it will be shared across languages. We would then see a high level of performance for a model trained on one of these languages and tested on any of the others. Additionally, when training on a dataset consisting of data in the four languages, the model should detect a shared parameter space that would lead to high results when testing on data for each language. If, however, the grammatical information is a reflection of shallow language indicators, we expect to see higher performance on languages that have overt grammatical number and chunk indicators, such as French and Italian, and a low rate of cross-language transfer.

3.1. System architectures

A sentence-level VAE. To test whether chunk structure can be detected in sentence embeddings we use a VAE-like system, which encodes a sentence and decodes a different sentence with the same chunk structure, using a set of contrastive negative examples – sentences that have chunk structures different from the input – to encourage the latent to encode the chunk structure.

The architecture of the sentence-level VAE is similar to a previously proposed system [18]: the encoder consists of a CNN layer with a 15x15 kernel, which is applied to a 32x24-shaped sentence embedding, followed by a linear layer that compresses the output of the CNN into a latent layer of size 5. The decoder mirrors the encoder.

An instance consists of a triple (in, out+, Out-), where in is an input sentence with embedding e_in and chunk structure p, out+ is a sentence with embedding e_out+ with the same chunk structure p, and Out- = {s_k | k = 1, N_negs} is a set of N_negs = 7 sentences with embeddings e_sk, each with a chunk pattern different from p (and different from each other). The input e_in is encoded into a latent representation z, from which we sample a vector z̃, which is decoded into the output ê_in. To encourage the latent to encode the structure of the input sentence we use a max-margin loss function, which pushes for a higher similarity score of ê_in with the sentence that has the same chunk pattern as the input (e_out+) than with the ones that do not. At prediction time, the sentence from the {out+} ∪ Out- options that has the highest score relative to the decoded answer is taken as correct.
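A compact sketch of this sentence-level component follows. The kernel size, input shape and latent size are taken from the description above; the channel count, activations, margin value and the use of cosine similarity are assumptions rather than the authors' exact hyperparameters, and the KL term of the VAE objective is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 5  # latent layer size, as described above

class SentenceVAE(nn.Module):
    """Encode a 768-dim [CLS] embedding, reshaped to 32x24, into a 5-dim latent,
    and decode it back to an embedding-shaped output (a sketch, not the released code)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=15)           # (B,1,32,24) -> (B,C,18,10)
        self.to_latent = nn.Linear(channels * 18 * 10, 2 * LATENT)   # mu and logvar
        self.from_latent = nn.Linear(LATENT, channels * 18 * 10)
        self.deconv = nn.ConvTranspose2d(channels, 1, kernel_size=15)  # back to (B,1,32,24)
        self.channels = channels

    def forward(self, e_in: torch.Tensor):
        h = F.relu(self.conv(e_in.view(-1, 1, 32, 24)))
        mu, logvar = self.to_latent(h.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization
        h = F.relu(self.from_latent(z)).view(-1, self.channels, 18, 10)
        e_hat = self.deconv(h).flatten(1)                            # reconstructed embedding
        return e_hat, mu, logvar

def max_margin_loss(e_hat, e_pos, e_negs, margin: float = 1.0):
    """Push the decoded sentence to be more similar to the sentence with the
    same chunk pattern (e_pos) than to the negative examples (e_negs)."""
    pos = F.cosine_similarity(e_hat, e_pos)                          # (B,)
    neg = F.cosine_similarity(e_hat.unsqueeze(1), e_negs, dim=-1)    # (B, N_negs)
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()
```

At prediction time, the candidate with the highest similarity to e_hat would be selected, mirroring the scoring described above.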
Two-level VAE for BLMs. We use a two-level system, illustrated in Figure 2, which separates the solving of the BLM task on subject-verb agreement into two steps: (i) compress sentence embeddings into a representation that captures the sentence chunk structure and the relevant chunk properties (on the sentence level); (ii) use the compressed sentence representations to solve the BLM agreement problems, by detecting the pattern across the sequence of structures (on the task level). This architecture will allow us to test whether sentence structure – in terms of chunks – is shared across languages in a pretrained multilingual model.

Figure 2: A two-level VAE: the sentence level learns to compress a sentence into a representation useful to solve the BLM problem on the task level.

All reported experiments use Electra [19] (pretrained model: google/electra-base-discriminator), with the sentence representation being the embedding of the [CLS] token (details in [11]).

An instance for a BLM problem consists of an ordered context sequence S of sentences, S = {s_i | i = 1, 7}, as input, and an answer set A with one correct answer a_c and several incorrect answers a_err. Every sentence is embedded using the pretrained model. To simplify the discussion, in the sections that follow, when we say sentence we actually mean its embedding.

The two-level VAE system takes a BLM instance as input, decomposes its context sequence S into sentences, and passes them individually as input to the sentence-level VAE. For each sentence s_i ∈ S, the system builds on the fly the candidate answers for the sentence level: the same sentence s_i from the input is used as the correct output, and a random selection of sentences from S are the negative answers. After an instance is processed by the sentence level, for each sentence s_i ∈ S we obtain its representation from the latent layer, l_si, reassemble the input sequence as S_l = stack[l_si], and pass it as input to the task-level VAE. The loss function combines the losses on the two levels – a max-margin loss on the sentence level that contrasts the sentence reconstructed on the sentence level with the correct answer and the erroneous ones, and a max-margin loss on the task level that contrasts the answer constructed by the decoder with the answer set of the BLM instance (details in [11]).
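The composition of the two levels can be sketched as below, reusing the SentenceVAE sketched earlier. The task level is reduced here to a deterministic encoder/decoder pair with guessed sizes; the actual task-level VAE and the combination of the two max-margin losses follow [11], so this is only a schematic of the data flow in Figure 2.

```python
import torch
import torch.nn as nn

class TwoLevelBLM(nn.Module):
    """Sentence level compresses each context sentence to its latent l_si;
    the stacked latents S_l feed the task level, whose decoder produces an
    answer representation scored against the answer set with a max-margin loss."""
    def __init__(self, sent_vae: SentenceVAE, task_latent: int = 5):
        super().__init__()
        self.sent_vae = sent_vae
        self.task_enc = nn.Linear(7 * LATENT, task_latent)  # stacked latents -> task latent
        self.task_dec = nn.Linear(task_latent, 32 * 24)     # decode an answer embedding

    def forward(self, context: torch.Tensor):               # context: (batch, 7, 768)
        latents = []
        for i in range(context.shape[1]):                   # one sentence at a time
            _, mu, _ = self.sent_vae(context[:, i])
            latents.append(mu)                              # l_si: the sentence latent
        s_l = torch.cat(latents, dim=-1)                    # S_l = stack[l_si], (batch, 35)
        answer_hat = self.task_dec(self.task_enc(s_l))
        return answer_hat                                    # compared to the answer set A
```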
3.2. Experiments

To explore how syntactic information – in particular chunk structure – is encoded, we perform cross-language and multi-language experiments, using first the sentences dataset, and then the BLM agreement task. We report F1 averages over three runs.

Cross-lingual experiments – train on data from one language, test on all the others – show whether patterns detected in sentence embeddings that encode chunk structure are transferable across languages. The results when testing on the same language as the training provide support for the experimental set-up: the high results show that the pretrained language model used does encode the necessary information, and that the system architecture is adequate to distill it. The multilingual experiments, where we learn a model from data in all the languages, provide additional clues: if the performance when testing on individual languages is comparable to when training on each language alone, it means some information is shared across languages and can be beneficial.

3.2.1. Syntactic structure in sentences

We use only the sentence level of the system illustrated in Figure 2 to explore chunk structure in sentences, using the data described in Section 2. For the cross-lingual experiments, the training dataset for each language is used to train a model that is then tested on each test set. For the multilingual setup, we assemble a common training set from the training data for all languages.

3.2.2. Solving the BLM agreement task

We solve the BLM agreement task using the two-level system, where a compacted sentence representation learned on the sentence level should help detect patterns in the input sequence of a BLM instance. Because the datasets are parallel, with shared sentence and sequence patterns, we test whether the added learning signal from the task level can help push the system to learn to map an input sentence into a representation that captures structure shared across languages. We perform cross-lingual experiments, where a model is trained on data from one language and tested on all the test sets, and a multilingual experiment, where for each type I/II/III dataset we assemble a training set from the training sets of the same type from the different languages. The model is then tested on the separate test sets.

3.3. Evaluation

For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plots, but it is reported in the results tables in Appendix C.
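The experimental grid can be summarized as in the sketch below. `train_model` and `evaluate_f1` are hypothetical caller-supplied callables standing in for the actual training and evaluation routines; only the loop structure reflects the protocol described above.

```python
LANGS = ["EN", "FR", "IT", "RO"]

def run_grid(train_model, evaluate_f1, train_data: dict, test_data: dict, n_runs: int = 3):
    """Cross-lingual and multilingual runs; returns per-(train, test) F1 lists.
    train_model and evaluate_f1 are hypothetical callables, not released code."""
    scores = {}
    # cross-lingual: train on one language, test on all four
    for src in LANGS:
        for run in range(n_runs):
            model = train_model(train_data[src], seed=run)
            for tgt in LANGS:
                scores.setdefault((src, tgt), []).append(evaluate_f1(model, test_data[tgt]))
    # multilingual: train on the union of the training sets, test per language
    pooled = [x for lang in LANGS for x in train_data[lang]]
    for run in range(n_runs):
        model = train_model(pooled, seed=run)
        for tgt in LANGS:
            scores.setdefault(("MultiLang", tgt), []).append(evaluate_f1(model, test_data[tgt]))
    return scores  # Tables 3-5 report the mean (and sd) over the three runs
```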
4. Results

Structure in sentences. Figure 3 shows the results of the experiments on detecting chunk structure in sentence embeddings, in cross-lingual and multilingual training setups, for comparison (detailed results in Table 3).

Figure 3: Cross-language testing for detecting chunk structure in sentence embeddings.

Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance – indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations – there is very little transfer effect; a slight effect is detected for the model learned on Italian and tested on French; (ii) learning using multilingual training data leads to a deterioration of the performance, compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus this information is encoded differently in the languages under study.

An additional interesting insight comes from the analysis of the latent layer representations. Figure 4 shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages. Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that each language seems to have its own quite separate pattern clusters.

Figure 4: tSNE projection of the latent representations of sentences from the training data, coloured by their chunk pattern. Different markers indicate the languages: "o" for English, "x" for French, "+" for Italian, "*" for Romanian. We note that while representations cluster by pattern, the clusters for different languages are disjoint.
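A visualization in the style of Figure 4 can be produced with a few lines of sklearn/matplotlib. The sketch below assumes the sentence latents and their language/pattern labels have already been collected; the marker assignment follows the figure caption, while everything else is illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latents(latents: np.ndarray, patterns: list[str], langs: list[str]):
    """latents: (N, 5) sentence-level latent vectors; patterns/langs: per-row labels."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(latents)
    markers = {"EN": "o", "FR": "x", "IT": "+", "RO": "*"}   # as in Figure 4
    pattern_ids = {p: i for i, p in enumerate(sorted(set(patterns)))}
    for lang, marker in markers.items():
        idx = [i for i, l in enumerate(langs) if l == lang]
        plt.scatter(xy[idx, 0], xy[idx, 1],
                    c=[pattern_ids[patterns[i]] for i in idx],
                    marker=marker, cmap="tab20", s=12)
    plt.title("tSNE of sentence latents, coloured by chunk pattern")
    plt.show()
```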
Structure in sentences for the BLM agreement task. When the sentence structure detection is embedded in the system for solving the BLM agreement task, where an additional supervision signal comes from the task, we note a similar result as when processing the sentences individually. Figure 5 shows the results for the multilingual and monolingual training setups for the type I data. Complete results are in Tables 4-5 in the appendix.

Figure 5: Average F1 performance when training on type I data, over three runs – cross-language and multi-language.

Discussion and related work. Pretrained language models are learned from shallow cooccurrences through a lexical prediction task. The input information is transformed through several transformer layers, various parts boosting each other through self-attention. Analyses of the architecture of transformer models, like BERT [4], have localised and followed the flow of specific types of linguistic information through the system [20, 3], to the degree that the classical NLP pipeline seems to be reflected in the succession of the model's layers. Analyses of contextualized token embeddings show that they can encode specific linguistic information, such as sentence structure [21] (including in a multilingual set-up [22]), predicate-argument structure [23], and subjecthood and objecthood [24], among others. Sentence embeddings have also been probed using classifiers, and determined to encode specific types of linguistic information, such as subject-verb agreement [9], word order, tree depth, constituent information [25], auxiliaries [26] and argument structure [27].

Generative models like LLAMA seem to use English as the latent language in the middle layers [28], while other analyses of internal model parameters have led to uncovering language-agnostic and language-specific networks of parameters [29], or neurons encoding cross-language number agreement information across several internal layers [30]. It has also been shown that subject-verb agreement information is not shared by BiLSTM models [31] or multilingual BERT [32]. Testing the degree to which word/sentence embeddings are multilingual has usually been done using a classification probe, for tasks like NER and POS tagging [33], language identification [34], or more complex tasks like question answering and sentence retrieval [35]. There are contradictory results on various cross-lingual model transfers, some of which can be explained by factors such as the domain and size of the training data or the typological closeness of the languages [36], or by the power of the classification probes. Generative or classification probes do not provide insights into whether the pretrained model finds deeper regularities and encodes abstract structures, or whether the predictions are based on shallower features that the probe assembles for the specific test it is used for [37, 6].

We aimed to answer this question by using a multilingual setup and a simple syntactic structure detection task in an indirectly supervised setting. The datasets used – in English, French, Italian and Romanian – are (approximately) lexically parallel, and are parallel in syntactic structure. The property of interest is grammatical number, and the task is subject-verb agreement. The languages chosen share commonalities – French, Italian and Romanian are all Romance languages, and English and French share much lexical material – but there are also differences: French and Italian encode grammatical number in a similar manner, mainly through articles that can also signal phrase boundaries. English has a very limited form of nominal plural morphology, but determiners are useful for signaling phrase boundaries. In Romanian, number is expressed through inflection, suffixation and case, and articles are also often expressed through specific suffixes, so overt phrase boundaries are less common than in French, Italian and English. These commonalities and differences help us interpret the results, and provide clues on how the targeted syntactic information is encoded.

Previous experiments have shown that syntactic information – chunk sequences and their properties – can be accessed in transformer-based pretrained sentence embeddings [11]. In this multilingual setup, we test whether this information has been identified based on language-specific shallow features, or whether the system has uncovered and encoded more abstract structures. The low rate of transfer in the monolingual training setup and the decreased performance in the multilingual training setup, for both our experimental configurations, indicate that the chunk sequence information is language-specific and is assembled by the system based on shallow features. Further clues come from the fact that the only transfer happens between French and Italian, which encode phrases and grammatical number in a very similar manner. Embedding the sentence structure detection into a larger system, where it receives an additional learning signal (shared across languages), does not help push it towards finding a shared sentence representation space that encodes the sentence structure shared across the languages in a uniform manner.

5. Conclusions

We have aimed to add some evidence to the question How do state-of-the-art systems «know» what they «know»? [37] by projecting the subject-verb agreement problem into a multilingual space. We chose languages that share syntactic structures, and that have particular differences that can provide clues about whether the models learned rely on shallower indicators, or whether the pretrained models encode deeper knowledge. Our experiments show that pretrained language models do not encode abstract syntactic structures; rather, this information is assembled "upon request" – by the probe or task – based on language-specific indicators. Understanding how information is encoded in large language models can help determine the next necessary step towards making language models truly deep.

Acknowledgments

We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.

References

[1] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
[2] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, O. Levy, Emergent linguistic structure in artificial neural networks trained by self-supervision, Proceedings of the National Academy of Sciences 117 (2020) 30046-30054.
[3] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842-866. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[5] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593-4601. URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.
[6] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 2733-2743. URL: https://aclanthology.org/D19-1275. doi:10.18653/v1/D19-1275.
[7] T. Linzen, E. Dupoux, Y. Goldberg, Assessing the ability of LSTMs to learn syntax-sensitive dependencies, Transactions of the Association of Computational Linguistics 4 (2016) 521-535. URL: https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00115.
[8] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, Colorless green recurrent networks dream hierarchically, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 1195-1205. URL: http://aclweb.org/anthology/N18-1108. doi:10.18653/v1/N18-1108.
[9] Y. Goldberg, Assessing BERT's syntactic abilities, arXiv preprint arXiv:1901.05287 (2019).
[10] T. Linzen, M. Baroni, Syntactic structure from deep learning, Annual Review of Linguistics 7 (2021) 195-212. doi:10.1146/annurev-linguistics-032020-051035.
[11] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, in: Proceedings of the Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024.
[12] A. An, C. Jiang, M. A. Rodriguez, V. Nastase, P. Merlo, BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1363-1374. URL: https://aclanthology.org/2023.eacl-main.99.
[13] P. Merlo, Blackbird Language Matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
[14] G. Samo, V. Nastase, C. Jiang, P. Merlo, BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs, in: Findings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[15] J. C. Raven, Standardization of progressive matrices, British Journal of Medical Psychology 19 (1938) 137-150.
[16] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test, Psychological Review 97 (1990) 404.
[17] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in French and English: The role of syntactic hierarchy, Language and Cognitive Processes 17 (2002) 371-404.
[18] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Toronto, Canada, 2023, pp. 22-39. URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.
[19] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1-18.
[20] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman, D. Das, et al., What do you learn from context? Probing for sentence structure in contextualized word representations, in: The Seventh International Conference on Learning Representations (ICLR), 2019, pp. 235-249.
[21] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4129-4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[22] E. A. Chi, J. Hewitt, C. D. Manning, Finding universal grammatical relations in multilingual BERT, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 5564-5577. URL: https://aclanthology.org/2020.acl-main.493. doi:10.18653/v1/2020.acl-main.493.
[23] S. Conia, E. Barba, A. Scirè, R. Navigli, Semantic role labeling meets definition modeling: Using natural language to describe predicate-argument structures, in: Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 2022, pp. 4253-4270. URL: https://aclanthology.org/2022.findings-emnlp.313. doi:10.18653/v1/2022.findings-emnlp.313.
[24] I. Papadimitriou, E. A. Chi, R. Futrell, K. Mahowald, Deep subjecthood: Higher-order grammatical features in multilingual BERT, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 2021, pp. 2522-2532. URL: https://aclanthology.org/2021.eacl-main.215. doi:10.18653/v1/2021.eacl-main.215.
[25] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 2018, pp. 2126-2136. URL: https://aclanthology.org/P18-1198. doi:10.18653/v1/P18-1198.
[26] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg, Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 2017. URL: https://openreview.net/forum?id=BJh6Ztuxl.
[27] M. Wilson, J. Petty, R. Frank, How abstract is linguistic generalization in large language models? Experiments with argument structure, Transactions of the Association for Computational Linguistics 11 (2023) 1377-1395. URL: https://aclanthology.org/2023.tacl-1.78. doi:10.1162/tacl_a_00608.
[28] C. Wendler, V. Veselovsky, G. Monea, R. West, Do llamas work in English? On the latent language of multilingual transformers, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 15366-15394. URL: https://aclanthology.org/2024.acl-long.820. doi:10.18653/v1/2024.acl-long.820.
[29] T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang, X. Zhao, F. Wei, J.-R. Wen, Language-specific neurons: The key to multilingual capabilities in large language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 5701-5715. URL: https://aclanthology.org/2024.acl-long.309. doi:10.18653/v1/2024.acl-long.309.
[30] A. G. de Varda, M. Marelli, Data-driven cross-lingual syntax: An agreement study with massively multilingual models, Computational Linguistics 49 (2023) 261-299. URL: https://aclanthology.org/2023.cl-2.1. doi:10.1162/coli_a_00472.
[31] P. Dhar, A. Bisazza, Understanding cross-lingual syntactic transfer in multilingual recurrent neural networks, in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (Online), 2021, pp. 74-85. URL: https://aclanthology.org/2021.nodalida-main.8.
[32] A. Mueller, G. Nicolai, P. Petrou-Zeniou, N. Talmina, T. Linzen, Cross-linguistic syntactic evaluation of word prediction models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 5523-5539. URL: https://aclanthology.org/2020.acl-main.490. doi:10.18653/v1/2020.acl-main.490.
[33] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996-5001. URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493.
[34] G. I. Winata, A. Madotto, Z. Lin, R. Liu, J. Yosinski, P. Fung, Language models are few-shot multilingual learners, in: Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic, 2021, pp. 1-15. URL: https://aclanthology.org/2021.mrl-1.1. doi:10.18653/v1/2021.mrl-1.1.
[35] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, M. Johnson, XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation, in: Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 4411-4421. URL: https://proceedings.mlr.press/v119/hu20b.html.
[36] F. Philippy, S. Guo, S. Haddadan, Towards a common understanding of contributing factors for cross-lingual transfer in multilingual language models: A review, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, 2023, pp. 5877-5891. URL: https://aclanthology.org/2023.acl-long.323. doi:10.18653/v1/2023.acl-long.323.
[37] A. Lenci, Understanding natural language understanding systems, Sistemi intelligenti, Rivista quadrimestrale di scienze cognitive e di intelligenza artificiale (2023) 277-302. URL: https://www.rivisteweb.it/doi/10.1422/107438. doi:10.1422/107438.

A. Generating data from a seed file

To build the sentence data, we use the seed file that was used to generate the subject-verb agreement data. A seed, consisting of noun, prepositional and verb phrases with different grammatical numbers, can be combined to build sentences consisting of different sequences of such chunks. Table 2 includes a partial line from the seed file. To produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then constructed.

Seed line:
  Subj_sg: The computer       Subj_pl: The computers
  P1_sg: with the program     P1_pl: with the programs
  P2_sg: of the experiment    P2_pl: of the experiments
  V_sg: is broken             V_pl: are broken

Sentences with different chunk patterns:
  The computer is broken.                                          np-s vp-s
  The computers are broken.                                        np-p vp-p
  The computer with the program is broken.                         np-s pp1-s vp-s
  ...
  The computers with the programs of the experiments are broken.   np-p pp1-p pp2-p vp-p

A BLM instance:
  Context:
    The computer with the program is broken.
    The computers with the program are broken.
    The computer with the programs is broken.
    The computers with the programs are broken.
    The computer with the program of the experiment is broken.
    The computers with the program of the experiment are broken.
    The computer with the programs of the experiment is broken.
  Answer set:
    The computers with the programs of the experiment are broken.   (correct)
    The computers with the programs of the experiments are broken.
    The computers with the program of the experiment are broken.
    The computers with the program of the experiment is broken.
    ...

Table 2: A line from the seed file on top, and a set of individual sentences built from it, as well as one BLM instance.
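The expansion of one seed line into the 14 chunk patterns np [pp1 [pp2]] vp described in Section 2 can be sketched as below. The field names follow the seed-file columns of Table 2; the function is illustrative, and the actual seed files also vary the lexical material.

```python
from itertools import product

SEED = {
    "np":  {"s": "The computer",      "p": "The computers"},
    "pp1": {"s": "with the program",  "p": "with the programs"},
    "pp2": {"s": "of the experiment", "p": "of the experiments"},
    "vp":  {"s": "is broken",         "p": "are broken"},
}

def expand(seed: dict) -> dict[str, str]:
    """Map each chunk pattern (e.g. 'np-s pp1-s vp-s') to one sentence.
    pp2 may appear only if pp1 does; the verb agrees with the subject."""
    sentences = {}
    for chunks in (["np", "vp"], ["np", "pp1", "vp"], ["np", "pp1", "pp2", "vp"]):
        nominal = [c for c in chunks if c != "vp"]
        for numbers in product("sp", repeat=len(nominal)):
            nums = dict(zip(nominal, numbers))
            nums["vp"] = nums["np"]   # subject-verb agreement
            pattern = " ".join(f"{c}-{nums[c]}" for c in chunks)
            sentences[pattern] = " ".join(seed[c][nums[c]] for c in chunks) + "."
    return sentences                   # 2 + 4 + 8 = 14 patterns
```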
B. Example of data for the agreement BLM

B.1. Example of BLM instances (type I) in different languages

English - Context:
1 The owner of the parrot is coming.
2 The owners of the parrot are coming.
3 The owner of the parrots is coming.
4 The owners of the parrots are coming.
5 The owner of the parrot in the tree is coming.
6 The owners of the parrot in the tree are coming.
7 The owner of the parrots in the tree is coming.
? ???

English - Answers:
1 The owners of the parrots in the tree are coming.
2 The owners of the parrots in the trees are coming.
3 The owner of the parrots in the tree is coming.
4 The owners of the parrots in the tree are coming.
5 The owners of the parrot in the tree are coming.
6 The owners of the parrots in the trees are coming.
7 The owners of the parrots and the trees are coming.
8 The owners of the parrots in the tree in the gardens are coming.

French - Context:
1 Le proprietaire du perroquet viendra.
2 Les proprietaires du perroquet viendront.
3 Le proprietaire des perroquets viendra.
4 Les proprietaires des perroquets viendront.
5 Le proprietaire du perroquet dans l’arbre viendra.
6 Les proprietaires du perroquet dans l’arbre viendront.
7 Le proprietaire des perroquets dans l’arbre viendra.
? ???

French - Answers:
1 Les proprietaires des perroquets dans l’arbre viendront.
2 Les proprietaires des perroquets dans les arbres viendront.
3 Le proprietaire des perroquets dans l’arbre viendra.
4 Les proprietaires des perroquets dans l’arbre viendront.
5 Les proprietaires du perroquet dans l’arbre viendront.
6 Les proprietaires des perroquets dans les arbres viendront.
7 Les proprietaires des perroquets et les arbres viendront.
8 Les proprietaires des perroquets dans l’arbre des jardins viendront.

Italian - Context:
1 Il padrone del pappagallo arriverà.
2 I padroni del pappagallo arriveranno.
3 Il padrone dei pappagalli arriverà.
4 I padroni dei pappagalli arriveranno.
5 Il padrone del pappagallo sull’albero arriverà.
6 I padroni del pappagallo sull’albero arriveranno.
7 Il padrone dei pappagalli sull’albero arriverà.
? ???

Italian - Answers:
1 I padroni dei pappagalli sull’albero arriveranno.
2 I padroni dei pappagalli sugli alberi arriveranno.
3 Il padrone dei pappagalli sull’albero arriverà.
4 I padroni dei pappagalli sull’albero arriveranno.
5 I padroni del pappagallo sull’albero arriveranno.
6 I padroni dei pappagalli sugli alberi arriveranno.
7 I padroni dei pappagalli e gli alberi arriveranno.
8 I padroni dei pappagalli sull’albero dei giardini arriveranno.

Romanian - Context:
1 Posesorul papagalului va veni.
2 Posesorii papagalului vor veni.
3 Posesorul papagalilor va veni.
4 Posesorii papagalilor vor veni.
5 Posesorul papagalului din copac va veni.
6 Posesorii papagalului din copac vor veni.
7 Posesorul papagalilor din copac va veni.
? ???

Romanian - Answers:
1 Posesorii papagalilor din copac vor veni.
2 Posesorii papagalilor din copaci vor veni.
3 Posesorul papagalilor din copac va veni.
4 Posesorii papagalilor din copac vor veni.
5 Posesorii papagalului din copac vor veni.
6 Posesorii papagalilor din copaci vor veni.
7 Posesorii papagalilor și copacii vor veni.
8 Posesorii papagalilor din copac din grădini vor veni.

Figure 6: Parallel examples of a type I data instance in English, French, Italian and Romanian.

C. Results

C.1. Chunk sequence detection in sentences

train on     test on EN       FR              IT              RO
MultiLang    0.780 (0.039)    0.865 (0.036)   0.811 (0.012)   0.432 (0.025)
EN           0.975 (0.008)    0.160 (0.005)   0.141 (0.011)   0.144 (0.006)
FR           0.207 (0.018)    0.978 (0.008)   0.206 (0.016)   0.150 (0.010)
IT           0.179 (0.029)    0.372 (0.016)   0.982 (0.008)   0.161 (0.007)
RO           0.164 (0.004)    0.197 (0.021)   0.192 (0.011)   0.673 (0.038)

Table 3: Average F1 scores (standard deviation) for chunk sequence detection in sentences.
C.2. Results on the BLM Agr* data

train on     test on type_I_EN   type_I_FR       type_I_IT       type_I_RO
type_I       0.839 (0.007)       0.938 (0.011)   0.868 (0.021)   0.462 (0.023)
type_II      0.696 (0.006)       0.944 (0.003)   0.759 (0.004)   0.409 (0.031)
type_III     0.558 (0.013)       0.791 (0.026)   0.641 (0.023)   0.290 (0.027)

             type_II_EN          type_II_FR      type_II_IT      type_II_RO
type_I       0.748 (0.001)       0.873 (0.006)   0.851 (0.015)   0.448 (0.015)
type_II      0.642 (0.002)       0.871 (0.012)   0.802 (0.002)   0.394 (0.012)
type_III     0.484 (0.023)       0.760 (0.027)   0.691 (0.023)   0.299 (0.010)

             type_III_EN         type_III_FR     type_III_IT     type_III_RO
type_I       0.643 (0.003)       0.768 (0.004)   0.696 (0.022)   0.236 (0.004)
type_II      0.585 (0.010)       0.797 (0.008)   0.693 (0.009)   0.240 (0.006)
type_III     0.480 (0.026)       0.739 (0.027)   0.691 (0.017)   0.262 (0.002)

Table 4: Multilingual learning results for the BLM agreement task, as average F1 over three runs (standard deviation).

test on      train on type_I_EN  type_I_FR       type_I_IT       type_I_RO
type_I_EN    0.884 (0.002)       0.123 (0.032)   0.125 (0.046)   0.106 (0.034)
type_I_FR    0.103 (0.032)       0.948 (0.009)   0.466 (0.010)   0.164 (0.029)
type_I_IT    0.113 (0.033)       0.341 (0.018)   0.845 (0.010)   0.183 (0.021)
type_I_RO    0.113 (0.026)       0.186 (0.014)   0.188 (0.015)   0.733 (0.027)
type_II_EN   0.757 (0.015)       0.119 (0.009)   0.129 (0.029)   0.103 (0.019)
type_II_FR   0.132 (0.024)       0.868 (0.010)   0.433 (0.008)   0.187 (0.011)
type_II_IT   0.100 (0.020)       0.386 (0.016)   0.875 (0.004)   0.196 (0.009)
type_II_RO   0.088 (0.007)       0.174 (0.005)   0.173 (0.006)   0.726 (0.009)
type_III_EN  0.638 (0.025)       0.117 (0.007)   0.129 (0.028)   0.108 (0.013)
type_III_FR  0.114 (0.007)       0.820 (0.013)   0.406 (0.013)   0.169 (0.017)
type_III_IT  0.091 (0.009)       0.337 (0.016)   0.806 (0.009)   0.170 (0.013)
type_III_RO  0.086 (0.008)       0.170 (0.007)   0.174 (0.003)   0.314 (0.010)

             type_II_EN          type_II_FR      type_II_IT      type_II_RO
type_I_EN    0.772 (0.030)       0.154 (0.023)   0.103 (0.014)   0.090 (0.007)
type_I_FR    0.151 (0.006)       0.972 (0.006)   0.484 (0.015)   0.143 (0.018)
type_I_IT    0.106 (0.014)       0.417 (0.018)   0.791 (0.004)   0.151 (0.034)
type_I_RO    0.107 (0.002)       0.177 (0.020)   0.170 (0.009)   0.625 (0.014)
type_II_EN   0.670 (0.002)       0.158 (0.015)   0.106 (0.006)   0.100 (0.010)
type_II_FR   0.188 (0.009)       0.903 (0.007)   0.434 (0.010)   0.146 (0.013)
type_II_IT   0.100 (0.010)       0.448 (0.011)   0.840 (0.003)   0.152 (0.020)
type_II_RO   0.093 (0.013)       0.182 (0.008)   0.159 (0.011)   0.636 (0.006)
type_III_EN  0.620 (0.005)       0.150 (0.012)   0.116 (0.007)   0.092 (0.009)
type_III_FR  0.168 (0.007)       0.870 (0.005)   0.386 (0.008)   0.127 (0.012)
type_III_IT  0.091 (0.005)       0.387 (0.002)   0.770 (0.008)   0.132 (0.016)
type_III_RO  0.082 (0.014)       0.175 (0.007)   0.172 (0.003)   0.311 (0.017)

             type_III_EN         type_III_FR     type_III_IT     type_III_RO
type_I_EN    0.739 (0.012)       0.174 (0.023)   0.154 (0.013)   0.059 (0.009)
type_I_FR    0.160 (0.007)       0.923 (0.013)   0.434 (0.005)   0.196 (0.029)
type_I_IT    0.132 (0.011)       0.384 (0.016)   0.797 (0.009)   0.197 (0.005)
type_I_RO    0.091 (0.011)       0.164 (0.023)   0.170 (0.022)   0.280 (0.010)
type_II_EN   0.662 (0.008)       0.164 (0.009)   0.142 (0.015)   0.076 (0.010)
type_II_FR   0.202 (0.013)       0.883 (0.001)   0.454 (0.010)   0.203 (0.010)
type_II_IT   0.111 (0.004)       0.425 (0.005)   0.840 (0.002)   0.203 (0.006)
type_II_RO   0.086 (0.007)       0.158 (0.006)   0.158 (0.012)   0.379 (0.013)
type_III_EN  0.654 (0.010)       0.155 (0.006)   0.140 (0.016)   0.082 (0.007)
type_III_FR  0.183 (0.003)       0.860 (0.004)   0.431 (0.004)   0.191 (0.003)
type_III_IT  0.106 (0.003)       0.373 (0.003)   0.836 (0.005)   0.182 (0.004)
type_III_RO  0.082 (0.001)       0.156 (0.007)   0.155 (0.007)   0.353 (0.006)

Table 5: Results as average F1 (sd) over three runs, for the BLM subject-verb agreement task, in the monolingual training setting.