<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.</p>
      </abstract>
      <kwd-group>
        <kwd>syntactic information</kwd>
        <kwd>synthetic structured data</kwd>
        <kwd>multi-lingual</kwd>
        <kwd>cross-lingual</kwd>
        <kwd>diagnostic studies of deep learning models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Large language models, trained on huge amount of texts,</title>
        <p>have reached a level of performance that rivals human
capabilities on a range of established benchmarks [1].</p>
      </sec>
      <sec id="sec-1-2">
        <title>Despite high performance on high-level language pro</title>
        <p>cessing tasks, it is not yet clear what kind of information
these language models encode, and how. For example,
transformer-based pretrained models have shown
excellent performance in tasks that seem to require that the
model encodes syntactic information [2].</p>
      </sec>
      <sec id="sec-1-3">
        <title>All the knowledge that the LLMs encode comes from</title>
        <p>unstructured texts and the shallow regularities they are
very good at detecting, and which they are able to
leverage into information that correlates to higher structures
in language. Most notably, [3] have shown that from the
unstructured textual input, BERT [4] is able to infer POS,
structural, entity-related, syntactic and semantic
information at successively higher layers of the architecture,
mirroring the classical NLP pipeline [5]. We ask: How is
this information encoded in the output layer of the model,
i.e. the embeddings? Does it rely on surface information
– such as inflections, function words – and is assembled
on the demands of the task/probes [6], or does it indeed
reflect something deeper that the language model has
assembled through the progressive transformation of the
input through its many layers?</p>
        <p>To investigate this question, we use a seemingly simple
task – subject-verb agreement. Subject-verb agreement
is often used to test the syntactic abilities of deep neural
networks [7, 8, 9, 10], because, while apparently simple
and linear, it is in fact structurally, and theoretically,
complex, and requires connecting the subject and the verb
across arbitrarily long or complex structural distance.</p>
      </sec>
      <sec id="sec-1-4">
        <title>It has an added useful dimension – it relies on syntactic structure and grammatical number information that many languages share.</title>
      </sec>
      <sec id="sec-1-5">
        <title>In previous work we have shown that simple struc</title>
      </sec>
      <sec id="sec-1-6">
        <title>Context</title>
        <p>PP1-sg
PP1-sg
PP1-pl
PP1-pl
PP1-sg
PP1-sg
PP1-pl
tural information – the chunk structure of a sentence –
which can be leveraged to determine subject-verb agree- 1 NP-sg VP-sg
ment, or to contribute towards more semantic tasks, can 23 NNPP--spgl VVPP--spgl
be detected in the sentence embeddings obtained from 4 NP-pl VP-pl
a pre-trained model [11]. This result, though, does not 5 NP-sg PP2-sg VP-sg
cast light on whether the discovered structure is deeper 6 NP-pl PP2-sg VP-pl
and more abstract, or it is rather just a reflection of sur- 7 NP-sg PP2-sg VP-sg
face indicators, such as function words or morphological 8 ???
markers. Answers</p>
        <p>To tease apart these two options, we set up an experi- 1 NP-pl PP1-pl PP2-sg VP-pl Correct
ment covering four languages: English, French, Italian 2 NP-pl PP1-pl et PP2-sg VP-pl Coord
and Romanian. These languages, while diferent, have 3 NP-pl PP1-pl VP-pl WNA
shared properties that make sharing of syntactic structure 4 NP-pl PP1-sg PP1-sg VP-pl WN1
a reasonable expectation, if the pretrained multilingual 5 NP-pl PP1-pl PP2-pl VP-pl WN2
model does indeed discover and encode syntactic struc- 6 NP-pl PP1-pl PP2-pl VP-sg AEV
ture. We use parallel datasets in the four languages, built 78 NNPP--ppll PPPP11--spgl PPPP22--spgl VVPP--ssgg AAEENN12
by (approximately) translating the BLM-AgrF dataset
[12], a multiple-choice linguistic test inspired from the
Raven Progressive Matrices visual intelligence test, previ- Figure 1: BLM instances for verb-subject agreement, with
ously used to explore subject-verb agreement in French. two attractors. The errors can be grouped in two types:</p>
        <p>Our work ofers two contributions: (i) four parallel (i) sequence errors: WNA= wrong nr. of attractors; WN1=
datasets – on English, French, Italian and Romanian, fo- wrong gram. nr. for 1 attractor noun (N1); WN2= wrong
cmuusletdilionngusualbtjeesctti-nvgerobfaagmreuelmtielinntg;u(iail) pcrreotsrsa-ilninegdumaloadnedl, AgNrE1aV;mA=.aEngNrr.2efe=omarge2rnetemearertotnrrtaocetnrorotrhrneoonvuenNrb(2N;.A2E);N(i1i)=gargarmeemmaetnictaelrerorrroorns:
to explore the degree to which syntactic structure
information is shared across diferent languages. Our
crosslingual and multilingual experiments show poor transfer
across languages, even those most related, like Italian
and French. This result indicates that pretrained
models encode syntactic information based on shallow and
language-specific clues, from which they are not yet able
to take the step towards abstracting grammatical
structure. The datasets are available at https://www.idiap.ch
/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the
code at https://github.com/CLCL-Geneva/BLM-SNFDise
ntangling.</p>
      </sec>
      <sec id="sec-1-7">
        <title>Such an approach can be very useful for probing lan</title>
        <p>guage models, as it allows to test whether they indeed
detect the relevant linguistic objects and their properties,
and whether (or to what degree) they use this
information to find larger patterns. We have developed BLMs
as a linguistic test. Figure 1 illustrates the template of a</p>
      </sec>
      <sec id="sec-1-8">
        <title>BLM subject-verb agreement matrix, with the diferent</title>
        <p>linguistic objects – chunks/phrases – and their relevant
properties, in this case grammatical number. Examples
in all languages under investigation are provided in
Appendix B.</p>
      </sec>
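The BLM template described above can be made concrete with a small sketch. This is an illustration only: the pattern notation (np-sg, pp1-sg, vp-sg), the progression rule, and the two error generators below are simplified stand-ins for the dataset's actual generation rules.

```python
# Illustrative sketch of a BLM subject-verb agreement instance, using
# chunk-pattern strings as stand-ins for real sentences. The progression
# rule and error generators are simplified assumptions, not the dataset's.

def make_context():
    """Seven context patterns: subject number alternates, attractors
    (pp1, pp2) are added progressively, and the verb always agrees
    with the subject np."""
    context = []
    for i in range(7):
        n_attractors = 1 if i < 4 else 2
        subj = "sg" if i % 2 == 0 else "pl"
        chunks = [f"np-{subj}"]
        for a in range(n_attractors):
            chunks.append(f"pp{a + 1}-{'sg' if (i >> a) % 2 else 'pl'}")
        chunks.append(f"vp-{subj}")  # agreement: verb copies subject number
        context.append(" ".join(chunks))
    return context

def agrees(pattern):
    """Grammatical check: the verb's number must match the subject's."""
    chunks = pattern.split()
    return chunks[0].split("-")[1] == chunks[-1].split("-")[1]

def answer_set(correct):
    """Two of the contrastive answer types: AEV (agreement error on the
    verb, grammatically wrong) and WNA (wrong number of attractors,
    grammatical but not the right continuation of the sequence)."""
    chunks = correct.split()
    flipped = "vp-pl" if chunks[-1] == "vp-sg" else "vp-sg"
    return {"correct": correct,
            "AEV": " ".join(chunks[:-1] + [flipped]),
            "WNA": " ".join([chunks[0], chunks[-1]])}
```

A full instance would pair the seven context patterns with the complete answer set of Figure 1; only two of the error types are sketched here.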
    </sec>
    <sec id="sec-2">
      <title>2. BLM task and BLM-Agr datasets</title>
      <sec id="sec-2-1">
        <title>BLM-Agr datasets A BLM problem for subject-verb</title>
        <p>agreement consists of a context set of seven sentences
Inspired by existing IQ tests —Raven’s progressive ma- that share the subject-verb agreement phenomenon, but
trices (RPMs)— we have developed a framework, called difer in other aspects – e.g. number of linearly
intervenBlackbird Language Matrices (BLMs) [13] and several ing noun phrases between the subject and the verb (called
datasets [12, 14]. RPMs consist of a sequence of images, attractors because they can interfere with the agreement),
called the context, connected in a logical sequence by diferent grammatical numbers for these attractors, and
underlying generative rules [15]. The task is to deter- diferent clause structures. The sequence is generated
mine the missing element in this visual sequence, the by a rule of progression of number of attractors, and
answer. The candidate answers are constructed to be alternation in the grammatical number of the diferent
similar enough that the solution can be found only if the phrases. Each context is paired with a set of candidate
rules are identified correctly. answers generated from the correct answer by altering</p>
        <p>Solving an RPM problem is usually done in two steps: it to produce minimally contrastive error types. We have
(i) identify the relevant objects and their attributes; (ii) two types of errors (see Figure 1: (i) sequence errors –
decompose the main problem into subproblems, based on these candidate answers are grammatically correct, but
object and attribute identification, in a way that allows they are not the correct continuation of the sequence; (ii)
detecting the global pattern or underlying rules [16]. agreement errors – these candidate answers are
grammatically erroneous, because the verb is in agreement sibilities (e.g.,  = "np-s pp1-s vp-s"), all corresponding
with one of the intervening attractors. By constructing sentences are collected into a set .
candidate answers with such specific error types, we can The dataset consists of triples (, +, − ),
investigate the kind of information and structure learned. where  is an input sentence, + is the correct output –</p>
        <p>The seed data for French was created by manually a sentence diferent from  but with the same chunk
patcompleting data previously published data [17]. From this tern. − are  = 7 incorrect outputs, randomly
initial data, we generated a dataset that comprises three chosen from the sentences that have a chunk pattern
difsubsets of increasing lexical complexity (details in [12]): ferent from . For each language, we sample uniformly
Types I, II, III, corresponding to diferent amounts of approx. 4000 instances from the generated data based on
lexical variation within a problem instance. Each subset the pattern of the input sentence, randomly split 80:20
contains three clause structures uniformly distributed into train:test. The train part is split 80:20 into train:dev,
within the data. The dataset used here is a variation of the resulting in a 2576:630:798 split for train:dev:test.</p>
      </sec>
      <sec id="sec-2-2">
        <title>BLM-AgrF [12] that separates sequence-based from other</title>
        <p>types of errors, to be able to perform deeper analyses
into the behaviour of pretrained language models. 3. Probing the encoding of syntax</p>
        <p>The datasets in English, Italian and Romanian were
created by manually translating the seed French sentences We aim to test whether the syntactic information detected
into the other languages by native (Italian and Romanian) in multilingual pretrained sentence embeddings is based
and near-native (English) speakers. The internal struc- on shallow, language-specific clues, or whether it is more
ture in these languages is very similar, so translations are abstract structural information. Using the subject-verb
approximately parallel. The diferences lie in the treat- agreement task and the parallel datasets in four languages
ment of preposition and determiner sequences that must provides clues to the answer.
be conflated into one word in some cases in Italian and The datasets all share sentences with the same
syntacFrench, but not in English. French and Italian use number- tic structures, as illustrated in Figure 1. However, there
specific determiners and inflections, while Romanian and are language specific diferences, as in the structure of
English encode grammatical number exclusively through the chunks (noun or verb or prepositional phrases) and
inflections. In English most plural forms are marked by each language has diferent ways to encode grammatical
a sufix. Romanian has more variation, and noun inflec- number (see section 2).
tions also encode case. Determiners are separate tokens, If the grammatical information in the sentences in
which are overt indicators of grammatical number and our dataset – i.e. the sequences of chunks with specific
of phrase boundaries, whereas inflections may or may properties relevant to the subject-verb agreement task
not be tokenized separately. (Figure 1) – is an abstract form of knowledge within the</p>
        <p>Table 1 shows the datasets statistics for the four BLM pretrained model, it will be shared across languages. We
problems. After splitting each subset 90:10 into train:test would then see a high level of performance for a model
subsets, we randomly sample 2000 instances as train data. trained on one of these languages, and tested on any
20% of the train data is used for development. of the other. Additionally, when training on a dataset
consisting of data in the four languages, the model should</p>
        <p>English French Italian Romanian detect a shared parameter space that would lead to high
Type I 230 252 230 230 results when testing on data for each language.
Type II 4052 4927 4121 4571 If however the grammatical information is a reflection
Type III 4052 4810 4121 4571 of shallow language indicators, we expect to see higher
performance on languages that have overt grammatical
number and chunk indicators, such as French and Italian,
and a low rate of cross-language transfer.</p>
        <p>A sentence dataset From the seed files for each
language we build a dataset to study sentence structure
independently of a task. The seed files contain noun,
verb and prepositional phrases, with singular and plural
variations. From these chunks, we build sentences with
all (grammatically correct) combinations of np [pp1
[pp2]] vp1. For each chunk pattern  of the 14
pos</p>
      </sec>
      <sec id="sec-2-3">
        <title>1pp1 and pp2 may be included or not, pp2 may be included only if</title>
        <p>pp1 is included</p>
        <sec id="sec-2-3-1">
          <title>3.1. System architectures</title>
        </sec>
      </sec>
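The triple construction for the sentence dataset can be sketched as follows; the toy sentences and the helper name are illustrative assumptions, not the actual seed data or code.

```python
import random

# Sketch of the sentence-dataset construction: sentences are grouped by
# chunk pattern p into sets S_p; an instance (s, s_plus, negatives) pairs
# an input sentence with a same-pattern positive and N different-pattern
# negatives. Data and function names are assumptions for illustration.

def build_instances(sentences_by_pattern, n_negatives=7, seed=0):
    rng = random.Random(seed)
    instances = []
    for pattern, sents in sentences_by_pattern.items():
        # candidates whose chunk pattern differs from the input's
        others = [s for p, ss in sentences_by_pattern.items()
                  if p != pattern for s in ss]
        for s in sents:
            s_plus = rng.choice([x for x in sents if x != s])
            negatives = rng.sample(others, min(n_negatives, len(others)))
            instances.append((s, s_plus, negatives))
    return instances
```

In the setup described above, the sampled instances are then split 80:20 into train:test, and the train part again 80:20 into train:dev.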
      <sec id="sec-2-4">
        <title>A sentence-level VAE To test whether chunk struc</title>
        <p>ture can be detected in sentence embeddings we use a</p>
      </sec>
      <sec id="sec-2-5">
        <title>VAE-like system, which encodes a sentence, and decodes</title>
        <p>a diferent sentence with the same chunk structure,
using a set of contrastive negative examples – sentences
that have diferent chunk structures from the input – to
encourage the latent to encode the chunk structure.</p>
        <p>The architecture of the sentence-level VAE is similar to and several incorrect answers . Every sentence is
a previously proposed system [18]: the encoder consists embedded using the pretrained model. To simplify the
of a CNN layer with a 15x15 kernel, which is applied to a discussion, in the sections that follows, when we say
32x24-shaped sentence embedding, followed by a linear sentence we actually mean its embedding.
layer that compresses the output of the CNN into a latent The two-level VAE system takes a BLM instance as
layer of size 5. The decoder mirrors the encoder. input, decomposes its context sequence  into sentences</p>
        <p>An instance consists of a triple (, +, − ), and passes them individually as input to the
sentencewhere  is an input sentence with embedding  level VAE. For each sentence  ∈ , the system builds
and chunk structure , + is a sentence with embed- on-the-fly the candidate answers for the sentence level:
ding + with same chunk structure , and − = the same sentence  from input is used as the correct
{| = 1, } is a set of  = 7 sentences output, and a random selection of sentences from  are
with embeddings  , each with chunk pattern diferent the negative answers. After an instance is processed by
from  (and diferent from each other). The input  the sentence level, for each sentence  ∈ , we obtain its
is encoded into latent representation , from which we representation from the latent layer  , and reassemble
sample a vector ˜, which is decoded into the output ˆ. the input sequence as  = [ ], and pass it as
To encourage the latent to encode the structure of the in- input to the task-level VAE. The loss function combines
put sentence we use a max-margin loss function, to push the losses on the two levels – a max-margin loss on the
for a higher similarity score for ˆ with the sentence sentence level that contrasts the sentence reconstructed
that has the same chunk pattern as the input (+ ) than on the sentence level with the correct answer and the
the ones that do not. At prediction time, the sentence erroneous ones, and a max-margin loss on the task level
from the {+} ∪ − options that has the highest that contrasts the answer constructed by the decoder
score relative to the decoded answer is taken as correct. with the answer set of the BLM instance (details in [11]).
Two-level VAE for BLMs We use a two-level system 3.2. Experiments
illustrated in Figure 2, which separates the solving of
the BLM task on subject-verb agreement into two steps: To explore how syntactic information – in particular
(i) compress sentence embeddings into a representation chunk structure – is encoded, we perform cross-language
that captures the sentence chunk structure and the rele- and multi-language experiments, using rfist the sentences
vant chunk properties (on the sentence level) (ii) use the dataset, and then the BLM agreement task. We report F1
compressed sentence representations to solve the BLM averages over three runs.
agreement problems, by detecting the pattern across the Cross-lingual experiments – train on data from one
lansequence of structures (on the task level). This archi- guage, test on all the others – show whether patterns
detecture will allow us to test whether sentence structure tected in sentence embeddings that encode chunk
struc– in terms of chunks – is shared across languages in a ture are transferable across languages. The results on
pretrained multilingual model. testing on the same language as the training provide
support for the experimental set-up – the high results show
that the pretrained language model used does encode the
necessary information, and the system architecture is
adequate to distill it.</p>
      </sec>
      <sec id="sec-2-6">
        <title>The multilingual experiments, where we learn a model</title>
        <p>from data in all the languages, will provide additional
clues – if the performance on testing on individual
languages is comparable to when training on each language
alone, it means some information is shared across
languages and can be beneficial.</p>
        <p>All reported experiments use Electra [19]2, with the
sentence representations the embedding of the [CLS]
token (details in [11]).</p>
      </sec>
      <sec id="sec-2-7">
        <title>An instance for a BLM problem consists of an ordered</title>
        <p>context sequence  of sentences,  = {| = 1, 7} as
input, and an answer set  with one correct answer ,</p>
      </sec>
      <sec id="sec-2-8">
        <title>2Electra pretrained model: google/electra-base-discriminator</title>
        <p>3.2.1. Syntactic structure in sentences
We use only the sentence level of the system illustrated
in Figure 2 to explore chunk structure in sentences, using
the data described in Section 2. For the cross-lingual
experiments, the training dataset for each language is
used to train a model that is then tested on each test
set. For the multilingual setup, we assemble a common
training data from the training data for all languages.
3.2.2. Solving the BLM agreement task
We solve the BLM agreement task using the two-level
system, where a compacted sentence representation learned
on the sentence level should help detect patterns in the
input sequence of a BLM instance. Because the datasets
are parallel, with shared sentence and sequence patterns,
we test whether the added learning signal from the task
level can help push the system to learn to map an input
sentence into a representation that captures structure
shared across languages. We perform cross-lingual
experiments, where a model is trained on data from one
language, and tested on all the test sets, and a
multilingual experiment, where for each type I/II/III data, we
assemble a training dataset from the training sets of the
same type from the other languages. The model is then
tested on the separate test sets.</p>
        <sec id="sec-2-8-1">
          <title>3.3. Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-2-9">
        <title>For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plot, but it is reported in the results Tables in Appendix C.</title>
      </sec>
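The max-margin objective and the prediction rule used on the sentence level can be sketched numerically; the margin value, the vector sizes, and the function names below are assumptions for illustration, not the system's actual code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def max_margin_loss(decoded, positive, negatives, margin=0.5):
    """Hinge loss pushing the decoded embedding to score higher with the
    same-pattern sentence than with each different-pattern negative."""
    return sum(max(0.0, margin - cosine(decoded, positive) + cosine(decoded, n))
               for n in negatives)

def predict(decoded, candidates):
    """Prediction: the candidate most similar to the decoded output wins."""
    return int(np.argmax([cosine(decoded, c) for c in candidates]))
```

The same contrastive shape is reused on the task level, with the decoder's output compared against the BLM answer set instead of the sentence candidates.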
    </sec>
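The cross-lingual and multilingual protocols described above reduce to a simple train/test grid, sketched below with placeholder train and evaluate functions (illustrative stand-ins, not the actual training code).

```python
# Sketch of the experimental grid: one model per training language is
# evaluated on every language's test set; a "multi" model is trained on
# the concatenation of all training sets. train()/evaluate() are
# illustrative stand-ins for the VAE training and F1 evaluation.

LANGS = ["en", "fr", "it", "ro"]

def run_grid(train, evaluate, data):
    scores = {}
    for src in LANGS:                      # cross-lingual: train on one...
        model = train(data[src]["train"])
        for tgt in LANGS:                  # ...test on all languages
            scores[(src, tgt)] = evaluate(model, data[tgt]["test"])
    multi = train([x for lang in LANGS for x in data[lang]["train"]])
    for tgt in LANGS:                      # multilingual training setup
        scores[("multi", tgt)] = evaluate(multi, data[tgt]["test"])
    return scores
```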
    <sec id="sec-3">
      <title>4. Results</title>
      <p>Structure in sentences: Figure 3 shows the results of the experiments on detecting chunk structure in sentence embeddings, in the cross-lingual and multilingual training setups, for comparison (detailed results in Table 3). Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance – indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations – there is very little transfer effect; a slight effect is detected for the model learned on Italian and tested on French; (ii) learning using multilingual training data leads to a deterioration of the performance compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus that this information is encoded differently in the languages under study.</p>
      <p>An additional interesting insight comes from the analysis of the latent layer representations. Figure 4 shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages. Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that each language seems to have its own quite separate pattern clusters.</p>
      <sec id="sec-3-2">
        <p>Discussion and related work: Pretrained language models are learned from shallow cooccurrences through
a lexical prediction task. The input information is
transformed through several transformer layers, various parts
boosting each other through self-attention. Analyses of
the architecture of transformer models, like BERT [4],
have localised and followed the flow of specific types
of linguistic information through the system [20, 3], to
the degree that the classical NLP pipeline seems to be
reflected in the succession of the model’s layers. Analysis
of contextualized token embeddings shows that they can
encode specific linguistic information, such as sentence
structure [21] (including in a multilingual set-up [22]),
predicate argument structure [23], subjecthood and
objecthood [24], among others. Sentence embeddings have
also been probed using classifiers, and determined to
encode specific types of linguistic information, such as
subject-verb agreement [9], word order, tree depth,
constituent information [25], auxiliaries [26] and argument
structure [27].</p>
        <p>Generative models like LLAMA seem to use English as
the latent language in the middle layers [28], while other
analyses of internal model parameters have led to
uncovering language-agnostic and language-specific networks
of parameters [29], or neurons encoding cross-language
number agreement information across several internal
layers [30]. It has also been shown that subject-verb
agreement information is not shared by BiLSTM
models [31] or multilingual BERT [32]. Testing the degree
to which word/sentence embeddings are multilingual
has usually been done using a classification probe, for
tasks like NER, POS tagging [33], language identification
[34], or more complex tasks like question answering and
sentence retrieval [35]. There are contradictory results
on various cross-lingual model transfers, some of which
can be explained by factors such as domain and size of
training data, typological closeness of languages [36], or
by the power of the classification probes. Generative or
classification probes do not provide insights into whether
the pretrained model finds deeper regularities and
encodes abstract structures, or the predictions are based on
shallower features that the probe used assembles for the
specific test it is used for [37, 6].</p>
        <p>We aimed to answer this question by using a
multilingual setup, and a simple syntactic structure detection
task in an indirectly supervised setting. The datasets
used – in English, French, Italian and Romanian – are
(approximately) lexically parallel, and are parallel in
syntactic structure. The property of interest is grammatical
number, and the task is subject-verb agreement. The
languages chosen share commonalities – French, Italian
and Romanian are all Romance languages, English and
French share much lexical material – but there are also
differences: French and Italian use a similar manner to
encode grammatical number, mainly through articles that
can also signal phrase boundaries. English has a very
limited form of nominal plural morphology, but determiners
are useful for signaling phrase boundaries. In Romanian,
number is expressed through inflection, suffixation and
case, and articles are also often expressed through specific
suffixes, thus overt phrase boundaries are less common
than in French, Italian and English. These
commonalities and diferences help us interpret the results, and
provide clues on how the targeted syntactic information
is encoded.</p>
        <p>Previous experiments have shown that syntactic
information – chunk sequences and their properties – can be
accessed in transformer-based pretrained sentence
embeddings [11]. In this multilingual setup, we test whether
this information has been identified based on
language-specific shallow features, or whether the system has
uncovered and encoded more abstract structures.</p>
        <p>The low rate of transfer for the monolingual training
setup and the decreased performance for the multilingual
training setup for both our experimental configurations
indicate that the chunk sequence information is language
specific and is assembled by the system based on shallow
features. Further clues come from the fact that the only
transfer happens between French and Italian, which
encode phrases and grammatical number in a very similar
manner. Embedding the sentence structure detection into
a larger system, where it receives an additional learning
signal (shared across languages) does not help to push
towards finding a shared sentence representation space
that encodes in a uniform manner the sentence structure
shared across languages.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <sec id="sec-4-1">
        <title>We have aimed to add some evidence to the question</title>
        <p>How do state-of-the-art systems ≪ know≫ what they
≪ know≫ ? [37] by projecting the subject-verb
agreement problem in a multilingual space. We chose
languages that share syntactic structures, and have
particular diferences that can provide clues about whether
the models learned rely on shallower indicators, or the
pretrained models encode deeper knowledge. Our
experiments show that pretrained language models do not
encode abstract syntactic structures, but rather this
information is assembled "upon request" – by the probe or task
– based on language-specific indicators. Understanding
how information is encoded in large language models can
help determine the next necessary step towards making
language models truly deep.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Acknowledgments</title>
        <p>We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through the SNF Advanced Grant TMAG1_209426 to PM.</p>
        <p>test., Psychological review 97 (1990) 404.</p>
        <p>[17] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in French and English: The role of syntactic hierarchy, Language and Cognitive Processes 17 (2002) 371–404.</p>
        <p>[18] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: B. Can, M. Mozes, S. Cahyawijaya, N. Saphra, N. Kassner, S. Ravfogel, A. Ravichander, C. Zhao, I. Augenstein, A. Rogers, K. Cho, E. Grefenstette, L. Voita (Eds.), Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 22–39. URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.</p>
        <p>[19] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.</p>
        <p>[20] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman, D. Das, et al., What do you learn from context? Probing for sentence structure in contextualized word representations, in: The Seventh International Conference on Learning Representations (ICLR), 2019.</p>
        <p>[24] I. Papadimitriou, E. A. Chi, R. Futrell, K. Mahowald, Deep subjecthood: Higher-order grammatical features in multilingual BERT, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 2522–2532. URL: https://aclanthology.org/2021.eacl-main.215. doi:10.18653/v1/2021.eacl-main.215.</p>
        <p>[25] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single $&amp;!#* vector: Probing sentence embeddings for linguistic properties, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2126–2136. URL: https://aclanthology.org/P18-1198. doi:10.18653/v1/P18-1198.</p>
        <p>[26] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg, Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conpp. 235–249. ference Track Proceedings, OpenReview.net, 2017.
[21] J. Hewitt, C. D. Manning, A structural probe for URL: https://openreview.net/f orum?id=BJh6Ztuxl.
ifnding syntax in word representations, in: Proceed- [27] M. Wilson, J. Petty, R. Frank, How abstract is
linings of the 2019 Conference of the North American guistic generalization in large language models?
exChapter of the Association for Computational Lin- periments with argument structure, Transactions
guistics: Human Language Technologies, Volume of the Association for Computational Linguistics
1 (Long and Short Papers), Association for Compu- 11 (2023) 1377–1395. URL: https://aclanthology.org
tational Linguistics, Minneapolis, Minnesota, 2019, /2023.tacl-1.78. doi:10.1162/tacl_a_00608.
pp. 4129–4138. URL: https://aclanthology.org/N19 [28] C. Wendler, V. Veselovsky, G. Monea, R. West,
-1419. doi:10.18653/v1/N19-1419. Do llamas work in English? on the latent
lan[22] E. A. Chi, J. Hewitt, C. D. Manning, Finding univer- guage of multilingual transformers, in: L.-W.
sal grammatical relations in multilingual BERT, in: Ku, A. Martins, V. Srikumar (Eds.), Proceedings
D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), of the 62nd Annual Meeting of the Association
Proceedings of the 58th Annual Meeting of the As- for Computational Linguistics (Volume 1: Long
Pasociation for Computational Linguistics, Associa- pers), Association for Computational Linguistics,
tion for Computational Linguistics, Online, 2020, Bangkok, Thailand, 2024, pp. 15366–15394. URL:
pp. 5564–5577. URL: https://aclanthology.org/2020. https://aclanthology.org/2024.acl- long.820.
acl-main.493. doi:10.18653/v1/2020.acl-mai doi:10.18653/v1/2024.acl-long.820.
n.493. [29] T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang,
[23] S. Conia, E. Barba, A. Scirè, R. Navigli, Semantic role X. Zhao, F. Wei, J.-R. Wen, Language-specific
neulabeling meets definition modeling: Using natural rons: The key to multilingual capabilities in large
language to describe predicate-argument structures, language models, in: L.-W. Ku, A. Martins, V.
Srikuin: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Find- mar (Eds.), Proceedings of the 62nd Annual Meeting
ings of the Association for Computational Linguis- of the Association for Computational Linguistics
tics: EMNLP 2022, Association for Computational (Volume 1: Long Papers), Association for
CompuLinguistics, Abu Dhabi, United Arab Emirates, 2022, tational Linguistics, Bangkok, Thailand, 2024, pp.
pp. 4253–4270. URL: https://aclanthology.org/202 5701–5715. URL: https://aclanthology.org/2024.ac
2.findings-emnlp.313. doi: 10.18653/v1/2022.f l-long.309. doi:10.18653/v1/2024.acl-long.
indings-emnlp.313. 309.
[30] A. G. de Varda, M. Marelli, Data-driven cross- guistics (Volume 1: Long Papers), Association for
lingual syntax: An agreement study with massively Computational Linguistics, Toronto, Canada, 2023,
multilingual models, Computational Linguistics 49 pp. 5877–5891. URL: https://aclanthology.org/2023.
(2023) 261–299. URL: https://aclanthology.org/2023. acl-long.323. doi:10.18653/v1/2023.acl-lon
cl-2.1. doi:10.1162/coli_a_00472. g.323.
[31] P. Dhar, A. Bisazza, Understanding cross-lingual [37] A. Lenci, Understanding natural language
unsyntactic transfer in multilingual recurrent neural derstanding systems, Sistemi intelligenti, Rivista
networks, in: S. Dobnik, L. Øvrelid (Eds.), Proceed- quadrimestrale di scienze cognitive e di intelligenza
ings of the 23rd Nordic Conference on Computa- artificiale (2023) 277–302. URL: https://www.rivi
tional Linguistics (NoDaLiDa), Linköping Univer- steweb.it/doi/10.1422/107438. doi:10.1422/1074
sity Electronic Press, Sweden, Reykjavik, Iceland 38.
(Online), 2021, pp. 74–85. URL: https://aclantholo
gy.org/2021.nodalida-main.8.
[32] A. Mueller, G. Nicolai, P. Petrou-Zeniou, N. Talmina,</p>
      </sec>
      <sec id="sec-4-3">
        <title>T. Linzen, Cross-linguistic syntactic evaluation of</title>
        <p>word prediction models, in: D. Jurafsky, J. Chai,
N. Schluter, J. Tetreault (Eds.), Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, Association for
Computational Linguistics, Online, 2020, pp. 5523–5539. URL:
https://aclanthology.org/2020.acl- main.490.
doi:10.18653/v1/2020.acl-main.490.
[33] T. Pires, E. Schlinger, D. Garrette, How
multilingual is multilingual BERT?, in: A. Korhonen,
D. Traum, L. Màrquez (Eds.), Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, Association for
Computational Linguistics, Florence, Italy, 2019, pp. 4996–
5001. URL: https://aclanthology.org/P19- 1493.
doi:10.18653/v1/P19-1493.
[34] G. I. Winata, A. Madotto, Z. Lin, R. Liu, J. Yosinski,</p>
      </sec>
      <sec id="sec-4-4">
        <title>P. Fung, Language models are few-shot multilin</title>
        <p>gual learners, in: D. Ataman, A. Birch, A. Conneau,</p>
      </sec>
      <sec id="sec-4-5">
        <title>O. Firat, S. Ruder, G. G. Sahin (Eds.), Proceedings of</title>
        <p>the 1st Workshop on Multilingual Representation</p>
      </sec>
      <sec id="sec-4-6">
        <title>Learning, Association for Computational Linguis</title>
        <p>tics, Punta Cana, Dominican Republic, 2021, pp.</p>
      </sec>
      <sec id="sec-4-7">
        <title>1–15. URL: https://aclanthology.org/2021.mrl-1.1.</title>
        <p>doi:10.18653/v1/2021.mrl-1.1.
[35] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat,</p>
      </sec>
      <sec id="sec-4-8">
        <title>M. Johnson, XTREME: A massively multilingual</title>
        <p>multi-task benchmark for evaluating cross-lingual
generalisation, in: H. D. III, A. Singh (Eds.),
Proceedings of the 37th International Conference on</p>
      </sec>
      <sec id="sec-4-9">
        <title>Machine Learning, volume 119 of Proceedings of</title>
        <p>Machine Learning Research, PMLR, 2020, pp. 4411–
4421. URL: https://proceedings.mlr.press/v119/hu2
0b.html.
[36] F. Philippy, S. Guo, S. Haddadan, Towards a
common understanding of contributing factors for
cross-lingual transfer in multilingual language
models: A review, in: A. Rogers, J. Boyd-Graber,</p>
      </sec>
      <sec id="sec-4-10">
        <title>N. Okazaki (Eds.), Proceedings of the 61st Annual</title>
      </sec>
      <sec id="sec-4-11">
        <title>Meeting of the Association for Computational Lin</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Generating data from a seed file</title>
      <p>To build the sentence data, we use the seed file that was used to generate the subject-verb agreement data. A seed
consists of noun, prepositional and verb phrases with different grammatical numbers, which can be combined to build
sentences consisting of different sequences of such chunks. Table 2 shows a partial line from the seed file. To
produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then
constructed.</p>
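<p>The combination step can be sketched as follows. This is a minimal illustration: the seed values, chunk names and pattern format below are assumptions for the sake of the example, not the contents of the actual seed file.</p>

```python
# Sketch of seed-based generation: a seed provides noun, prepositional
# and verb phrases in singular and plural; combining them according to
# chunk patterns yields sentences with different chunk sequences.
from itertools import product

# Assumed seed format (illustrative values only):
seed = {
    "NP": {"sg": "the cat", "pl": "the cats"},
    "PP1": {"sg": "from the garden", "pl": "from the gardens"},
    "VP": {"sg": "sleeps", "pl": "sleep"},
}

def generate(seed, patterns):
    """Build sentences for every chunk pattern and number assignment.

    Each pattern is a list of chunk names. The subject NP and the VP
    must agree in grammatical number, while intervening prepositional
    phrases (potential agreement attractors) vary freely.
    """
    sentences = []
    for pattern in patterns:
        attractors = [c for c in pattern if c.startswith("PP")]
        for subj_num in ("sg", "pl"):
            for nums in product(("sg", "pl"), repeat=len(attractors)):
                assignment = dict(zip(attractors, nums))
                assignment["NP"] = subj_num
                assignment["VP"] = subj_num  # enforce subject-verb agreement
                chunks = [seed[c][assignment[c]] for c in pattern]
                sentences.append(" ".join(chunks))
    return sentences

sents = generate(seed, [["NP", "VP"], ["NP", "PP1", "VP"]])
# The two patterns yield 2 + 4 = 6 sentences, e.g.
# "the cat from the gardens sleeps" (singular subject, plural attractor).
```

<p>Because the attractor phrases vary independently of the subject, the generated paradigm contains exactly the contrasts needed to test whether a model tracks the subject or a closer, misleading noun.</p>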
    </sec>
    <sec id="sec-6">
      <title>B. Example of data for the agreement BLM</title>
      <p>B.1. Example of BLM instances (type I) in different languages</p>
      <sec id="sec-6-1">
        <title>C.2. Results on the BLM Agr* data</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Results</title>
      <sec id="sec-7-1">
        <title>C.1. Chunk sequence detection in sentences</title>
        <p>[Table: average F1 scores (standard deviations in parentheses) for chunk sequence detection, per BLM type (I, II, III) and language (EN, FR, IT, RO).]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>