<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vivi</forename><surname>Nastase</surname></persName>
							<email>nastase@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chunyang</forename><surname>Jiang</surname></persName>
							<email>chunyang.jiang42@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Samo</surname></persName>
							<email>giuseppe.samo@idiap.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paola</forename><surname>Merlo</surname></persName>
							<email>paola.merlo@unige.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C8A3E542D00EC21D66316FAED0131086</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>syntactic information</term>
					<term>synthetic structured data</term>
					<term>multi-lingual</term>
					<term>cross-lingual</term>
					<term>diagnostic studies of deep learning models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon - subject-verb agreement across a variety of sentence structures - in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps - detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences - we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.</p><p>This work asks whether multilingual pretrained language models capture abstract linguistic representations that are valid across many languages. Our approach develops curated synthetic data on a large scale, with specific properties, and uses it to study sentence representations built with pretrained language models. We use a new multiple-choice task and its associated data, the Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon - subject-verb agreement - in several languages. Finding the correct solution to this task requires a system that detects complex linguistic patterns and paradigms in textual representations. 
Using a two-level architecture that solves the problem in two phases - it first learns the syntactic objects and their properties in individual sentences, and then extracts the elements they have in common - we show that, despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models exhibit language-specific differences, and that syntactic structure is not shared, even between typologically very close languages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large language models, trained on huge amounts of text, have reached a level of performance that rivals human capabilities on a range of established benchmarks <ref type="bibr" target="#b0">[1]</ref>. Despite high performance on high-level language processing tasks, it is not yet clear what kind of information these language models encode, and how. For example, transformer-based pretrained models have shown excellent performance on tasks that seem to require that the model encode syntactic information <ref type="bibr" target="#b1">[2]</ref>.</p><p>All the knowledge that LLMs encode comes from unstructured texts and the shallow regularities they are very good at detecting, which they are able to leverage into information that correlates with higher-level structures in language. Most notably, <ref type="bibr" target="#b2">[3]</ref> have shown that from the unstructured textual input, BERT <ref type="bibr" target="#b3">[4]</ref> is able to infer POS, structural, entity-related, syntactic and semantic information at successively higher layers of the architecture, mirroring the classical NLP pipeline <ref type="bibr" target="#b4">[5]</ref>. We ask: How is this information encoded in the output layer of the model, i.e. the embeddings? Does it rely on surface information - such as inflections and function words - assembled on demand by the task/probes <ref type="bibr" target="#b5">[6]</ref>, or does it reflect something deeper that the language model has built through the progressive transformation of the input through its many layers?</p><p>To investigate this question, we use a seemingly simple task - subject-verb agreement. 
Subject-verb agreement is often used to test the syntactic abilities of deep neural networks <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> because, while apparently simple and linear, it is in fact structurally and theoretically complex, and requires connecting the subject and the verb across arbitrarily long or complex structural distance. It has an added useful dimension: it relies on syntactic structure and grammatical number information that many languages share.</p><p>In previous work we have shown that simple structural information - the chunk structure of a sentence - which can be leveraged to determine subject-verb agreement, or to contribute towards more semantic tasks, can be detected in the sentence embeddings obtained from a pretrained model <ref type="bibr" target="#b10">[11]</ref>. This result, though, does not cast light on whether the discovered structure is deeper and more abstract, or rather just a reflection of surface indicators, such as function words or morphological markers.</p><p>To tease apart these two options, we set up an experiment covering four languages: English, French, Italian and Romanian. These languages, while different, have shared properties that make sharing of syntactic structure a reasonable expectation, if the pretrained multilingual model does indeed discover and encode syntactic structure. 
We use parallel datasets in the four languages, built by (approximately) translating the BLM-AgrF dataset <ref type="bibr" target="#b11">[12]</ref>, a multiple-choice linguistic test inspired by the Raven Progressive Matrices visual intelligence test, previously used to explore subject-verb agreement in French.</p><p>Our work offers two contributions: (i) four parallel datasets - in English, French, Italian and Romanian - focused on subject-verb agreement; (ii) cross-lingual and multilingual testing of a multilingual pretrained model, to explore the degree to which syntactic structure information is shared across different languages. Our cross-lingual and multilingual experiments show poor transfer across languages, even closely related ones like Italian and French. This result indicates that pretrained models encode syntactic information based on shallow and language-specific clues, from which they are not yet able to take the step towards abstracting grammatical structure. The datasets are available at https://www.idiap.ch/dataset/(blm-agre|blm-agrf|blm-agri|blm_agrr) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">BLM task and BLM-Agr datasets</head><p>Inspired by existing IQ tests - Raven's progressive matrices (RPMs) - we have developed a framework called Blackbird Language Matrices (BLMs) <ref type="bibr" target="#b12">[13]</ref> and several datasets <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b13">14]</ref>. RPMs consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules <ref type="bibr" target="#b14">[15]</ref>. The task is to determine the missing element in this visual sequence, the answer. The candidate answers are constructed to be similar enough that the solution can be found only if the rules are identified correctly.</p><p>Solving an RPM problem is usually done in two steps: (i) identify the relevant objects and their attributes; (ii) decompose the main problem into subproblems, based on object and attribute identification, in a way that allows detecting the global pattern or underlying rules <ref type="bibr" target="#b15">[16]</ref>. Such an approach can be very useful for probing language models, as it allows us to test whether they indeed detect the relevant linguistic objects and their properties, and whether (or to what degree) they use this information to find larger patterns. We have developed BLMs as a linguistic test. Figure <ref type="figure">1</ref> illustrates the template of a BLM subject-verb agreement matrix, with the different linguistic objects - chunks/phrases - and their relevant properties, in this case grammatical number. Examples in all languages under investigation are provided in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-Agr datasets</head><p>A BLM problem for subject-verb agreement consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects - e.g. the number of linearly intervening noun phrases between the subject and the verb (called attractors because they can interfere with the agreement), the grammatical numbers of these attractors, and the clause structures. The sequence is generated by a rule of progression in the number of attractors, and alternation in the grammatical number of the different phrases. Each context is paired with a set of candidate answers generated from the correct answer by altering it to produce minimally contrastive error types. We have two types of errors (see Figure <ref type="figure">1</ref>): (i) sequence errors - these candidate answers are grammatically correct, but they are not the correct continuation of the sequence; (ii) agreement errors - these candidate answers are grammatically erroneous, because the verb is in agreement with one of the intervening attractors. By constructing candidate answers with such specific error types, we can investigate the kind of information and structure learned.</p><p>The seed data for French was created by manually completing previously published data <ref type="bibr" target="#b16">[17]</ref>. From this initial data, we generated a dataset that comprises three subsets of increasing lexical complexity (details in <ref type="bibr" target="#b11">[12]</ref>): Types I, II, III, corresponding to different amounts of lexical variation within a problem instance. Each subset contains three clause structures uniformly distributed within the data. 
The dataset used here is a variation of the BLM-AgrF dataset <ref type="bibr" target="#b11">[12]</ref> that separates sequence-based errors from other types of errors, to enable deeper analyses of the behaviour of pretrained language models.</p><p>The datasets in English, Italian and Romanian were created by manually translating the seed French sentences into the other languages by native (Italian and Romanian) and near-native (English) speakers. The internal structure in these languages is very similar, so translations are approximately parallel. The differences lie in the treatment of preposition and determiner sequences, which in some cases must be conflated into one word in Italian and French, but not in English. French and Italian use number-specific determiners and inflections, while Romanian and English encode grammatical number exclusively through inflections. In English most plural forms are marked by a suffix. Romanian has more variation, and noun inflections also encode case. Determiners are separate tokens, which are overt indicators of grammatical number and of phrase boundaries, whereas inflections may or may not be tokenized separately.</p><p>Table <ref type="table" target="#tab_1">1</ref> shows the dataset statistics for the four BLM problems. After splitting each subset 90:10 into train:test subsets, we randomly sample 2000 instances as train data. 20% of the train data is used for development.</p><p>A sentence dataset. From the seed files for each language we build a dataset to study sentence structure independently of a task. The seed files contain noun, verb and prepositional phrases, with singular and plural variations. From these chunks, we build sentences with all (grammatically correct) combinations of np [pp1 [pp2]] vp<ref type="foot" target="#foot_0">1</ref>. For each chunk pattern 𝑝 of the 14 possibilities (e.g., 𝑝 = "np-s pp1-s vp-s"), all corresponding sentences are collected into a set 𝑆𝑝.</p></div>
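The split procedure described above (90:10 train:test, 2000 sampled training instances, 20% of those held out for development) can be sketched as follows. This is an illustrative sketch, not the authors' code; the instance ids and pool size are placeholders.

```python
import random

# Placeholder pool of instance ids (the real data are BLM instances).
random.seed(1)
instances = list(range(5000))
random.shuffle(instances)

# 90:10 split into a training pool and a held-out test set.
cut = int(0.9 * len(instances))
train_pool, test = instances[:cut], instances[cut:]

# Sample 2000 training instances; 20% of them become the dev set.
sampled = random.sample(train_pool, 2000)
n_dev = int(0.2 * len(sampled))
dev, train = sampled[:n_dev], sampled[n_dev:]
```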
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dataset consists of triples (𝑖𝑛, 𝑜𝑢𝑡 + , 𝑂𝑢𝑡 − ), where 𝑖𝑛 is an input sentence and 𝑜𝑢𝑡 + is the correct output - a sentence different from 𝑖𝑛 but with the same chunk pattern. 𝑂𝑢𝑡 − are 𝑁𝑛𝑒𝑔𝑠 = 7 incorrect outputs, randomly chosen from the sentences that have a chunk pattern different from 𝑖𝑛. For each language, we sample uniformly approx. 4000 instances from the generated data based on the pattern of the input sentence, randomly split 80:20 into train:test. The train part is split 80:20 into train:dev, resulting in a 2576:630:798 split for train:dev:test.</p></div>
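The triple construction just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the chunk-pattern names and sentences are invented placeholders, and the number of negatives is reduced for readability.

```python
import random

random.seed(0)

# Hypothetical sentences grouped by chunk pattern (placeholders).
sentences_by_pattern = {
    "np-s vp-s": ["The cat sleeps.", "The dog barks."],
    "np-p vp-p": ["The cats sleep.", "The dogs bark."],
    "np-s pp1-s vp-s": ["The cat near the door sleeps."],
}

def build_triples(by_pattern, n_negs=7):
    """For each sentence `in`, pick a different sentence with the same
    chunk pattern as out+, and n_negs sentences with other patterns as Out-."""
    triples = []
    for pattern, sents in by_pattern.items():
        others = [s for p, ss in by_pattern.items() if p != pattern for s in ss]
        for s in sents:
            positives = [t for t in sents if t != s]
            if not positives or not others:
                continue  # the pattern needs another sentence, plus negatives
            out_pos = random.choice(positives)
            out_neg = random.sample(others, k=min(n_negs, len(others)))
            triples.append((s, out_pos, out_neg))
    return triples

triples = build_triples(sentences_by_pattern, n_negs=3)
```

With this toy data, the single-sentence pattern yields no triple (it has no same-pattern positive), while each two-sentence pattern yields one triple per sentence.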
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Probing the encoding of syntax</head><p>We aim to test whether the syntactic information detected in multilingual pretrained sentence embeddings is based on shallow, language-specific clues, or whether it is more abstract structural information. Using the subject-verb agreement task and the parallel datasets in four languages provides clues to the answer.</p><p>The datasets all share sentences with the same syntactic structures, as illustrated in Figure <ref type="figure">1</ref>. However, there are language-specific differences, such as the structure of the chunks (noun, verb or prepositional phrases), and each language has different ways of encoding grammatical number (see Section 2).</p><p>If the grammatical information in the sentences of our dataset - i.e. the sequences of chunks with specific properties relevant to the subject-verb agreement task (Figure <ref type="figure">1</ref>) - is an abstract form of knowledge within the pretrained model, it will be shared across languages. We would then see a high level of performance for a model trained on one of these languages and tested on any of the others. Additionally, when training on a dataset consisting of data in the four languages, the model should detect a shared parameter space that would lead to high results when testing on data for each language.</p><p>If, however, the grammatical information is a reflection of shallow language indicators, we expect to see higher performance on languages that have overt grammatical number and chunk indicators, such as French and Italian, and a low rate of cross-language transfer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">System architectures</head><p>A sentence-level VAE. To test whether chunk structure can be detected in sentence embeddings, we use a VAE-like system which encodes a sentence and decodes a different sentence with the same chunk structure, using a set of contrastive negative examples - sentences that have different chunk structures from the input - to encourage the latent to encode the chunk structure.</p><p>The architecture of the sentence-level VAE is similar to a previously proposed system <ref type="bibr" target="#b17">[18]</ref>: the encoder consists of a CNN layer with a 15x15 kernel, which is applied to a 32x24-shaped sentence embedding, followed by a linear layer that compresses the output of the CNN into a latent layer of size 5. The decoder mirrors the encoder.</p><p>An instance consists of a triple (𝑖𝑛, 𝑜𝑢𝑡 + , 𝑂𝑢𝑡 − ), where 𝑖𝑛 is an input sentence with embedding 𝑒𝑖𝑛 and chunk structure 𝑝, 𝑜𝑢𝑡 + is a sentence with embedding 𝑒 𝑜𝑢𝑡 + and the same chunk structure 𝑝, and 𝑂𝑢𝑡 − = {𝑠 𝑘 |𝑘 = 1, 𝑁𝑛𝑒𝑔𝑠} is a set of 𝑁𝑛𝑒𝑔𝑠 = 7 sentences with embeddings 𝑒𝑠 𝑘 , each with a chunk pattern different from 𝑝 (and from each other). The input 𝑒𝑖𝑛 is encoded into latent representation 𝑧𝑖, from which we sample a vector 𝑧 ˜𝑖, which is decoded into the output 𝑒 ˆ𝑖𝑛. To encourage the latent to encode the structure of the input sentence, we use a max-margin loss function that pushes the similarity score of 𝑒 ˆ𝑖𝑛 with the sentence that has the same chunk pattern as the input (𝑒 𝑜𝑢𝑡 + ) to be higher than with the ones that do not. At prediction time, the sentence from the {𝑜𝑢𝑡 + } ∪ 𝑂𝑢𝑡 − options that has the highest score relative to the decoded answer is taken as correct.</p></div>
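A max-margin objective of the kind described above can be sketched as follows. This is a schematic numpy version, not the paper's implementation: the function names, the cosine scoring, and the margin value are our assumptions, and the embeddings are random placeholders.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(e_hat, e_pos, e_negs, margin=1.0):
    """Hinge loss: the decoded embedding e_hat should score higher with the
    positive (same chunk pattern) than with each negative, by `margin`."""
    pos = cosine(e_hat, e_pos)
    return sum(max(0.0, margin - pos + cosine(e_hat, n)) for n in e_negs)

# Placeholder embedding; with a perfect positive and an opposite negative,
# the hinge is inactive and the loss is zero.
rng = np.random.default_rng(0)
e_hat = rng.normal(size=8)
loss_same = max_margin_loss(e_hat, e_hat, [-e_hat], margin=1.0)
```

In the real system the loss is backpropagated through the decoder and encoder, so the latent is pushed to retain exactly the information that separates chunk patterns.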
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Two-level VAE for BLMs</head><p>We use a two-level system illustrated in Figure <ref type="figure" target="#fig_0">2</ref>, which separates the solving of the BLM task on subject-verb agreement into two steps: (i) compress sentence embeddings into a representation that captures the sentence chunk structure and the relevant chunk properties (on the sentence level); (ii) use the compressed sentence representations to solve the BLM agreement problems, by detecting the pattern across the sequence of structures (on the task level). This architecture allows us to test whether sentence structure - in terms of chunks - is shared across languages in a pretrained multilingual model. All reported experiments use Electra <ref type="bibr" target="#b18">[19]</ref> (pretrained model: google/electra-base-discriminator), with the embedding of the [CLS] token as the sentence representation (details in <ref type="bibr" target="#b10">[11]</ref>).</p><p>An instance for a BLM problem consists of an ordered context sequence 𝑆 of sentences, 𝑆 = {𝑠𝑖|𝑖 = 1, 7}, as input, and an answer set 𝐴 with one correct answer 𝑎𝑐 and several incorrect answers 𝑎𝑒𝑟𝑟. Every sentence is embedded using the pretrained model. To simplify the discussion, in the sections that follow, when we say sentence we actually mean its embedding.</p><p>The two-level VAE system takes a BLM instance as input, decomposes its context sequence 𝑆 into sentences and passes them individually as input to the sentence-level VAE. For each sentence 𝑠𝑖 ∈ 𝑆, the system builds on the fly the candidate answers for the sentence level: the same sentence 𝑠𝑖 from the input is used as the correct output, and a random selection of sentences from 𝑆 are the negative answers. 
After an instance is processed by the sentence level, for each sentence 𝑠𝑖 ∈ 𝑆, we obtain its representation from the latent layer 𝑙𝑠 𝑖 , and reassemble the input sequence as 𝑆 𝑙 = 𝑠𝑡𝑎𝑐𝑘[𝑙𝑠 𝑖 ], and pass it as input to the task-level VAE. The loss function combines the losses on the two levels -a max-margin loss on the sentence level that contrasts the sentence reconstructed on the sentence level with the correct answer and the erroneous ones, and a max-margin loss on the task level that contrasts the answer constructed by the decoder with the answer set of the BLM instance (details in <ref type="bibr" target="#b10">[11]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Experiments</head><p>To explore how syntactic information - in particular chunk structure - is encoded, we perform cross-language and multi-language experiments, first using the sentences dataset, and then the BLM agreement task. We report F1 averages over three runs.</p><p>Cross-lingual experiments - train on data from one language, test on all the others - show whether the patterns detected in sentence embeddings that encode chunk structure are transferable across languages. The results when testing on the same language as training provide support for the experimental set-up: the high scores show that the pretrained language model does encode the necessary information, and that the system architecture is adequate to distill it.</p><p>The multilingual experiments, where we learn a model from data in all the languages, provide additional clues: if the performance when testing on individual languages is comparable to that of training on each language alone, it means some information is shared across languages and can be beneficial.</p></div>
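The cross-lingual protocol just described amounts to a train-on-one, test-on-all grid. The sketch below shows its shape only; `train_model` and `evaluate` are hypothetical placeholders standing in for the real training and evaluation, and the scores are invented stand-ins, not results from the paper.

```python
LANGS = ["en", "fr", "it", "ro"]

def train_model(train_lang):
    # placeholder: returns a tag instead of a trained model
    return {"trained_on": train_lang}

def evaluate(model, test_lang):
    # placeholder scores: high on the training language, low elsewhere
    return 0.95 if model["trained_on"] == test_lang else 0.20

# Cross-lingual grid: one model per source language, tested on every language.
grid = {src: {tgt: evaluate(train_model(src), tgt) for tgt in LANGS}
        for src in LANGS}

# Multilingual setting: a single model trained on the union of all languages,
# tested separately on each language's test set.
multi_model = train_model("all")
multi_scores = {tgt: evaluate(multi_model, tgt) for tgt in LANGS}
```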
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Syntactic structure in sentences</head><p>We use only the sentence level of the system illustrated in Figure <ref type="figure" target="#fig_0">2</ref> to explore chunk structure in sentences, using the data described in Section 2. For the cross-lingual experiments, the training dataset for each language is used to train a model that is then tested on each test set. For the multilingual setup, we assemble a common training set from the training data for all languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Solving the BLM agreement task</head><p>We solve the BLM agreement task using the two-level system, where a compacted sentence representation learned on the sentence level should help detect patterns in the input sequence of a BLM instance. Because the datasets are parallel, with shared sentence and sequence patterns, we test whether the added learning signal from the task level can help push the system to learn to map an input sentence into a representation that captures structure shared across languages. We perform cross-lingual experiments, where a model is trained on data from one language and tested on all the test sets, and a multilingual experiment, where for each of the type I/II/III data we assemble a training dataset from the training sets of the same type in all the languages. The model is then tested on the separate test sets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation</head><p>For each training set we build three models, and plot the average F1 score. The standard deviation is very small, so we do not include it in the plot; it is reported in the result tables in Appendix C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Structure in sentences. Figure <ref type="figure" target="#fig_1">3</ref> shows the results for the experiments on detecting chunk structure in sentence embeddings, in cross-lingual and multilingual training setups, for comparison (detailed results in Table <ref type="table" target="#tab_3">3</ref>). Two observations are relevant to our investigation: (i) while training and testing on the same language leads to good performance - indicating that Electra sentence embeddings do contain relevant information about chunks, and that the system does detect the chunk pattern in these representations - there is very little transfer effect. A slight effect is detected for the model learned on Italian and tested on French; (ii) learning using multilingual training data leads to a deterioration in performance, compared to learning in a monolingual setting. This again indicates that the system could not detect a shared parameter space for the information that is being learned, the chunk structure, and thus this information is encoded differently in the languages under study.</p><p>An additional interesting insight comes from the analysis of the latent layer representations. Figure <ref type="figure" target="#fig_2">4</ref> shows the tSNE projection of the latent representations of the sentences in the training data after multilingual training. Different colours show different chunk patterns, and different markers show different languages ("x" for French, "+" for Italian, "*" for Romanian). Had the information encoding syntactic structure been shared, the clusters for the same pattern in the different languages would overlap. Instead, we note that while representations cluster by pattern, the clusters for the different languages are disjoint: each language has its own quite separate pattern clusters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Structure in sentences for the BLM agreement task</head><p>When the sentence structure detection is embedded in the system for solving the BLM agreement task, where an additional supervision signal comes from the task, we note a similar result to when processing the sentences individually. Figure <ref type="figure" target="#fig_3">5</ref> shows the results for the multilingual and monolingual training setups for the type I data. Complete results are in Tables 4-5 in the appendix.</p><p>Discussion and related work. Pretrained language models are learned from shallow co-occurrences through a lexical prediction task. The input information is transformed through several transformer layers, various parts boosting each other through self-attention. Analyses of the architecture of transformer models like BERT <ref type="bibr" target="#b3">[4]</ref> have localised and followed the flow of specific types of linguistic information through the system <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b2">3]</ref>, to the degree that the classical NLP pipeline seems to be reflected in the succession of the model's layers. Analysis of contextualized token embeddings shows that they can encode specific linguistic information, such as sentence structure <ref type="bibr" target="#b20">[21]</ref> (including in a multilingual set-up <ref type="bibr" target="#b21">[22]</ref>), predicate argument structure <ref type="bibr" target="#b22">[23]</ref>, subjecthood and objecthood <ref type="bibr" target="#b23">[24]</ref>, among others. 
Sentence embeddings have also been probed using classifiers, and determined to encode specific types of linguistic information, such as subject-verb agreement <ref type="bibr" target="#b8">[9]</ref>, word order, tree depth, constituent information <ref type="bibr" target="#b24">[25]</ref>, auxiliaries <ref type="bibr" target="#b25">[26]</ref> and argument structure <ref type="bibr" target="#b26">[27]</ref>.</p><p>Generative models like LLAMA seem to use English as the latent language in the middle layers <ref type="bibr" target="#b27">[28]</ref>, while other analyses of internal model parameters have uncovered language-agnostic and language-specific networks of parameters <ref type="bibr" target="#b29">[29]</ref>, or neurons encoding cross-language number agreement information across several internal layers <ref type="bibr" target="#b30">[30]</ref>. It has also been shown that subject-verb agreement information is not shared by BiLSTM models <ref type="bibr" target="#b31">[31]</ref> or multilingual BERT <ref type="bibr" target="#b32">[32]</ref>. Testing the degree to which word/sentence embeddings are multilingual has usually been done using a classification probe, for tasks like NER and POS tagging <ref type="bibr" target="#b33">[33]</ref>, language identification <ref type="bibr" target="#b34">[34]</ref>, or more complex tasks like question answering and sentence retrieval <ref type="bibr" target="#b35">[35]</ref>. There are contradictory results on various cross-lingual model transfers, some of which can be explained by factors such as the domain and size of the training data and the typological closeness of the languages <ref type="bibr" target="#b36">[36]</ref>, or by the power of the classification probes. 
Generative or classification probes do not provide insights into whether the pretrained model finds deeper regularities and encodes abstract structures, or whether the predictions are based on shallower features that the probe assembles for the specific test it is used for <ref type="bibr" target="#b37">[37,</ref><ref type="bibr" target="#b5">6]</ref>.</p><p>We aimed to answer this question by using a multilingual setup, and a simple syntactic structure detection task in an indirectly supervised setting. The datasets used - in English, French, Italian and Romanian - are (approximately) lexically parallel, and are parallel in syntactic structure. The property of interest is grammatical number, and the task is subject-verb agreement. The languages chosen share commonalities - French, Italian and Romanian are all Romance languages, and English and French share much lexical material - but there are also differences: French and Italian encode grammatical number in a similar manner, mainly through articles that can also signal phrase boundaries. English has a very limited form of nominal plural morphology, but determiners are useful for signaling phrase boundaries. In Romanian, number is expressed through inflection, suffixation and case, and articles are also often expressed through specific suffixes, so overt phrase boundaries are less common than in French, Italian and English. These commonalities and differences help us interpret the results, and provide clues on how the targeted syntactic information is encoded.</p><p>Previous experiments have shown that syntactic information - chunk sequences and their properties - can be accessed in transformer-based pretrained sentence embeddings <ref type="bibr" target="#b10">[11]</ref>. 
In this multilingual setup, we test whether this information has been identified based on language-specific shallow features, or whether the system has uncovered and encoded more abstract structures.</p><p>The low rate of transfer in the monolingual training setup, and the decreased performance in the multilingual training setup, in both our experimental configurations, indicate that the chunk sequence information is language specific and is assembled by the system from shallow features. A further clue comes from the fact that the only transfer happens between French and Italian, which encode phrases and grammatical number in a very similar manner. Embedding sentence structure detection into a larger system, where it receives an additional learning signal shared across languages, does not push the model towards a shared sentence representation space that uniformly encodes the sentence structure common to all the languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>We have aimed to add evidence to the question How do state-of-the-art systems «know» what they «know»? <ref type="bibr" target="#b37">[37]</ref> by projecting the subject-verb agreement problem into a multilingual space. We chose languages that share syntactic structures but also have specific differences, which can provide clues about whether the learned models rely on shallow indicators or whether the pretrained models encode deeper knowledge. Our experiments show that pretrained language models do not encode abstract syntactic structures; rather, this information is assembled "upon request" -by the probe or task -from language-specific indicators. Understanding how information is encoded in large language models can help determine the next necessary steps towards making language models truly deep.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1. Chunk sequence detection in sentences</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A two-level VAE: the sentence level learns to compress a sentence into a representation useful to solve the BLM problem on the task level.</figDesc><graphic coords="4,89.29,492.43,204.18,72.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Cross-language testing for detecting chunk structure in sentence embeddings.</figDesc><graphic coords="5,89.29,447.09,208.34,110.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: tSNE projection of the latent representation of sentences from the training data, coloured by their chunk pattern. Different markers indicate the languages: "o" for English, "x" for French, "+" for Italian, "*" for Romanian. We note that while representations cluster by the pattern, the clusters for different languages are disjoint.</figDesc><graphic coords="5,295.16,148.67,218.28,166.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Average F1 performance on training on type I data over three runs -cross-language and multi-language</figDesc><graphic coords="6,85.83,101.03,211.81,107.58" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Parallel examples of a type I data instance in English, French, Italian and Romanian</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Test data statistics, per language (EN, FR, IT, RO). The amount of training data is always 2000 instances.</figDesc><table><row><cell></cell><cell>EN</cell><cell>FR</cell><cell>IT</cell><cell>RO</cell></row><row><cell>Type I</cell><cell>230</cell><cell>252</cell><cell>230</cell><cell>230</cell></row><row><cell>Type II</cell><cell>4052</cell><cell>4927</cell><cell>4121</cell><cell>4571</cell></row><row><cell>Type III</cell><cell>4052</cell><cell>4810</cell><cell>4121</cell><cell>4571</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Average F1 scores (standard deviation) for sentence chunk detection in sentences</figDesc><table><row><cell cols="3">C.2. Results on the BLM Agr* data</cell><cell></cell><cell></cell><cell></cell></row><row><cell>train on</cell><cell>test on</cell><cell>type_I_EN</cell><cell>type_I_FR</cell><cell>type_I_IT</cell><cell>type_I_RO</cell></row><row><cell>type_I</cell><cell></cell><cell>0.839 (0.007)</cell><cell>0.938 (0.011)</cell><cell cols="2">0.868 (0.021) 0.462 (0.023)</cell></row><row><cell>type_II</cell><cell></cell><cell>0.696 (0.006)</cell><cell>0.944 (0.003)</cell><cell>0.759 (0.004)</cell><cell>0.409 (0.031)</cell></row><row><cell>type_III</cell><cell></cell><cell>0.558 (0.013)</cell><cell>0.791 (0.026)</cell><cell>0.641 (0.023)</cell><cell>0.290 (0.027)</cell></row><row><cell></cell><cell></cell><cell>type_II_EN</cell><cell>type_II_FR</cell><cell>type_II_IT</cell><cell>type_II_RO</cell></row><row><cell>type_I</cell><cell></cell><cell cols="4">0.748 (0.001) 0.873 (0.006) 0.851 (0.015) 0.448 (0.015)</cell></row><row><cell>type_II</cell><cell></cell><cell>0.642 (0.002)</cell><cell>0.871 (0.012)</cell><cell>0.802 (0.002)</cell><cell>0.394 (0.012)</cell></row><row><cell>type_III</cell><cell></cell><cell>0.484 (0.023)</cell><cell>0.760 (0.027)</cell><cell>0.691 (0.023)</cell><cell>0.299 (0.010)</cell></row><row><cell></cell><cell></cell><cell>type_III_EN</cell><cell>type_III_FR</cell><cell>type_III_IT</cell><cell>type_III_RO</cell></row><row><cell>type_I</cell><cell></cell><cell>0.643 (0.003)</cell><cell></cell><cell>(0.022)</cell><cell>0.236 (0.004)</cell></row><row><cell>type_II</cell><cell></cell><cell>0.585 (0.010)</cell><cell>0.797 (0.008)</cell><cell>0.693 (0.009)</cell><cell>0.240 (0.006)</cell></row><row><cell>type_III</cell><cell></cell><cell>0.480 (0.026)</cell><cell>0.739 (0.027)</cell><cell>0.691 (0.017)</cell><cell>0.262 
(0.002)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Multilingual learning results for the BLM agreement task in terms of average F1 over three runs, and standard deviation.</figDesc><table><row><cell>test on</cell><cell>train on</cell><cell>type_I_EN</cell><cell>type_I_FR</cell><cell>type_I_IT</cell><cell>type_I_RO</cell></row><row><cell>type_I_EN</cell><cell></cell><cell>0.884 (0.002)</cell><cell>0.123 (0.032)</cell><cell>0.125 (0.046)</cell><cell>0.106 (0.034)</cell></row><row><cell>type_I_FR</cell><cell></cell><cell>0.103 (0.032)</cell><cell>0.948 (0.009)</cell><cell>0.466 (0.010)</cell><cell>0.164 (0.029)</cell></row><row><cell>type_I_IT</cell><cell></cell><cell>0.113 (0.033)</cell><cell>0.341 (0.018)</cell><cell>0.845 (0.010)</cell><cell>0.183 (0.021)</cell></row><row><cell>type_I_RO</cell><cell></cell><cell>0.113 (0.026)</cell><cell>0.186 (0.014)</cell><cell>0.188 (0.015)</cell><cell>0.733 (0.027)</cell></row><row><cell>type_II_EN</cell><cell></cell><cell>0.757 (0.015)</cell><cell>0.119 (0.009)</cell><cell>0.129 (0.029)</cell><cell>0.103 (0.019)</cell></row><row><cell>type_II_FR</cell><cell></cell><cell>0.132 (0.024)</cell><cell>0.868 (0.010)</cell><cell>0.433 (0.008)</cell><cell>0.187 (0.011)</cell></row><row><cell>type_II_IT</cell><cell></cell><cell>0.100 (0.020)</cell><cell>0.386 (0.016)</cell><cell>0.875 (0.004)</cell><cell>0.196 (0.009)</cell></row><row><cell>type_II_RO</cell><cell></cell><cell>0.088 (0.007)</cell><cell>0.174 (0.005)</cell><cell>0.173 (0.006)</cell><cell>0.726 (0.009)</cell></row><row><cell>type_III_EN</cell><cell></cell><cell>0.638 (0.025)</cell><cell>0.117 (0.007)</cell><cell>0.129 (0.028)</cell><cell>0.108 (0.013)</cell></row><row><cell>type_III_FR</cell><cell></cell><cell>0.114 (0.007)</cell><cell>0.820 (0.013)</cell><cell>0.406 (0.013)</cell><cell>0.169 (0.017)</cell></row><row><cell>type_III_IT</cell><cell></cell><cell>0.091 (0.009)</cell><cell>0.337 
(0.016)</cell><cell>0.806 (0.009)</cell><cell>0.170 (0.013)</cell></row><row><cell>type_III_RO</cell><cell></cell><cell>0.086 (0.008)</cell><cell>0.170 (0.007)</cell><cell>0.174 (0.003)</cell><cell>0.314 (0.010)</cell></row><row><cell></cell><cell></cell><cell>type_II_EN</cell><cell>type_II_FR</cell><cell>type_II_IT</cell><cell>type_II_RO</cell></row><row><cell>type_I_EN</cell><cell></cell><cell>0.772 (0.030)</cell><cell>0.154 (0.023)</cell><cell>0.103 (0.014)</cell><cell>0.090 (0.007)</cell></row><row><cell>type_I_FR</cell><cell></cell><cell>0.151 (0.006)</cell><cell>0.972 (0.006)</cell><cell>0.484 (0.015)</cell><cell>0.143 (0.018)</cell></row><row><cell>type_I_IT</cell><cell></cell><cell>0.106 (0.014)</cell><cell>0.417 (0.018)</cell><cell>0.791 (0.004)</cell><cell>0.151 (0.034)</cell></row><row><cell>type_I_RO</cell><cell></cell><cell>0.107 (0.002)</cell><cell>0.177 (0.020)</cell><cell>0.170 (0.009)</cell><cell>0.625 (0.014)</cell></row><row><cell>type_II_EN</cell><cell></cell><cell>0.670 (0.002)</cell><cell>0.158 (0.015)</cell><cell>0.106 (0.006)</cell><cell>0.100 (0.010)</cell></row><row><cell>type_II_FR</cell><cell></cell><cell>0.188 (0.009)</cell><cell>0.903 (0.007)</cell><cell>0.434 (0.010)</cell><cell>0.146 (0.013)</cell></row><row><cell>type_II_IT</cell><cell></cell><cell>0.100 (0.010)</cell><cell>0.448 (0.011)</cell><cell>0.840 (0.003)</cell><cell>0.152 (0.020)</cell></row><row><cell>type_II_RO</cell><cell></cell><cell>0.093 (0.013)</cell><cell>0.182 (0.008)</cell><cell>0.159 (0.011)</cell><cell>0.636 (0.006)</cell></row><row><cell>type_III_EN</cell><cell></cell><cell>0.620 (0.005)</cell><cell>0.150 (0.012)</cell><cell>0.116 (0.007)</cell><cell>0.092 (0.009)</cell></row><row><cell>type_III_FR</cell><cell></cell><cell>0.168 (0.007)</cell><cell>0.870 (0.005)</cell><cell>0.386 (0.008)</cell><cell>0.127 (0.012)</cell></row><row><cell>type_III_IT</cell><cell></cell><cell>0.091 (0.005)</cell><cell>0.387 (0.002)</cell><cell>0.770 (0.008)</cell><cell>0.132 
(0.016)</cell></row><row><cell>type_III_RO</cell><cell></cell><cell>0.082 (0.014)</cell><cell>0.175 (0.007)</cell><cell>0.172 (0.003)</cell><cell>0.311 (0.017)</cell></row><row><cell></cell><cell></cell><cell>type_III_EN</cell><cell>type_III_FR</cell><cell>type_III_IT</cell><cell>type_III_RO</cell></row><row><cell>type_I_EN</cell><cell></cell><cell>0.739 (0.012)</cell><cell>0.174 (0.023)</cell><cell>0.154 (0.013)</cell><cell>0.059 (0.009)</cell></row><row><cell>type_I_FR</cell><cell></cell><cell>0.160 (0.007)</cell><cell>0.923 (0.013)</cell><cell>0.434 (0.005)</cell><cell>0.196 (0.029)</cell></row><row><cell>type_I_IT</cell><cell></cell><cell>0.132 (0.011)</cell><cell>0.384 (0.016)</cell><cell>0.797 (0.009)</cell><cell>0.197 (0.005)</cell></row><row><cell>type_I_RO</cell><cell></cell><cell>0.091 (0.011)</cell><cell>0.164 (0.023)</cell><cell>0.170 (0.022)</cell><cell>0.280 (0.010)</cell></row><row><cell>type_II_EN</cell><cell></cell><cell>0.662 (0.008)</cell><cell>0.164 (0.009)</cell><cell>0.142 (0.015)</cell><cell>0.076 (0.010)</cell></row><row><cell>type_II_FR</cell><cell></cell><cell>0.202 (0.013)</cell><cell>0.883 (0.001)</cell><cell>0.454 (0.010)</cell><cell>0.203 (0.010)</cell></row><row><cell>type_II_IT</cell><cell></cell><cell>0.111 (0.004)</cell><cell>0.425 (0.005)</cell><cell>0.840 (0.002)</cell><cell>0.203 (0.006)</cell></row><row><cell>type_II_RO</cell><cell></cell><cell>0.086 (0.007)</cell><cell>0.158 (0.006)</cell><cell>0.158 (0.012)</cell><cell>0.379 (0.013)</cell></row><row><cell>type_III_EN</cell><cell></cell><cell>0.654 (0.010)</cell><cell>0.155 (0.006)</cell><cell>0.140 (0.016)</cell><cell>0.082 (0.007)</cell></row><row><cell>type_III_FR</cell><cell></cell><cell>0.183 (0.003)</cell><cell>0.860 (0.004)</cell><cell>0.431 (0.004)</cell><cell>0.191 (0.003)</cell></row><row><cell>type_III_IT</cell><cell></cell><cell>0.106 (0.003)</cell><cell>0.373 (0.003)</cell><cell>0.836 (0.005)</cell><cell>0.182 
(0.004)</cell></row><row><cell>type_III_RO</cell><cell></cell><cell>0.082 (0.001)</cell><cell>0.156 (0.007)</cell><cell>0.155 (0.007)</cell><cell>0.353 (0.006)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5</head><label>5</label><figDesc>Results as average F1 (sd) over three runs, for the BLM subject-verb agreement task, in the monolingual training setting.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">pp1 and pp2 are optional; pp2 may be included only if pp1 is also included.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through grant SNF Advanced grant TMAG-1_209426 to PM.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Generating data from a seed file</head><p>To build the sentence data, we use the seed file that was used to generate the subject-verb agreement data. A seed consists of noun, prepositional and verb phrases with different grammatical numbers, which can be combined to build sentences containing different sequences of such chunks. Table <ref type="table">2</ref> includes a partial line from the seed file. To produce the data in the four languages, we translate the seed file, from which the sentences and BLM data are then constructed. The computers with the programs of the experiment are broken.</p><p>The computers with the programs of the experiments are broken.</p><p>The computers with the program of the experiment are broken.</p><p>The computers with the program of the experiment is broken. ...</p></div>
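The combination step described above can be sketched as follows. This is an illustrative sketch, not the authors' actual generation code: the seed fields, the example forms, and the slot names (subj, pp1, pp2, verb) are hypothetical, chosen to mirror the "computers/programs/experiment" example in Table 2 and the constraint from footnote 1 that pp1 and pp2 are optional, with pp2 allowed only if pp1 is present.

```python
from itertools import product

# Hypothetical seed: each chunk slot offers singular and plural alternatives.
seed = {
    "subj": {"sg": "The computer", "pl": "The computers"},
    "pp1":  {"sg": "with the program", "pl": "with the programs"},
    "pp2":  {"sg": "of the experiment", "pl": "of the experiments"},
    "verb": {"sg": "is broken", "pl": "are broken"},
}

def generate(seed):
    """Combine chunk alternatives into sentences with different chunk
    sequences; pp1 and pp2 are optional, pp2 only when pp1 is included."""
    sentences = []
    for pp_slots in ([], ["pp1"], ["pp1", "pp2"]):
        slots = ["subj"] + pp_slots + ["verb"]
        # every combination of grammatical numbers over the chosen slots
        for numbers in product(("sg", "pl"), repeat=len(slots)):
            text = " ".join(seed[s][n] for s, n in zip(slots, numbers)) + "."
            # the sentence is grammatical iff subject and verb numbers agree
            sentences.append((text, numbers[0] == numbers[-1]))
    return sentences
```

One seed line thus yields 4 + 8 + 16 = 28 sentences across the three chunk sequences, including ungrammatical ones such as "The computers with the program of the experiment is broken.", which serve as contrastive answer candidates.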
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>A line from the seed file on top, and a set of individual sentences built from it, as well as one BLM instance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Example of data for the agreement BLM B.1. Example of BLM instances (type I) in different languages</head><p>English -Context 1 The owner of the parrot is coming. 2 The owners of the parrot are coming. 3 The owner of the parrots is coming. 4 The owners of the parrots are coming. 5 The owner of the parrot in the tree is coming. 6 The owners of the parrot in the tree are coming. 7 The owner of the parrots in the tree is coming. ? ??? English -Answers 1 The owners of the parrots in the tree are coming. 2 The owners of the parrots in the trees are coming. 3 The owner of the parrots in the tree is coming. 4 The owners of the parrots in the tree are coming. 5 The owners of the parrot in the tree are coming. 6 The owners of the parrots in the trees are coming. 7 The owners of the parrots and the trees are coming. ? The owners of the parrots in the tree in the gardens are coming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>French</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Superglue: A stickier benchmark for general-purpose language understanding systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pruksachatkun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Alché-Buc</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">32</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Emergent linguistic structure in artificial neural networks trained by self-supervision</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">117</biblScope>
			<biblScope unit="page" from="30046" to="30054" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A primer in BERTology: What we know about how BERT works</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kovaleva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00349</idno>
		<ptr target="https://aclanthology.org/2020.tacl-1.54" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="842" to="866" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT rediscovers the classical NLP pipeline</title>
		<author>
			<persName><forename type="first">I</forename><surname>Tenney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pavlick</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1452</idno>
		<ptr target="https://aclanthology.org/P19-1452" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Traum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4593" to="4601" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Designing and interpreting probes with control tasks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1275</idno>
		<ptr target="https://aclanthology.org/D19-1275" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2733" to="2743" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Assessing the ability of LSTMs to learn syntax-sensitive dependencies</title>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dupoux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00115</idno>
		<ptr target="https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00115" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association of Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="521" to="535" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Colorless green recurrent networks dream hierarchically</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gulordava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1108</idno>
		<ptr target="http://aclweb.org/anthology/N18-1108" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1195" to="1205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Assessing BERT&apos;s syntactic abilities</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.05287</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Syntactic structure from deep learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<idno type="DOI">10.1146/annurev-linguistics-032020-051035</idno>
	</analytic>
	<monogr>
		<title level="j">Annual Review of Linguistics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="195" to="212" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Are there identifiable structural parts in the sentence embedding whole?</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on analyzing and interpreting neural networks for NLP (BlackBoxNLP)</title>
				<meeting>the Workshop on analyzing and interpreting neural networks for NLP (BlackBoxNLP)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.eacl-main.99" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1363" to="1374" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.11444</idno>
		<idno>ArXiv cs.CL 2306.11444</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.11444" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Standardization of progressive matrices</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Raven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">British Journal of Medical Psychology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="137" to="150" />
			<date type="published" when="1938">1938</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">What one intelligence test measures: a theoretical account of the processing in the raven progressive matrices test</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Carpenter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Just</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological Review</title>
		<imprint>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page">404</biblScope>
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Subject-verb agreement errors in French and English: The role of syntactic hierarchy</title>
		<author>
			<persName><forename type="first">J</forename><surname>Franck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vigliocco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nicol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language and Cognitive Processes</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="371" to="404" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Grammatical information in BERT sentence embeddings as two-dimensional arrays</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.repl4nlp-1.3</idno>
		<ptr target="https://aclanthology.org/2023.repl4nlp-1.3" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Can</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Mozes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Cahyawijaya</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Saphra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Kassner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Ravfogel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ravichander</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Zhao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Voita</surname></persName>
		</editor>
		<meeting>the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="22" to="39" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ELECTRA: Pre-training text encoders as discriminators rather than generators</title>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-T</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="18" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">What do you learn from context? probing for sentence structure in contextualized word representations</title>
		<author>
			<persName><forename type="first">I</forename><surname>Tenney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poliak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Mccoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Durme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Seventh International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="235" to="249" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A structural probe for finding syntax in word representations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1419</idno>
		<ptr target="https://aclanthology.org/N19-1419" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4129" to="4138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Finding universal grammatical relations in multilingual BERT</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.493</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.493" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5564" to="5577" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Semantic role labeling meets definition modeling: Using natural language to describe predicate-argument structures</title>
		<author>
			<persName><forename type="first">S</forename><surname>Conia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Barba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Scirè</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.findings-emnlp.313</idno>
		<ptr target="https://aclanthology.org/2022.findings-emnlp.313" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2022</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting><address><addrLine>Abu Dhabi, United Arab Emirates</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="4253" to="4270" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Deep subjecthood: Higher-order grammatical features in multilingual BERT</title>
		<author>
			<persName><forename type="first">I</forename><surname>Papadimitriou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Futrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mahowald</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.215</idno>
		<ptr target="https://aclanthology.org/2021.eacl-main.215" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</editor>
		<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2522" to="2532" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">What you can cram into a single $&amp;!#* vector: Probing sentence embeddings for linguistic properties</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kruszewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P18-1198</idno>
		<ptr target="https://aclanthology.org/P18-1198" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Miyao</surname></persName>
		</editor>
		<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2126" to="2136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Fine-grained analysis of sentence embeddings using auxiliary prediction tasks</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Adi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kermany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=BJh6Ztuxl" />
	</analytic>
	<monogr>
		<title level="m">5th International Conference on Learning Representations, ICLR 2017</title>
		<title level="s">Conference Track Proceedings</title>
		<meeting><address><addrLine>Toulon, France; OpenReview</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">April 24-26, 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">How abstract is linguistic generalization in large language models? experiments with argument structure</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Petty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frank</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00608</idno>
		<ptr target="https://aclanthology.org/2023.tacl-1.78" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="1377" to="1395" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Wendler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Veselovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Monea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>West</surname></persName>
		</author>
		<title level="m">Do llamas work in English? On the latent language of multilingual transformers</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<idno type="DOI">10.18653/v1/2024.acl-long.820</idno>
		<ptr target="https://aclanthology.org/2024.acl-long.820" />
		<title level="m">Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting>the 62nd Annual Meeting of the Association for Computational Linguistics<address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="15366" to="15394" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Language-specific neurons: The key to multilingual capabilities in large language models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.acl-long.309</idno>
		<ptr target="https://aclanthology.org/2024.acl-long.309" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting>the 62nd Annual Meeting of the Association for Computational Linguistics<address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5701" to="5715" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Data-driven crosslingual syntax: An agreement study with massively multilingual models</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>De Varda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marelli</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli_a_00472</idno>
		<ptr target="https://aclanthology.org/2023.cl-2.1" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="261" to="299" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Understanding cross-lingual syntactic transfer in multilingual recurrent neural networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bisazza</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.nodalida-main.8" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Dobnik</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Øvrelid</surname></persName>
		</editor>
		<meeting>the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)<address><addrLine>Reykjavik, Iceland (Online)</addrLine></address></meeting>
		<imprint>
			<publisher>Linköping University Electronic Press</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="74" to="85" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Cross-linguistic syntactic evaluation of word prediction models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nicolai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Petrou-Zeniou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Talmina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.490</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.490" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5523" to="5539" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">How multilingual is multilingual BERT?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garrette</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1493</idno>
		<ptr target="https://aclanthology.org/P19-1493" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Traum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4996" to="5001" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Language models are few-shot multilingual learners</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">I</forename><surname>Winata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.mrl-1.1</idno>
		<ptr target="https://aclanthology.org/2021.mrl-1.1" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Workshop on Multilingual Representation Learning</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Ataman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Sahin</surname></persName>
		</editor>
		<meeting>the 1st Workshop on Multilingual Representation Learning<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v119/hu20b.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Daumé III</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</editor>
		<meeting>the 37th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="4411" to="4421" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Towards a common understanding of contributing factors for cross-lingual transfer in multilingual language models: A review</title>
		<author>
			<persName><forename type="first">F</forename><surname>Philippy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Haddadan</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.323</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.323" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5877" to="5891" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Understanding natural language understanding systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<idno type="DOI">10.1422/107438</idno>
		<ptr target="https://www.rivisteweb.it/doi/10.1422/107438" />
	</analytic>
	<monogr>
		<title level="j">Sistemi intelligenti. Rivista quadrimestrale di scienze cognitive e di intelligenza artificiale</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="277" to="302" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
