<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual vs. monolingual transformer models in encoding linguistic structure and lexical abstraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Multilingual language models are attractive, as they allow us to cross linguistic boundaries and solve tasks in different languages in the same mathematical space. They come, however, at a cost: in the quest to find a shared space that satisfies (to a certain degree) all languages, the resulting representations lose, or fail to capture, properties specific to each language. We present an investigation into detecting linguistic structure through lexical abstraction. We study both a multilingual and a monolingual language model, and quantify the loss of information between them.</p>
      </abstract>
      <kwd-group>
        <kwd>multilingual and monolingual models</kwd>
        <kwd>linguistic abstraction</kwd>
        <kwd>functional words</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multilingual models are attractive because they project all languages represented in the training data into the same n-dimensional space. This makes it easy to plug them into tasks in different languages.</p>
      <p>
        The abilities of multilingual models are being actively debated. The first large-scale multilingual models suffered from the curse of multilinguality: "more languages leads to better cross-lingual performance on low-resource languages up until a point, after which the overall performance on monolingual and cross-lingual benchmarks degrade" [1, p. 1], which could be remedied by increasing the capacity of the models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or by training bilingual models for low-resource languages, where each such language is paired with a linguistically-related language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Forcing many languages to share the parameter space may lead to the emergence of language-universal representations in pretrained encoder models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], possibly even grammatical structure [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, these models do not encode structure in a language-independent, abstract way, but rather encode language-specific token-level clues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The work presented in this paper adds more detail to this picture. We investigate how accessible sentence structure is in sentence representations, comparing the representations obtained from a multilingual encoder model to its monolingual counterpart. We conduct this exploration on the problem of lexical abstraction, the process of reducing a sentence to its syntactic and semantic "skeleton" by replacing noun and prepositional phrases with functional words, as in the example: The authors wrote the paper. and They wrote it. We expect that lexical abstraction has occurred if we can detect the same syntactic structure in the embeddings of lexicalized and functional versions of pairs of sentences. This setup verifies whether the multilingual or the monolingual model performs better. The former result would indicate that training on several languages is beneficial to discovering shared structures. The latter result, instead, would indicate that sentence structure is encoded in a more language-specific manner, and is encoded better by a monolingual model, as the model does not need to reconcile the different ways the same type of grammatical information is expressed in different languages (e.g. number, case, gender, definiteness).</p>
      <p>To further explore multilingual models, we also perform experiments with generative LLMs, as they have been shown to favour English as an "internal" language [7, 8]. Here, we test whether a multilingual LLM detects (and generates) sentence structure better in English sentences than in Italian ones, by prompting the model with English, and separately with Italian sentences, asking it to produce the Italian functional form.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>To investigate how accessible sentence structure is in representations built by large language models, we use the Italian portion of a dataset that models the verb alternations change-of-state (CoS) and object drop (OD) [9]. The CoS verb class can undergo the transitive/intransitive causative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (The tourist broke the vase/The vase broke). The transitive form of the verb has a causative meaning. In contrast, for OD verbs the subject bears the same semantic role (Agent) in both the transitive and intransitive forms and the verb does not have a causative meaning (The artist was painting this fresco/The artist was painting) [10, 11]. Italian shows the same asymmetry but marks the intransitive alternant for CoS with a reflexive-like element SI (Il turista ruppe il vaso/Il vaso si ruppe; L'artista stava dipingendo questo affresco/L'artista stava dipingendo).</p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: Context and answer sentence structures for change-of-state (CoS) verbs (left), and object drop (OD) verbs (right).</p>
        </caption>
      </fig>
      <p>These verb classes constitute an ideal test-bed for our research question, because their combination of syntactic and semantic structure allows us not only to test whether sentences with different syntactic structures can be distinguished, but also whether sentences with the same syntactic structure but differing in the semantic roles can be distinguished.</p>
      <p>The data, described in detail in [12], consists of instances of a Blackbird Language Matrices (BLM) problem, a linguistic puzzle [13]. Each instance consists of an input context of seven sentences that illustrate several variations of CoS/OD verbs, and an answer set that contains a correct answer and nine wrong answer candidates, each of which is erroneous in specific ways. Figure 1 shows the syntactic-semantic structure of the sentences in a BLM instance. Lexicalized and functional instances are shown in Tables 4 and 5 in the appendix.</p>
      <p>Each BLM instance has a lexicalized (LEX) and a functional (FUN) form. In addition, there are three variations – type I, type II, type III – with increasing levels of lexical variation. The dataset is built based on thirty (manually chosen) verbs from each of the two classes discussed in Levin [10]. The functional lexicon has been manually selected by the authors to maintain the syntactic and semantic acceptability of the sentences.</p>
      <p>We build two variations starting from this dataset that allow us to test, from several angles, whether sentence structure is encoded in a sentence embedding in an abstract manner.</p>
      <p>Sentences We compile parallel versions of the sentences in their lexicalized and functional word forms from the FUN and LEX subsets of the type I BLM dataset. Each sentence has associated its syntactic pattern (the syntactic version of the syntactic-semantic template shown in Figure 1). From these, we sample 6000 sentences, uniformly distributed over the eight syntactic-semantic patterns. These are split into 4800:1200 training and test instances, and 20% of the training data is used for validation (train:dev:test – 3840:960:1200).</p>
      <p>BLM data Of the thirty verbs for each class, change of state and object drop, three are selected for testing and the other 27 for training. All instances for the three testing verbs are used. Two thousand instances of the other 27 verbs are randomly sampled for training. Ten percent of the training data is dynamically selected for validation. The same 27:3 verb split is used for all FUN/LEX and type I/type II/type III variations. All variations have 2000 instances for training and 300 for testing. In the experiments reported here we use a variation where the CoS and OD subtasks are merged. The data is split in a similar manner for training and testing (using the same verbs for training and testing as in the split of the individual subsets).</p>
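      <p>As a rough illustration of the sentence-level split above, a stratified split could be implemented as in the sketch below. This is not the authors' code; the tooling and the random seed are assumptions, only the proportions (4800:1200, then 20% of training held out, uniform over the eight patterns) come from the text.</p>
      <preformat># A minimal sketch, assuming scikit-learn and a fixed seed, of the
# 3840:960:1200 train:dev:test split stratified by syntactic-semantic pattern.
from sklearn.model_selection import train_test_split

def split_sentences(sentences, patterns):
    pairs = list(zip(sentences, patterns))
    # 6000 sentences -> 4800 train, 1200 test, uniform over the 8 patterns
    train, test = train_test_split(
        pairs, test_size=1200, stratify=patterns, random_state=0)
    # hold out 20% of the training data for validation
    train, dev = train_test_split(
        train, test_size=0.2, stratify=[p for _, p in train], random_state=0)
    return train, dev, test  # 3840 : 960 : 1200 instances</preformat>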
    </sec>
    <sec id="sec-exp">
      <title>3. Experiments</title>
      <p>We aim to quantify to what degree multilingual and monolingual language models encode syntactic structure by using the lexical abstraction property of pronouns and adverbs relative to nouns and noun phrases. We explore encoder models, and test whether the same syntactic structure and semantic role information is encoded in the embeddings of lexicalized sentences and their functional versions. With generative LLMs, we compare the performance of a model in generating the functional version of an input sentence, when this input is either in English or Italian, and the output is constrained to be Italian.</p>
      <sec id="sec-2-1">
        <title>3.1. Sentence structure in encoder models</title>
        <p>We perform two analyses to test whether the representations of functional and lexicalized sentences encode the same grammatical structure, in the same way: (i) we analyze individual sentences and test to what degree their grammatical structure (phrases and their semantic roles) can be detected (Section 3.1.1); (ii) we deploy the BLM linguistic puzzles, whose solution relies on detecting shared structure at the level of the input sequence and within each sentence (Section 3.1.2).</p>
        <p>We obtain word and sentence representations (as averaged token embeddings) from an Electra pretrained model [14]¹. We choose Electra because it has been shown to perform better than models from the BERT family on the Holmes benchmark², and to also encode information about syntactic and argument structure better [15, 16]. We use the Italian Electra³ as our monolingual model.</p>
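        <p>For concreteness, sentence embeddings of this kind can be obtained as mask-weighted averages of the final-layer token embeddings. The sketch below is not the authors' exact pipeline (the pooling details are an assumption), but it uses the two checkpoints named in footnotes 1 and 3.</p>
        <preformat># A minimal sketch of mean-pooled sentence embeddings from Electra models.
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_embeddings(sentences, model_name):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state     # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)      # exclude padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled, (batch, 768)

# the two checkpoints named in footnotes 1 and 3
emb_e = sentence_embeddings(["Il vaso si ruppe."],
                            "google/electra-base-discriminator")
emb_e_it = sentence_embeddings(["Il vaso si ruppe."],
                               "dbmdz/electra-base-italian-xxl-cased-discriminator")</preformat>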
        <sec id="sec-2-1-1">
          <title>3.1.1. Grammatical structure in sentence embeddings</title>
          <p>Syntactic structure and semantic roles represent complex information, which may be encoded by weighted combinations of subsets of dimensions [17, 18].</p>
          <p>We mine the sentence representations for this information following the approach described in Nastase and Merlo [16]. Using a variational encoder-decoder, an input sentence is compressed into a representation that captures syntactic and semantic role information, by imposing that the system reconstruct a sentence with the same syntactic and semantic information. An instance consists of an input sentence s with structure p, and a set of candidate outputs, with a sentence s′ ≠ s that has the same structure (p′ = p), and N negative examples sᵢ that have different structures (pᵢ ≠ p). In our experiments we use N = 7. The structure information is used to build the dataset and obtain a deeper evaluation of the results, but is not provided to the system.</p>
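          <p>The actual system is a variational encoder-decoder [16]; as an illustrative stand-in for the training signal it receives, the sketch below uses a max-margin loss (the margin value and cosine scoring are assumptions) that pushes the reconstructed representation towards the same-structure candidate and away from the N = 7 different-structure negatives.</p>
          <preformat># A sketch, not the authors' implementation, of the contrastive signal:
# prefer the candidate with the same structure (p' = p) over the negatives.
import torch
import torch.nn.functional as F

def structure_margin_loss(reconstruction, positive, negatives, margin=0.1):
    # reconstruction, positive: (dim,); negatives: (N, dim)
    pos_sim = F.cosine_similarity(reconstruction, positive, dim=-1)
    neg_sim = F.cosine_similarity(reconstruction.unsqueeze(0), negatives, dim=-1)
    # hinge: every negative should score at least `margin` below the positive
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()</preformat>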
          <p>Using the sentence datasets described in Section 2, we built datasets consisting of a mix of FUN and LEX instances (an instance will only contain either FUN or LEX sentences), and use the above-mentioned set-up to test: (i) how well a system reconstructs a sentence with the desired syntactic and semantic information, measured at the output through the F1 score⁴, and (ii) how well the system identifies the different patterns. Specifically, we ask whether the same patterns in lexicalized and functional forms are detected as being the same and, thus, mapped onto the same representation on the latent layer. We estimate the similarity of representations by visualising them on the latent layer. Sentence embeddings from Electra have size 768, and the latent layer in the system we use has size five.</p>
          <table-wrap id="tab1">
            <caption>
              <p>Table 1: F1 scores (averages over three runs) on predicting the sentence with the same structure as the input, through a variational encoder-decoder system, for sentences encoded with (multilingual) Electra (e) or (monolingual) Electra-It (e-It).</p>
            </caption>
          </table-wrap>
          <p>Table 1 shows the averaged F1 scores over three experiments. We note first that training and testing on the same type (FUN or LEX) leads to high results, thus validating the experimental set-up.</p>
          <p>The results on test data of the same type as the training data are very different from those on test data of the other type. This indicates that for each of the FUN and LEX data variations, the system discovers different clues to match two sentences with the same structure. The high results when training on the sentences with functional words may also indicate overfitting because of the repetitive vocabulary. We note that, consistently, the results obtained when using a monolingual model are higher than those when using the multilingual one, despite the assumption that a multilingual model must learn more abstract representations to satisfy the constraints of modeling many languages.</p>
          <p>Additional information comes from the analysis of the compressed representations on the latent layer, which are expected to capture the sentence structure that is shared by the functional and lexicalized data. We show the projection on the latent layer of the sentence representations in Figure 2, when sentence representations are obtained from Electra (left) and Electra-It (right). We note that these latent projections cluster by the syntactic structure and semantic roles of the sentences, and that using Electra-It representations leads to a tighter mix of lexicalized and functional sentences that have the same syntactic structure. This adds depth to the results in Table 1 – showing that when trained on a mix of functionalized and lexicalized instances, the system is able to discover a shared space of clues about the grammatical structure – and also shows that in the representations obtained from Electra-It there are stronger shared clues about grammatical structure in both functionalized and lexicalized sentences compared to the multilingual Electra model.</p>
          <p>¹ google/electra-base-discriminator</p>
          <p>² The HOLMES benchmark leaderboard: https://holmes-leaderboard.streamlit.app/. At the time of writing, the ranks were: Electra - 16, DeBERTa - 21, BERT - 41, RoBERTa - 45.</p>
          <p>³ dbmdz/electra-base-italian-xxl-cased-discriminator</p>
          <p>⁴ When processing each instance, the system chooses among 8 options, of which one is correct. The F1 score of the "positive" class provides the most balanced measure of performance.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.1.2. Task solving</title>
          <p>It might be objected that the previous experiments and visualisations do not conclusively show that latent representations encode structure, as opposed to just distinguishing seven distinct but amorphous classes. We use the BLM data to provide additional support to the conclusion that structure is represented. The BLM task frames a linguistic phenomenon as a linguistic puzzle. Solving this puzzle relies on detecting the linguistic objects, their relevant properties, and the structure both within each sentence and across the input sequence.</p>
          <fig id="fig3">
            <caption>
              <p>Figure 3: Comparison between the multilingual (left) and monolingual (right) Electra models for solving the BLM task: average F1 over three runs. The x-axis shows the training data: training on FUN and LEX instances jointly vs. training separately on FUN and LEX.</p>
            </caption>
          </fig>
          <p>Our BLM dataset has several levels of complexity: (i) a mixture of change-of-state and object-drop verbs, which exhibit different semantic frames for the intransitive answers (patient vs agent subjects), and share other frames (see Figure 1); (ii) lexicalized and functional instances; (iii) a maximal level of lexical variation in each instance. This set-up allows us to test whether syntactic structure and semantic roles are encoded similarly in the representations of lexicalized and functional sentences by monolingual and multilingual encoder models.</p>
          <p>We use the system described by Nastase and Merlo [16], which solves the BLM problem in two steps: it compresses the sentence into a representation that encodes the structure relevant to the BLM puzzle – linguistic objects and their syntactic and semantic role properties – and uses these compressed representations to solve the multiple-choice puzzle. The system's two steps are encoded through interconnected variational encoder-decoders, as illustrated in Figure 4, which are trained together. The learning objective is to maximize the score of the correct answer from the candidate answer set, and minimize that of the incorrect ones. During testing, the system constructs the representation of an answer, then chooses the closest one from the given options. All potential answers consist of a verb frame filled with phrases that play specific roles (Section 2). The correct one consists of the combination of phrases whose roles fit together for the given verb, while the others contain similar pieces which violate some semantic or syntactic rules (or both). This set-up allows us to test whether specific elements in the sentences from the input sequence, and their semantic roles, have been detected and used properly in building the correct answer.</p>
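          <p>A minimal sketch of the test-time answer selection just described is given below; the cosine scoring is an assumption, while the two-step encoder-decoder architecture itself follows [16].</p>
          <preformat># A sketch of choosing the closest candidate answer at test time.
import torch
import torch.nn.functional as F

def choose_answer(predicted_answer, candidates):
    # predicted_answer: (dim,) built from the 7-sentence input context;
    # candidates: (n_candidates, dim) embeddings of the answer set
    sims = F.cosine_similarity(predicted_answer.unsqueeze(0), candidates, dim=-1)
    return int(sims.argmax())  # index of the selected answer</preformat>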
          <p>Figure 3 shows the F1 results (as averages over three runs).</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Generating functional variations of sentences</title>
        <p>Multilingual generative models are not exposed to the same amounts of training data across languages, and probably for that reason they do not appear to treat every language in their training data equally. In fact, evidence has shown that English serves as a latent language for generative models (Llama 2). Tracking an input in languages other than English through the intermediate layers of the transformer, it has been shown that the representations drift more and more towards English, with a switch towards the input language's representation only at the last layers [7, 8]. We test whether this implies that the structure of an English sentence is more readily detected and preserved. The prompt with an Italian input sentence, requesting an Italian functional version, is shown below.</p>
        <preformat>Replace noun phrases with pronouns and prepositional phrases with adverbs.
Preserve the exact syntactic structure, word order, and verb forms.

Examples:
Input: "i suoi giocattoli erano intagliati dai suoi genitori nella baita" -> Output: "questi erano intagliati da loro là"

Now convert these:
1. Input: "quella canzone era canticchiata dai miei amici da qualche settimana"
Output:
2. Input: "le lingue del luogo sono studiate da alcuni linguisti"
Output:
...</preformat>
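        <p>As a sketch of how such a prompt can be issued to an instruction-tuned LLM, the snippet below uses the Hugging Face pipeline API; the checkpoint is a placeholder, since the exact generative model and decoding settings are not pinned down here.</p>
        <preformat># A sketch; the model name is a placeholder, not the paper's exact set-up.
from transformers import pipeline

prompt = """Replace noun phrases with pronouns and prepositional phrases with adverbs.
Preserve the exact syntactic structure, word order, and verb forms.

Examples:
Input: "i suoi giocattoli erano intagliati dai suoi genitori nella baita" -> Output: "questi erano intagliati da loro là"

Now convert these:
1. Input: "quella canzone era canticchiata dai miei amici da qualche settimana"
Output:"""

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
out = generator(prompt, max_new_tokens=60, do_sample=False)
print(out[0]["generated_text"])</preformat>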
2. Input: "the local languages are studied by some
linguists"
Output:
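          <p>A minimal sketch of the struct and pron measures, under the assumptions stated in the comments (the paper specifies spaCy v3.8.7 but not the pipeline; it_core_news_sm and text-based relation triples are our choices for illustration):</p>
          <preformat># A sketch of the struct F1 (footnote 5) and the pron ratio.
# Assumes spaCy's Italian pipeline it_core_news_sm; dependency relations are
# compared as (head text, label, child text) triples.
import spacy

nlp = spacy.load("it_core_news_sm")

def dep_triples(text):
    return {(t.head.text.lower(), t.dep_, t.text.lower()) for t in nlp(text)}

def struct_f1(system_output, gold):
    sys_rels, gold_rels = dep_triples(system_output), dep_triples(gold)
    tp = len(sys_rels & gold_rels)   # relations that overlap
    fp = len(sys_rels - gold_rels)   # extra relations in the system answer
    fn = len(gold_rels - sys_rels)   # gold relations missing from the output
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def pron_ratio(system_output, gold):
    count = lambda text: sum(t.pos_ in {"PRON", "ADV"} for t in nlp(text))
    return count(system_output) / max(count(gold), 1)</preformat>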
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Discussion</title>
      <p>We aimed to explore the impact of encoding multiple languages together, with English dominating the training data, for encoder and decoder language models.</p>
      <p>The comparison of detecting syntactic-semantic structure using a multilingual and a monolingual encoder model has shown that the monolingual Italian model encodes both structural and linguistic abstraction information in a cleaner and more accessible way than a multilingual model, contrary to previous hypotheses about multilingual training leading to the encoding of more abstract linguistic structures. We have shown this effect through an exploration of individual sentences, as well as when the sentence structure was required to solve a more complex linguistic puzzle. Adding the lexical abstraction level to the structure exploration allows us to reach the shared structures of lexicalized and functional sentence variations.</p>
      <p>Using a decoder transformer model, we have explored sentence structure encoding through the generative lens: how well does a system recognize and preserve the syntactic and semantic structure of an input sentence? Because it has been shown that English functions as a latent language, it would be expected that the structure of an English sentence is more readily detected and preserved. We found that that is not the case: mapping a lexicalized Italian input sentence into its functional form leads to better results, both in terms of preserving the structure and in the generation of pronominal and adverbial replacements for noun and prepositional phrases.</p>
      <p>Similarly to the experiments on the monolingual and multilingual encoder models, the experiments on the generative LLM have shown that forcing multiple languages to share the parameter space leads to the loss of syntactic, semantic and lexical language-specific information.</p>
    </sec>
    <sec id="sec-3-2">
      <title>5. Related work</title>
        <p>
          Multilingual models project many languages in the same parameter space. This brings some clear advantages: the model can be moved easily between different language applications, and it allows low-resource languages to be bootstrapped through their connections to other languages. It has been surmised that forcing multiple languages to share the same parameter space will lead to the emergence of linguistic universals. It has been shown that LLMs generalize across languages through implicitly learned vector alignment, which is less robust for generative models [20]. Some work using cross-lingual structural priming finds evidence that grammatical representations are abstract and shared in multilingual language models [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Further exploration has found, however, that this effect depends on the similarity between the included languages [21]. It has also been shown that models encode grammatical information, such as chunks and structure, in a language-specific manner [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Overall, it is difficult to draw a conclusion on the performance of multilingual models, because it can be overestimated due to skewed language selection [22].
        </p>
        <p>There are also downsides to building a multilingual model, as language particularities may be lost in the shared space, particularly when there is a dominant language. This may lead to language confusion in generation [23], and to a decrease in the faithfulness of multilingual models compared to monolingual ones, assessed in terms of feature attribution [24]. An asymmetrical effect of recall in monolingual and multilingual models depending on the syntactic role (subject vs. object) has also been found [25]. Finally, the language of the prompt affects a multilingual model's performance on binary questions about sentence grammaticality [26].</p>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>The current work aimed to explore the costs or advantages of multilingual and monolingual models on a linguistic problem that involves a form of abstraction in
language models. In particular, we focused on the issue
of lexical abstraction through functional words –
pronouns and adverbs standing in for noun and prepositional
phrases. Lexicalized and functional versions of the same
sentence share syntactic structure and semantic roles,
information which should be encoded by language models.
We tested whether this information is identifiable and
whether lexicalized and functional parallel sentences
encode this information in a similar manner. We explored
multilingual models, testing the assumption that forcing
many languages to share the same parameter space leads
to a more abstract encoding of information. We found
that this assumption does not hold in either encoder or
decoder models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through grant SNF Advanced grant TMAG-1_209426 to PM.</title>
    </sec>
    <sec id="sec-6">
      <title>A. Blackbird Language Matrices data</title>
      <p>Verb split between train and test for the COS and OD subsets. For the sentence representation analysis, the data
respects the same split.</p>
      <sec id="sec-6-1">
        <title>A.1. Data split</title>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. BLM task instances for change-of-state verbs</title>
      </sec>
      <sec id="sec-6-3">
        <title>A.3. BLM task instances for object-drop verbs</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Wu, M. Dredze, Are all languages created equal in multilingual BERT?, in: S. Gella, J. Welbl, M. Rei, F. Petroni, P. Lewis, E. Strubell, M. Seo, H. Hajishirzi (Eds.), Proceedings of the 5th Workshop on Representation Learning for NLP, Association for Computational Linguistics, Online, 2020, pp. 120–130. URL: https://aclanthology.org/2020.repl4nlp-1.16/. doi:10.18653/v1/2020.repl4nlp-1.16.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Conneau, S. Wu, H. Li, L. Zettlemoyer, V. Stoyanov, Emerging cross-lingual structure in pretrained language models, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 6022–6034. URL: https://aclanthology.org/2020.acl-main.536/. doi:10.18653/v1/2020.acl-main.536.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Sinclair, J. Jumelet, W. Zuidema, R. Fernández, Structural persistence in language models: Priming as a window into abstract language representations, Transactions of the Association for Computational Linguistics 10 (2022) 1031–1050. URL: https://aclanthology.org/2022.tacl-1.60/. doi:10.1162/tacl_a_00504.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Michaelov, C. Arnett, T. Chang, B. Bergen, Structural priming demonstrates abstract grammatical representations in multilingual language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3703–3720. URL: https://aclanthology.org/2023.emnlp-main.227/. doi:10.18653/v1/2023.emnlp-main.227.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] V. Nastase, G. Samo, C. Jiang, P. Merlo, Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 631–643. URL: https://aclanthology.org/2024.clicit-1.71/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. Wendler, V. Veselovsky, G. Monea, R. West, Do llamas work in English? On the latent language of multilingual transformers, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15366–15394. URL: https://aclanthology.org/2024.acl-long.820. doi:10.18653/v1/2024.acl-long.820.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] I. Papadimitriou, K. Lopez, D. Jurafsky, Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models, in: A. Vlachos, I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1194–1200. URL: https://aclanthology.org/2023.findings-eacl.89/. doi:10.18653/v1/2023.findings-eacl.89.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Samo, A structured synthetic dataset of English and Italian verb alternations for testing lexical abstraction via functional lexicon in LLMs, 2025. URL: https://ling.auf.net/lingbuzz/009085. Preprint available at lingbuzz/009085.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Levin, English verb classes and alternations: A preliminary investigation, University of Chicago Press, 1993.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Merlo, S. Stevenson, Automatic verb classification based on statistical distributions of argument structure, Computational Linguistics 27 (2001) 373–408. URL: https://aclanthology.org/J01-3003/. doi:10.1162/089120101317066122.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. Samo, A structured synthetic dataset of English and Italian verb alternations for testing lexical abstraction via functional lexicon in LLMs, 2025. URL: https://ling.auf.net/lingbuzz/009085. Preprint available at lingbuzz/009085.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Merlo, Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Yi, J. Bruno, J. Han, P. Zukerman, S. Steinert-Threlkeld, Probing for understanding of English verb classes and alternations in large pre-trained language models, in: Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 142–152. URL: https://aclanthology.org/2022.blackboxnlp-1.12.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, in: Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, H. Chen (Eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Miami, Florida, US, 2024, pp. 23–42. URL: https://aclanthology.org/2024.blackboxnlp-1.3/. doi:10.18653/v1/2024.blackboxnlp-1.3.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013) 1798–1828.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, C. Olah, Toy models of superposition, 2022. URL: https://arxiv.org/abs/2209.10652. arXiv:2209.10652.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Jones, W. Y. Wang, K. Mahowald, A massively multilingual analysis of cross-linguality in shared embedding space, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5833–5847. URL: https://aclanthology.org/2021.emnlp-main.471/. doi:10.18653/v1/2021.emnlp-main.471.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Q. Peng, A. Søgaard, Concept space alignment in multilingual LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 5511–5526. URL: https://aclanthology.org/2024.emnlp-main.315/. doi:10.18653/v1/2024.emnlp-main.315.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C. Arnett, T. A. Chang, J. A. Michaelov, B. Bergen, On the acquisition of shared grammatical representations in bilingual language models, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 20707–20726. URL: https://aclanthology.org/2025.acl-long.1010/. doi:10.18653/v1/2025.acl-long.1010.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] E. Ploeger, W. Poelman, M. de Lhoneux, J. Bjerva, What is “typological diversity” in NLP?, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 5681–5700. URL: https://aclanthology.org/2024.emnlp-main.326/. doi:10.18653/v1/2024.emnlp-main.326.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] K. Marchisio, W.-Y. Ko, A. Berard, T. Dehaze, S. Ruder, Understanding and mitigating language confusion in LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 6653–6677. URL: https://aclanthology.org/2024.emnlp-main.380/. doi:10.18653/v1/2024.emnlp-main.380.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Z. Zhao, N. Aletras, Comparing explanation faithfulness between multilingual and monolingual fine-tuned language models, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 3226–3244. URL: https://aclanthology.org/2024.naacl-long.178/. doi:10.18653/v1/2024.naacl-long.178.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Fierro, N. Foroutan, D. Elliott, A. Søgaard, How do multilingual language models remember facts?, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics, Vienna, Austria, 2025, pp. 16052–16106. URL: https://aclanthology.org/2025.findings-acl.827/. doi:10.18653/v1/2025.findings-acl.827.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Behzad, A. Zeldes, N. Schneider, To ask LLMs about English grammaticality, prompt them in a different language, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 15622–15634. URL: https://aclanthology.org/2024.findings-emnlp.916/. doi:10.18653/v1/2024.findings-emnlp.916.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>