<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual vs. monolingual transformer models in encoding linguistic structure and lexical abstraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Multilingual language models are attractive, as they allow us to cross linguistic boundaries and solve tasks in different languages in the same mathematical space. They come, however, at a cost: in the quest to find a shared space that satisfies (to a certain degree) all languages, the resulting representations lose, or fail to capture, properties specific to each language. We present an investigation into detecting linguistic structure through lexical abstraction. We study both a multilingual and a monolingual language model, and quantify the loss of information between them.</p>
      </abstract>
      <kwd-group>
        <kwd>multilingual and monolingual models</kwd>
        <kwd>linguistic abstraction</kwd>
        <kwd>functional words</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Multilingual models are attractive because they project all languages represented in the training data into the same n-dimensional space. This makes it easy to plug them into tasks in different languages.</p>
      <p>
        The abilities of multilingual models are being actively debated. The first large-scale multilingual models suffered from the curse of multilinguality: "more languages leads to better cross-lingual performance on low-resource languages up until a point, after which the overall performance on monolingual and cross-lingual benchmarks degrade" [1, p. 1], which could be remedied by increasing the capacity of the models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or by training bilingual models for low-resource languages, where each such language is paired with a linguistically-related language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Forcing many languages to share the parameter space may lead to the emergence of language-universal representations in pretrained encoder models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], possibly even grammatical structure [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, these models do not encode structure in a language-independent, abstract way, but rather encode language-specific token-level clues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The work presented in this paper adds more detail to this picture. We investigate how accessible sentence structure is in sentence representations, comparing the representations obtained from a multilingual encoder model to its monolingual counterpart. We conduct this exploration on the problem of lexical abstraction, the process of reducing a sentence to its syntactic and semantic "skeleton" by replacing noun and prepositional phrases with functional words, as in the example: The authors wrote the paper. and They wrote it. We expect that lexical abstraction has occurred if we can detect the same syntactic structure in the embeddings of lexicalized and functional versions of pairs of sentences. This setup verifies whether the multilingual or the monolingual model performs better. The former result would indicate that training on several languages is beneficial to discovering shared structures. The latter result, instead, would indicate that sentence structure is encoded in a more language-specific manner, and is encoded better by a monolingual model, as the model does not need to reconcile the different ways the same type of grammatical information is expressed in different languages (e.g. number, case, gender, definiteness).</p>
      <p>To further explore multilingual models, we also perform experiments with generative LLMs, as they have been shown to favour English as an "internal" language [7, 8]. Here, we test whether a multilingual LLM detects (and generates) sentence structure better in English sentences than in Italian ones, by prompting the model with English, and separately with Italian sentences, asking it to produce the Italian functional form.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>To investigate how accessible sentence structure is in representations built by large language models, we use the Italian portion of a dataset that models the verb alternations change-of-state (CoS) and object drop (OD) [9]. The CoS verb class can undergo the transitive/intransitive causative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (The tourist broke the vase/The vase broke). The transitive form of the verb has a causative meaning. In contrast, for OD verbs the subject bears the same semantic role (Agent) in both the transitive and intransitive forms and the verb does not have a causative meaning (The artist was painting this fresco/The artist was painting) [10, 11]. Italian shows the same asymmetry but marks the intransitive alternant for CoS with a reflexive-like element SI (Il turista ruppe il vaso/Il vaso si ruppe; L'artista stava dipingendo questo affresco/L'artista stava dipingendo).</p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: Context and answer sentence structures for change-of-state (CoS) verbs (left), and object drop (OD) verbs (right).</p>
        </caption>
      </fig>
      <p>These verb classes constitute an ideal test-bed for our research question, because their combination of syntactic and semantic structure allows us not only to test whether sentences with different syntactic structures can be distinguished, but also whether sentences with the same syntactic structure but differing in the semantic roles can be distinguished.</p>
      <p>The data, described in detail in [12], consists of instances of a Blackbird Language Matrices (BLM) problem, a linguistic puzzle [13]. Each instance consists of an input context of seven sentences that illustrate several variations of CoS/OD verbs, and an answer set that contains a correct answer and nine wrong answer candidates, each of which is erroneous in specific ways. Figure 1 shows the syntactic-semantic structure of the sentences in a BLM instance. Lexicalized and functional instances are shown in Tables 4 and 5 in the appendix.</p>
      <p>Each BLM instance has a lexicalized (LEX) and a functional (FUN) form. In addition, there are three variations – type I, type II, type III – with increasing levels of lexical variation. The dataset is built based on thirty (manually chosen) verbs from each of the two classes discussed in Levin [10]. The functional lexicon has been manually selected by the authors to maintain the syntactic and semantic acceptability of the sentences.</p>
      <p>We build two variations starting from this dataset that allow us to test, from several angles, whether sentence structure is encoded in a sentence embedding in an abstract manner.</p>
      <p>Sentences We compile parallel versions of the sentences in their lexicalized and functional word forms from the FUN and LEX subsets of the type I BLM dataset. Each sentence has associated its syntactic pattern (the syntactic version of the syntactic-semantic template shown in Figure 1). From these, we sample 6000 sentences, uniformly distributed over the eight syntactic-semantic patterns. These are split into 4800:1200 training and test instances, and 20% of the training data is used for validation (train:dev:test – 3840:960:1200).</p>
      <p>BLM data Of the thirty verbs for each class, change of state and object drop, three are selected for testing and the other 27 for training. All instances for the three testing verbs are used. Two thousand instances of the other 27 verbs are randomly sampled for training. Ten percent of the training data is dynamically selected for validation. The same 27:3 verb split is used for all FUN/LEX and type I/type II/type III variations. All variations have 2000 instances for training and 300 for testing. In the experiments reported here we use a variation where the CoS and OD subtasks are merged. The data is split in a similar manner for training and testing (using the same verbs for training and testing as in the split of the individual subsets).</p>
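      <p>As a rough illustration of the sentence-level split above, a stratified split could be implemented as in the sketch below. This is not the authors' code; the tooling and the random seed are assumptions, only the proportions (4800:1200, then 20% of training held out, uniform over the eight patterns) come from the text.</p>
      <preformat># A minimal sketch, assuming scikit-learn and a fixed seed, of the
# 3840:960:1200 train:dev:test split stratified by syntactic-semantic pattern.
from sklearn.model_selection import train_test_split

def split_sentences(sentences, patterns):
    pairs = list(zip(sentences, patterns))
    # 6000 sentences -> 4800 train, 1200 test, uniform over the 8 patterns
    train, test = train_test_split(
        pairs, test_size=1200, stratify=patterns, random_state=0)
    # hold out 20% of the training data for validation
    train, dev = train_test_split(
        train, test_size=0.2, stratify=[p for _, p in train], random_state=0)
    return train, dev, test  # 3840 : 960 : 1200 instances</preformat>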
    </sec>
    <sec id="sec-exp">
      <title>3. Experiments</title>
      <p>We aim to quantify to what degree multilingual and monolingual language models encode syntactic structure by using the lexical abstraction property of pronouns and adverbs relative to nouns and noun phrases. We explore encoder models, and test whether the same syntactic structure and semantic role information is encoded in the embeddings of lexicalized sentences and their functional versions. With generative LLMs, we compare the performance of a model in generating the functional version of an input sentence, when this input is either in English or Italian, and the output is constrained to be Italian.</p>
      <sec id="sec-2-1">
        <title>3.1. Sentence structure in encoder models</title>
        <p>We perform two analyses to test whether the representations of functional and lexicalized sentences encode the same grammatical structure, in the same way: (i) we analyze individual sentences and test to what degree their grammatical structure (phrases and their semantic roles) can be detected (Section 3.1.1); (ii) we deploy the BLM linguistic puzzles, whose solution relies on detecting shared structure at the level of the input sequence and within each sentence (Section 3.1.2).</p>
        <p>We obtain word and sentence representations (as averaged token embeddings) from an Electra pretrained model [14]¹. We choose Electra because it has been shown to perform better than models from the BERT family on the Holmes benchmark², and to also encode information about syntactic and argument structure better [15, 16]. We use the Italian Electra³ as our monolingual model.</p>
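        <p>For concreteness, sentence embeddings of this kind can be obtained as mask-weighted averages of the final-layer token embeddings. The sketch below is not the authors' exact pipeline (the pooling details are an assumption), but it uses the two checkpoints named in footnotes 1 and 3.</p>
        <preformat># A minimal sketch of mean-pooled sentence embeddings from Electra models.
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_embeddings(sentences, model_name):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state     # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)      # exclude padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled, (batch, 768)

# the two checkpoints named in footnotes 1 and 3
emb_e = sentence_embeddings(["Il vaso si ruppe."],
                            "google/electra-base-discriminator")
emb_e_it = sentence_embeddings(["Il vaso si ruppe."],
                               "dbmdz/electra-base-italian-xxl-cased-discriminator")</preformat>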
        <sec id="sec-2-1-1">
          <title>3.1.1. Grammatical structure in sentence embeddings</title>
          <p>Syntactic structure and semantic roles represent complex information, which may be encoded by weighted combinations of subsets of dimensions [17, 18].</p>
          <p>We mine the sentence representations for this information following the approach described in Nastase and Merlo [16]. Using a variational encoder-decoder, an input sentence is compressed into a representation that captures syntactic and semantic role information, by imposing that the system reconstruct a sentence with the same syntactic and semantic information. An instance consists of an input sentence s with structure p, and a set of candidate outputs, with a sentence s′ ≠ s that has the same structure (p′ = p), and N negative examples sᵢ that have different structures (pᵢ ≠ p). In our experiments we use N = 7. The structure information is used to build the dataset and obtain a deeper evaluation of the results, but is not provided to the system.</p>
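          <p>The actual system is a variational encoder-decoder [16]; as an illustrative stand-in for the training signal it receives, the sketch below uses a max-margin loss (the margin value and cosine scoring are assumptions) that pushes the reconstructed representation towards the same-structure candidate and away from the N = 7 different-structure negatives.</p>
          <preformat># A sketch, not the authors' implementation, of the contrastive signal:
# prefer the candidate with the same structure (p' = p) over the negatives.
import torch
import torch.nn.functional as F

def structure_margin_loss(reconstruction, positive, negatives, margin=0.1):
    # reconstruction, positive: (dim,); negatives: (N, dim)
    pos_sim = F.cosine_similarity(reconstruction, positive, dim=-1)
    neg_sim = F.cosine_similarity(reconstruction.unsqueeze(0), negatives, dim=-1)
    # hinge: every negative should score at least `margin` below the positive
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()</preformat>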
          <p>Using the sentence datasets described in Section 2, we built datasets consisting of a mix of FUN and LEX instances (an instance will only contain either FUN or LEX sentences), and use the above-mentioned set-up to test: (i) how well a system reconstructs a sentence with the desired syntactic and semantic information, measured at the output through the F1 score⁴, and (ii) how well the system identifies the different patterns. Specifically, we ask whether the same patterns in lexicalized and functional forms are detected as being the same and, thus, mapped onto the same representation on the latent layer. We estimate the similarity of representations by visualising them on the latent layer. Sentence embeddings from Electra have size 768, and the latent layer in the system we use has size five.</p>
          <table-wrap id="tab1">
            <caption>
              <p>Table 1: F1 scores (averages over three runs) on predicting the sentence with the same structure as the input, through a variational encoder-decoder system, for sentences encoded with (multilingual) Electra (e) or (monolingual) Electra-It (e-It).</p>
            </caption>
          </table-wrap>
          <p>Table 1 shows the averaged F1 scores over three experiments. We note first that training and testing on the same type (FUN or LEX) leads to high results, thus validating the experimental set-up.</p>
          <p>The results on test data of the same type as the training data are very different from those on test data of the other type. This indicates that for each of the FUN and LEX data variations, the system discovers different clues to match two sentences with the same structure. The high results when training on the sentences with functional words may also indicate overfitting because of the repetitive vocabulary. We note that, consistently, the results obtained when using a monolingual model are higher than those when using the multilingual one, despite the assumption that a multilingual model must learn more abstract representations to satisfy the constraints of modeling many languages.</p>
          <p>Additional information comes from the analysis of the compressed representations on the latent layer, which are expected to capture the sentence structure that is shared by the functional and lexicalized data. We show the projection on the latent layer of the sentence representations in Figure 2, when sentence representations are obtained from Electra (left) and Electra-It (right). We note that these latent projections cluster by the syntactic structure and semantic roles of the sentences, and that using Electra-It representations leads to a tighter mix of lexicalized and functional sentences that have the same syntactic structure. This adds depth to the results in Table 1 – showing that when trained on a mix of functionalized and lexicalized instances, the system is able to discover a shared space of clues about the grammatical structure – and also shows that in the representations obtained from Electra-It there are stronger shared clues about grammatical structure in both functionalized and lexicalized sentences compared to the multilingual Electra model.</p>
          <p>¹ google/electra-base-discriminator</p>
          <p>² The HOLMES benchmark leaderboard: https://holmes-leaderboard.streamlit.app/. At the time of writing, the ranks were: Electra - 16, DeBERTa - 21, BERT - 41, RoBERTa - 45.</p>
          <p>³ dbmdz/electra-base-italian-xxl-cased-discriminator</p>
          <p>⁴ When processing each instance, the system chooses among 8 options, of which one is correct. The F1 score of the "positive" class provides the most balanced measure of performance.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.1.2. Task solving</title>
          <p>It might be objected that the previous experiments and visualisations do not conclusively show that latent representations encode structure, as opposed to just distinguishing seven distinct but amorphous classes. We use the BLM data to provide additional support to the conclusion that structure is represented. The BLM task frames a linguistic phenomenon as a linguistic puzzle. Solving this puzzle relies on detecting the linguistic objects, their relevant properties, and the structure both within each sentence and across the input sequence.</p>
          <fig id="fig3">
            <caption>
              <p>Figure 3: Comparison between the multilingual (left) and monolingual (right) Electra models for solving the BLM task: average F1 over three runs. The x-axis shows the training data: training on FUN and LEX instances jointly vs. training separately on FUN and LEX.</p>
            </caption>
          </fig>
          <p>Our BLM dataset has several levels of complexity: (i) a mixture of change-of-state and object-drop verbs, which exhibit different semantic frames for the intransitive answers (patient vs agent subjects), and share other frames (see Figure 1); (ii) lexicalized and functional instances; (iii) a maximal level of lexical variation in each instance. This set-up allows us to test whether syntactic structure and semantic roles are encoded similarly in the representations of lexicalized and functional sentences by monolingual and multilingual encoder models.</p>
          <p>We use the system described by Nastase and Merlo [16], which solves the BLM problem in two steps: it compresses the sentence into a representation that encodes the structure relevant to the BLM puzzle – linguistic objects and their syntactic and semantic role properties – and uses these compressed representations to solve the multiple-choice puzzle. The system's two steps are encoded through interconnected variational encoder-decoders, as illustrated in Figure 4, which are trained together. The learning objective is to maximize the score of the correct answer from the candidate answer set, and minimize that of the incorrect ones. During testing, the system constructs the representation of an answer, then chooses the closest one from the given options. All potential answers consist of a verb frame filled with phrases that play specific roles (Section 2). The correct one consists of the combination of phrases whose roles fit together for the given verb, while the others contain similar pieces which violate some semantic or syntactic rules (or both). This set-up allows us to test whether specific elements in the sentences from the input sequence, and their semantic roles, have been detected and used properly in building the correct answer.</p>
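          <p>A minimal sketch of the test-time answer selection just described is given below; the cosine scoring is an assumption, while the two-step encoder-decoder architecture itself follows [16].</p>
          <preformat># A sketch of choosing the closest candidate answer at test time.
import torch
import torch.nn.functional as F

def choose_answer(predicted_answer, candidates):
    # predicted_answer: (dim,) built from the 7-sentence input context;
    # candidates: (n_candidates, dim) embeddings of the answer set
    sims = F.cosine_similarity(predicted_answer.unsqueeze(0), candidates, dim=-1)
    return int(sims.argmax())  # index of the selected answer</preformat>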
          <p>Figure 3 shows the F1 results (as averages over three runs).</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Generating functional variations of sentences</title>
        <p>Multilingual generative models are not exposed to the same amounts of training data across languages, and probably for that reason they do not appear to treat every language in their training data equally. In fact, evidence has shown that English serves as a latent language for generative models (Llama 2). Tracking an input in languages other than English through the intermediate layers of the transformer, it has been shown that the representations drift more and more towards English, with a switch towards the input language's representation only at the last layers [7, 8]. We test whether this implies that the structure of an English sentence is more readily detected and preserved. The prompt with an Italian input sentence, requesting an Italian functional version, is shown below.</p>
        <preformat>Replace noun phrases with pronouns and prepositional phrases with adverbs.
Preserve the exact syntactic structure, word order, and verb forms.

Examples:
Input: "i suoi giocattoli erano intagliati dai suoi genitori nella baita" -> Output: "questi erano intagliati da loro là"

Now convert these:
1. Input: "quella canzone era canticchiata dai miei amici da qualche settimana"
Output:
2. Input: "le lingue del luogo sono studiate da alcuni linguisti"
Output:
...</preformat>
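        <p>As a sketch of how such a prompt can be issued to an instruction-tuned LLM, the snippet below uses the Hugging Face pipeline API; the checkpoint is a placeholder, since the exact generative model and decoding settings are not pinned down here.</p>
        <preformat># A sketch; the model name is a placeholder, not the paper's exact set-up.
from transformers import pipeline

prompt = """Replace noun phrases with pronouns and prepositional phrases with adverbs.
Preserve the exact syntactic structure, word order, and verb forms.

Examples:
Input: "i suoi giocattoli erano intagliati dai suoi genitori nella baita" -> Output: "questi erano intagliati da loro là"

Now convert these:
1. Input: "quella canzone era canticchiata dai miei amici da qualche settimana"
Output:"""

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
out = generator(prompt, max_new_tokens=60, do_sample=False)
print(out[0]["generated_text"])</preformat>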
2. Input: "the local languages are studied by some
linguists"
Output:
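          <p>A minimal sketch of the struct and pron measures, under the assumptions stated in the comments (the paper specifies spaCy v3.8.7 but not the pipeline; it_core_news_sm and text-based relation triples are our choices for illustration):</p>
          <preformat># A sketch of the struct F1 (footnote 5) and the pron ratio.
# Assumes spaCy's Italian pipeline it_core_news_sm; dependency relations are
# compared as (head text, label, child text) triples.
import spacy

nlp = spacy.load("it_core_news_sm")

def dep_triples(text):
    return {(t.head.text.lower(), t.dep_, t.text.lower()) for t in nlp(text)}

def struct_f1(system_output, gold):
    sys_rels, gold_rels = dep_triples(system_output), dep_triples(gold)
    tp = len(sys_rels & gold_rels)   # relations that overlap
    fp = len(sys_rels - gold_rels)   # extra relations in the system answer
    fn = len(gold_rels - sys_rels)   # gold relations missing from the output
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def pron_ratio(system_output, gold):
    count = lambda text: sum(t.pos_ in {"PRON", "ADV"} for t in nlp(text))
    return count(system_output) / max(count(gold), 1)</preformat>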
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Discussion</title>
      <p>We aimed to explore the impact of encoding multiple languages together, with English dominating the training data, for encoder and decoder language models.</p>
      <p>The comparison of detecting syntactic-semantic structure using a multilingual and a monolingual encoder model has shown that the monolingual Italian model encodes both structural and linguistic abstraction information in a cleaner and more accessible way than a multilingual model, contrary to previous hypotheses about multilingual training leading to the encoding of more abstract linguistic structures. We have shown this effect through an exploration of individual sentences, as well as when the sentence structure was required to solve a more complex linguistic puzzle. Adding the lexical abstraction level to the structure exploration allows us to reach the shared structures of lexicalized and functional sentence variations.</p>
      <p>Using a decoder transformer model, we have explored sentence structure encoding through the generative lens: how well does a system recognize and preserve the syntactic and semantic structure of an input sentence? Because it has been shown that English functions as a latent language, it would be expected that the structure of an English sentence is more readily detected and preserved. We found that that is not the case: mapping a lexicalized Italian input sentence into its functional form leads to better results, both in terms of preserving the structure and in the generation of pronominal and adverbial replacements for noun and prepositional phrases.</p>
      <p>Similarly to the experiments on the monolingual and multilingual encoder models, the experiments on the generative LLM have shown that forcing multiple languages to share the parameter space leads to the loss of syntactic, semantic and lexical language-specific information.</p>
    </sec>
    <sec id="sec-3-2">
      <title>5. Related work</title>
        <p>
          Multilingual models project many languages in the same parameter space. This brings some clear advantages: the model can be moved easily between different language applications, and it allows low-resource languages to be bootstrapped through their connections to other languages. It has been surmised that forcing multiple languages to share the same parameter space will lead to the emergence of linguistic universals. It has been shown that LLMs generalize across languages through implicitly learned vector alignment, which is less robust for generative models [20]. Some work using cross-lingual structural priming finds evidence that grammatical representations are abstract and shared in multilingual language models [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Further exploration has found, however, that this effect depends on the similarity between the included languages [21]. It has also been shown that models encode grammatical information, such as chunks and structure, in a language-specific manner [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Overall, it is difficult to draw a conclusion on the performance of multilingual models, because it can be overestimated due to skewed language selection [22].
        </p>
        <p>There are also downsides to building a multilingual model, as language particularities may be lost in the shared space, particularly when there is a dominant language. This may lead to language confusion in generation [23], and to a decrease in the faithfulness of multilingual models compared to monolingual ones, assessed in terms of feature attribution [24]. An asymmetrical effect of recall in monolingual and multilingual models depending on the syntactic role (subject vs. object) has also been found [25]. Finally, the language of the prompt affects a multilingual model's performance on binary questions about sentence grammaticality [26].</p>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>The current work aimed to explore the costs or advantages of multilingual and monolingual models on a linguistic problem that involves a form of abstraction in
language models. In particular, we focused on the issue
of lexical abstraction through functional words –
pronouns and adverbs standing in for noun and prepositional
phrases. Lexicalized and functional versions of the same
sentence share syntactic structure and semantic roles,
information which should be encoded by language models.
We tested whether this information is identifiable and
whether lexicalized and functional parallel sentences
encode this information in a similar manner. We explored
multilingual models, testing the assumption that forcing
many languages to share the same parameter space leads
to a more abstract encoding of information. We found
that this assumption does not hold in either encoder or
decoder models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through grant SNF Advanced grant TMAG-1_209426 to PM.</title>
    </sec>
    <sec id="sec-6">
      <title>A. Blackbird Language Matrices data</title>
      <p>Verb split between train and test for the COS and OD subsets. For the sentence representation analysis, the data
respects the same split.</p>
      <sec id="sec-6-1">
        <title>A.1. Data split</title>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. BLM task instances for change-of-state verbs</title>
      </sec>
      <sec id="sec-6-3">
        <title>A.3. BLM task instances for object-drop verbs</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Wu, M. Dredze, Are all languages created equal in multilingual BERT?, in: S. Gella, J. Welbl, M. Rei, F. Petroni, P. Lewis, E. Strubell, M. Seo, H. Hajishirzi (Eds.), Proceedings of the 5th Workshop on Representation Learning for NLP, Association for Computational Linguistics, Online, 2020, pp. 120–130. URL: https://aclanthology.org/2020.repl4nlp-1.16/. doi:10.18653/v1/2020.repl4nlp-1.16.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Conneau, S. Wu, H. Li, L. Zettlemoyer, V. Stoyanov, Emerging cross-lingual structure in pretrained language models, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 6022–6034. URL: https://aclanthology.org/2020.acl-main.536/. doi:10.18653/v1/2020.acl-main.536.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Sinclair, J. Jumelet, W. Zuidema, R. Fernández, Structural persistence in language models: Priming as a window into abstract language representations, Transactions of the Association for Computational Linguistics 10 (2022) 1031–1050. URL: https://aclanthology.org/2022.tacl-1.60/. doi:10.1162/tacl_a_00504.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Michaelov, C. Arnett, T. Chang, B. Bergen, Structural priming demonstrates abstract grammatical representations in multilingual language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3703–3720. URL: https://aclanthology.org/2023.emnlp-main.227/. doi:10.18653/v1/2023.emnlp-main.227.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] V. Nastase, G. Samo, C. Jiang, P. Merlo, Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 631–643. URL: https://aclanthology.org/2024.clicit-1.71/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. Wendler, V. Veselovsky, G. Monea, R. West, Do llamas work in English? On the latent language of multilingual transformers, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15366–15394. URL: https://aclanthology.org/2024.acl-long.820. doi:10.18653/v1/2024.acl-long.820.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] I. Papadimitriou, K. Lopez, D. Jurafsky, Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models, in: A. Vlachos, I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1194–1200. URL: https://aclanthology.org/2023.findings-eacl.89/. doi:10.18653/v1/2023.findings-eacl.89.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Samo, A structured synthetic dataset of English and Italian verb alternations for testing lexical abstraction via functional lexicon in LLMs, 2025. URL: https://ling.auf.net/lingbuzz/009085. Preprint available at lingbuzz/009085.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Levin, English verb classes and alternations: A preliminary investigation, University of Chicago Press, 1993.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Merlo, S. Stevenson, Automatic verb classification based on statistical distributions of argument structure, Computational Linguistics 27 (2001) 373–408. URL: https://aclanthology.org/J01-3003/. doi:10.1162/089120101317066122.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. Samo, A structured synthetic dataset of English and Italian verb alternations for testing lexical abstraction via functional lexicon in LLMs, 2025. URL: https://ling.auf.net/lingbuzz/009085. Preprint available at lingbuzz/009085.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Merlo, Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Yi, J. Bruno, J. Han, P. Zukerman, S. Steinert-Threlkeld, Probing for understanding of English verb classes and alternations in large pre-trained language models, in: Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 142–152. URL: https://aclanthology.org/2022.blackboxnlp-1.12.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, in: Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, H. Chen (Eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Miami, Florida, US, 2024, pp. 23–42. URL: https://aclanthology.org/2024.blackboxnlp-1.3/. doi:10.18653/v1/2024.blackboxnlp-1.3.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013) 1798–1828.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, C. Olah, Toy models of superposition, 2022. URL: https://arxiv.org/abs/2209.10652. arXiv:2209.10652.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Jones, W. Y. Wang, K. Mahowald, A massively multilingual analysis of cross-linguality in shared embedding space, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5833–5847. URL: https://aclanthology.org/2021.emnlp-main.471/. doi:10.18653/v1/2021.emnlp-main.471.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Q. Peng, A. Søgaard, Concept space alignment in multilingual LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 5511–5526. URL: https://aclanthology.org/2024.emnlp-main.315/. doi:10.18653/v1/2024.emnlp-main.315.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C. Arnett, T. A. Chang, J. A. Michaelov, B. Bergen, On the acquisition of shared grammatical representations in bilingual language models, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 20707–20726. URL: https://aclanthology.org/2025.acl-long.1010/. doi:10.18653/v1/2025.acl-long.1010.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] E. Ploeger, W. Poelman, M. de Lhoneux, J. Bjerva, What is “typological diversity” in NLP?, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 5681–5700. URL: https://aclanthology.org/2024.emnlp-main.326/. doi:10.18653/v1/2024.emnlp-main.326.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] K. Marchisio, W.-Y. Ko, A. Berard, T. Dehaze, S. Ruder, Understanding and mitigating language confusion in LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 6653–6677. URL: https://aclanthology.org/2024.emnlp-main.380/. doi:10.18653/v1/2024.emnlp-main.380.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Z. Zhao, N. Aletras, Comparing explanation faithfulness between multilingual and monolingual fine-tuned language models, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 3226–3244. URL: https://aclanthology.org/2024.naacl-long.178/. doi:10.18653/v1/2024.naacl-long.178.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Fierro, N. Foroutan, D. Elliott, A. Søgaard, How do multilingual language models remember facts?, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics, Vienna, Austria, 2025, pp. 16052–16106. URL: https://aclanthology.org/2025.findings-acl.827/. doi:10.18653/v1/2025.findings-acl.827.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Behzad, A. Zeldes, N. Schneider, To ask LLMs about English grammaticality, prompt them in a different language, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 15622–15634. URL: https://aclanthology.org/2024.findings-emnlp.916/. doi:10.18653/v1/2024.findings-emnlp.916.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>