BLM-It — Blackbird Language Matrices for Italian:
                                A CALAMITA Challenge
                                Chunyang Jiang1,2,∗, Giuseppe Samo1, Vivi Nastase1 and Paola Merlo1,2
                                1 Idiap Research Institute, Martigny, Switzerland
                                2 University of Geneva, Geneva, Switzerland


                                                   Abstract
                                                   In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles to learn language-related problems and
                                                   investigate deeper formal and semantic properties of language, through a process of paradigm understanding. A BLM matrix
                                                   consists of a context set and an answer set. The context is a sequence of sentences that encode implicitly an underlying
                                                   generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced following
                                                   corrupted generating rules. We propose three subtasks —agreement concord (Agr), causative (Caus) and object-drop (Od)
                                                   alternation detection— each in two variants of increasing lexical complexity. The datasets comprise a few prompts for
                                                   few-shot learning and a large test set.

                                                   Keywords
                                                   Blackbird Language Matrices, Causative/inchoative alternation, Object-drop alternation, subject-verb number agreement,
                                                   rule-based abstraction, disentanglement


                                1. Introduction and Motivation
Current generative large language models (LLMs) translate across close languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in very non-human ways. As shown by their prohibitive needs for training data and expensive computational resources, large language models neither generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.

   To reach systematic abilities in abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and help us train them for more complex skills.

   In the CALAMITA challenge [1], we propose to find the solution to Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven's Progressive Matrices tests [2]. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [3]. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure 1.

Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).

   Unlike other attempts to create textual versions of RPMs, BLMs are not simplistic transcriptions of visual stimuli [4] (a technique that, in practice, might give away parts of the solution to the problem), nor are they auxiliary abstractions of stimuli in the visual domain [5]. Instead, BLMs are matrices developed specifically to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of linguistic paradigm understanding.

   Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure 2.

                 Context
   NP-sg   PP1-sg            VP-sg
   NP-pl   PP1-sg            VP-pl
   NP-sg   PP1-pl            VP-sg
   NP-pl   PP1-pl            VP-pl
   NP-sg   PP1-sg   PP2-sg   VP-sg
   NP-pl   PP1-sg   PP2-sg   VP-pl
   NP-sg   PP1-pl   PP2-sg   VP-sg
                Answer set
   NP-pl   PP1-pl   PP2-sg      VP-pl   Correct
   NP-pl   PP1-pl   et PP2-sg   VP-pl   Coord
   NP-pl   PP1-pl               VP-pl   WNA
   NP-pl   PP1-sg   PP1-sg      VP-pl   WN1
   NP-pl   PP1-pl   PP2-pl      VP-pl   WN2
   NP-pl   PP1-pl   PP2-pl      VP-sg   AEV
   NP-pl   PP1-sg   PP2-pl      VP-sg   AEN1
   NP-pl   PP1-pl   PP2-sg      VP-sg   AEN2

Figure 2: BLM-AgrI template for verb-subject agreement, with one or two intervening phrases. Three generative rules: (i) the subject matches the verb in number (singular or plural); (ii) material can intervene and is of unbounded length; (iii) singular and plural alternate in regular patterns. NP=Noun Phrase, PP=Prepositional Phrase, VP=Verb Phrase. Answers: WNA=wrong number of attractors; WN1=wrong number for the 1st attractor noun (N1); WN2=wrong number for the 2nd attractor noun (N2); AEV=agreement error on the verb; AEN1=agreement error on N1; AEN2=agreement error on N2.

   BLM datasets are richly structured and support many different types of investigations, at both the sentence and matrix levels. The context-answer setup supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions [6, 7, 8]. The regular syntactic forms and the systematic semantic properties support investigations on systematicity and compositionality in neural networks. The predictable syntactic structure of individual sentences, and the structure within the sequence of a BLM context, also support investigations on sentence embeddings [9, 10]. BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies [11, 12]. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for linguistic investigation of specific phenomena.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
chunyang.jiang@unige.ch (C. Jiang); giuseppe.samo@idiap.ch (G. Samo); vivi.a.nastase@gmail.com (V. Nastase); Paola.Merlo@unige.ch (P. Merlo)
https://www.idiap.ch/en/scientific-research/researchers (P. Merlo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. The BLM-It Challenge

The BLM-It challenge consists of six sub-tasks.1 All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem defined (Agr, Caus, Od) and the lexical complexity of the data (II, III).2 While the agreement (Agr) task focuses on information about the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs: their ability to enter or not in a causative alternation, and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.

1 We choose names of tasks and lexical complexity levels that make it easier to cross-reference and compare the data described here with other papers published on BLMs.
2 Our datasets are available here:
  https://www.idiap.ch/en/scientific-research/data/blm-agri-gen,
  https://www.idiap.ch/en/scientific-research/data/blm-causi-gen,
  https://www.idiap.ch/en/scientific-research/data/blm-odi-gen.

BLM-AgrI The BLM problem for subject-verb agreement [6] consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects, e.g. the number of intervening attractors between the subject and the verb, the grammatical numbers of these attractors, and the clause structures. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template can be seen in Figure 2.

BLM-CausI The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L’artista ha aperto la finestra/La finestra si è aperta ‘The artist opened the window’/‘The window opened’). The transitive form of the verb has a causative meaning [13].

   The BLM-CausI template is shown in Figure 4. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and on the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by various prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall’artista, da-Ag/da-Pat).

   The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents and the presence (or absence) of a prepositional phrase.

BLM-OdI The BLM-OdI template is minimally different from BLM-CausI. They also act as each other’s controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L’artista dipingeva la finestra/L’artista dipingeva ‘the artist painted the window’/‘the artist
type II

   Context
   1 La zia mangia una bistecca nella sala grande
   2 La presidente può mangiare una bistecca da programma
   3 La specialità della casa deve essere mangiata dalla turista nella sala grande
   4 Una bistecca fu mangiata dalla presidente da sola
   5 La specialità della casa deve essere mangiata in un secondo
   6 Una bistecca deve poter essere mangiata da sola
   7 La turista deve mangiare con fame
   8 ???

   Answer set
   1 La specialità della casa può mangiare da sola
   2 La squadra di calcio deve mangiare da mezz’ora
   3 Una bistecca è mangiata dalla turista
   4 La squadra di calcio può essere mangiata da una carbonara
   5 La pasta col pomodoro può mangiare la squadra di calcio
   6 La squadra di calcio mangia una bistecca
   7 La specialità della casa deve poter mangiare dalla turista
   8 La presidente mangia da una bistecca

type III

   Context
   1 L’attore deve canticchiare un motivetto dopo il festival
   2 L’amica di mia mamma deve cucire la tasca da qualche giorno
   3 L’inno nazionale può essere cantato dal vincitore del festival con solo pianoforte
   4 Una bistecca deve essere mangiata dalla turista da sola
   5 Il manuale è insegnato nell’aula magna
   6 Questi attrezzi devono essere intagliati da manuale
   7 I due fratelli studiano con molta attenzione
   8 ???

   Answer set
   1 La pasta frolla deve impastare da sola
   2 L’autrice deve poter scrivere da qualche giorno
   3 I libri di testo devono poter essere studiati dai candidati
   4 Questi stilisti devono poter essere tessuti dai vestiti per la parata
   5 Questi motivi greci possono tessere questi stilisti
   6 L’idraulico saldò i cavi del lampadario
   7 La stanza pulisce da una delle propretarie dell’albergo
   8 Le sommozzatrici pescarono da delle trote

Figure 3: Two instances of BLM-OdI data: with little (type II) and maximal (type III) lexical variation.
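
As a rough illustration of what "little" versus "maximal" lexical variation means for the two contexts in Figure 3, the sketch below compares their type-token ratios (distinct word forms divided by total word forms). This is only an assumed proxy chosen for the example, not a measure used to build or evaluate the datasets; the function name `type_token_ratio` and the whitespace tokenisation are likewise illustrative choices.

```python
def type_token_ratio(sentences):
    """Distinct lowercase word forms divided by total word forms (whitespace split)."""
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    return len(set(tokens)) / len(tokens)

# Context sentences 1-7 of the type II instance in Figure 3.
TYPE_II_CONTEXT = [
    "La zia mangia una bistecca nella sala grande",
    "La presidente può mangiare una bistecca da programma",
    "La specialità della casa deve essere mangiata dalla turista nella sala grande",
    "Una bistecca fu mangiata dalla presidente da sola",
    "La specialità della casa deve essere mangiata in un secondo",
    "Una bistecca deve poter essere mangiata da sola",
    "La turista deve mangiare con fame",
]

# Context sentences 1-7 of the type III instance in Figure 3.
TYPE_III_CONTEXT = [
    "L'attore deve canticchiare un motivetto dopo il festival",
    "L'amica di mia mamma deve cucire la tasca da qualche giorno",
    "L'inno nazionale può essere cantato dal vincitore del festival con solo pianoforte",
    "Una bistecca deve essere mangiata dalla turista da sola",
    "Il manuale è insegnato nell'aula magna",
    "Questi attrezzi devono essere intagliati da manuale",
    "I due fratelli studiano con molta attenzione",
]

ttr_ii = type_token_ratio(TYPE_II_CONTEXT)
ttr_iii = type_token_ratio(TYPE_III_CONTEXT)
print(f"type II: {ttr_ii:.2f}  type III: {ttr_iii:.2f}")
```

On these two instances the ratio is lower for type II, where the same verb and a small set of nouns recur across the context, than for type III, where almost every sentence brings new vocabulary. A single pair of instances says nothing about the datasets as a whole.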


painted’) and the verb does not have a causative meaning [13].

   The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element and one of the contrastive answers for Caus is, in fact, the correct answer here.

   The template for BLM-OdI is in Figure 5. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP.

        Context                           Answer set
 1 Ag  Akt  Pat   p-NP             1 Pat Akt  by-NP   Correct
 2 Ag  Akt  Pat   by-NP            2 Ag  Akt  by-NP   I-Int
 3 Pat Pass by-Ag p-NP             3 Pat Pass by-Ag   ER-Pass
 4 Pat Pass by-Ag by-NP            4 Ag  Pass by-Pat  IER-Pass
 5 Pat Pass       p-NP             5 Pat Akt  Ag      R-Trans
 6 Pat Pass       by-NP            6 Ag  Akt  Pat     R-Trans
 7 Pat Akt        p-NP             7 Pat Akt  by-Ag   E-WrBy
 8 ???                             8 Ag  Akt  by-Pat  IE-WrBy

Figure 4: BLM-CausI Template. Three generative rules: (i) the presence of either one or two arguments and their attributes (agents, Ag; patients, Pat); (ii) the active (Akt) and passive (Pass) voice of the verb; (iii) the number and quality of nominal phrases (NP) following the verb. Answers: I-Int=wrong subject semantic role; ER-Pass=wrong verb mood; IER-Pass=wrong mood and wrong subject semantic role; R-Trans=wrong sequence reasoning (transitive sentence with the second NP not preceded by a preposition); IE-WrBy=ungrammatical sentence (NP following the preposition da).

        Context                           Answer set
 1 Ag  Akt  Pat   p-NP             1 Pat Akt  by-NP   I-Int
 2 Ag  Akt  Pat   by-NP            2 Ag  Akt  by-NP   Correct
 3 Pat Pass by-Ag p-NP             3 Pat Pass by-Ag   IER-Pass
 4 Pat Pass by-Ag by-NP            4 Ag  Pass by-Pat  ER-Pass
 5 Pat Pass       p-NP             5 Pat Akt  Ag      IR-Trans
 6 Pat Pass       by-NP            6 Ag  Akt  Pat     R-Trans
 7 Ag  Akt        p-NP             7 Pat Akt  by-Ag   IE-WrBy
 8 ???                             8 Ag  Akt  by-Pat  E-WrBy

Figure 5: BLM-OdI Template. Same generative rules as BLM-CausI, with the difference that here the passive/active voice is confounding, and the correct answer is an erroneous answer for BLM-CausI.

Lexical variants Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes for each matrix, compared to the other sentences, while in type III data, all words can change. Instances of the two variations are shown in Figure 3.

3. Data description

The data is generated by the process described in Figure 6: (i) start from identifying a linguistic phenomenon of interest, its forms of expression and factors influencing it within a context, (ii) produce a set of seed examples from natural or synthetic data, (iii) automatically augment the seeds using a fill-mask strategy, (iv) produce BLM instances following the designed templates and generative rules. Two instances of Od verb alternations are shown in Figure 3.

Figure 6: BLM data generation process, from seed examples of a linguistic problem to the complete dataset

3.1. Origin of data

BLM-AgrI To instantiate the templates, our starting point is the set of examples in Franck et al. [14, appendix1]. They provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced based on these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentence [6]. Each of these sentences is used to produce a seed.

BLM-CausI and BLM-OdI Thirty verbs from each of the causative and object-drop classes in English in Levin [13] were selected and translated by a native speaker into Italian, where translations maintain the same alternation structure.

   The seeds were augmented using masked modeling on bert-base-uncased [15]. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences, and to ensure variability in gender and number.

3.2. Data format

The structured BLM data is provided in a json file, each instance as one element with specific fields described in Figure 7. A data instance is shown in Figure 10 in the appendix.

3.3. Detailed data statistics

For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances. The rest will be used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.

   dataset               (few-shot) train    test
   BLM-AgrI (II/III)                   10    2000
   BLM-CausI (II/III)                  80    2080
   BLM-OdI (II/III)                    80    2080

Table 1
Data statistics for the three datasets, in terms of few-shot training and testing. There are the same number of examples in the type II (small lexical variation within an instance) and type III (maximal lexical variation within an instance) variations of the three datasets.

3.4. Example of prompts

We design prompts in English and Italian in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs’ ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and focus on the core task of selecting the best sentence to follow the given context.

   Zero-Shot Prompt Example in English The prompt in Figure 8 is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, like chain-of-thought or step-by-step reasoning [16, 17]. This ensures that the model’s performance reflects its intrinsic capabilities for linguistic understanding and reasoning without prior in-context learning or guided reasoning steps.

   We format the prompt in Markdown and explicitly label the sections for Context and Answer Set. The task is framed as a simple “puzzle” with the instruction to “choose […] the sentence that could […] follow the context”. This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and simplify the evaluation by fixing its output format.

   Few-Shot (One-Shot) Prompt Example in Italian For the one-shot prediction setup (as is shown in Figure 9), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model’s ability to use prior examples
 {
      "ID": <ID NUMBER>,
      "Context": [<List of comma-separated, double-quoted sentences>],
      "Context_concatenated": <Double-quoted concatenation of context sentences,
          each prefixed by a numeral (1 to 7) followed by a tab, separated by newlines>,
      "Answer_set": [<List of comma-separated, double-quoted sentences>],
      "Answer_concatenated": <Double-quoted concatenation of answer sentences,
          each prefixed by a letter (A, B, C, ...) followed by a tab, separated by newlines>,
      "Correct_option": <Double-quoted single letter label>,
      "Correct_answer": <Double-quoted single correct answer sentence>,
      "Answer_set_annotation": [<List of comma-separated triplets
      {"label":<error-type>,"value":<truth value>,"option":<single letter label>}>],
      "Verb": <Double-quoted single verb>
},


Figure 7: Data format
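
The field descriptions in Figure 7 imply a few internal consistency constraints that can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names `concatenate` and `check_instance` are hypothetical, the toy instance uses placeholder sentences rather than real BLM data (and only three candidates instead of a full answer set), and the concatenated fields are assumed to use exactly one tab after each label and one newline between lines, with no trailing newline.

```python
import json
import string

def concatenate(sentences, labels):
    """Join sentences as '<label><TAB><sentence>' lines, following Figure 7."""
    return "\n".join(f"{label}\t{sent}" for label, sent in zip(labels, sentences))

def check_instance(inst):
    """Return a list of inconsistencies found in one instance (empty if none)."""
    problems = []
    numerals = [str(i) for i in range(1, len(inst["Context"]) + 1)]
    letters = list(string.ascii_uppercase[: len(inst["Answer_set"])])
    if inst["Context_concatenated"] != concatenate(inst["Context"], numerals):
        problems.append("Context_concatenated does not match Context")
    if inst["Answer_concatenated"] != concatenate(inst["Answer_set"], letters):
        problems.append("Answer_concatenated does not match Answer_set")
    if inst["Correct_option"] not in letters:
        problems.append("Correct_option is not a valid letter")
    else:
        chosen = inst["Answer_set"][letters.index(inst["Correct_option"])]
        if chosen != inst["Correct_answer"]:
            problems.append("Correct_option does not point to Correct_answer")
    return problems

# Toy instance with placeholder sentences (hypothetical, not taken from the datasets).
context = [f"context sentence {i}" for i in range(1, 8)]
answers = ["candidate one", "candidate two", "candidate three"]
toy = {
    "ID": 0,
    "Context": context,
    "Context_concatenated": concatenate(context, [str(i) for i in range(1, 8)]),
    "Answer_set": answers,
    "Answer_concatenated": concatenate(answers, ["A", "B", "C"]),
    "Correct_option": "B",
    "Correct_answer": "candidate two",
    "Verb": "placeholder",
}

# Round-trip through JSON text, as would happen when reading the data file.
instance = json.loads(json.dumps(toy))
problems = check_instance(instance)
print(problems or "instance is consistent")
```

If the released files use a different separator or label scheme, only `concatenate` needs to change; the check that `Correct_option` indexes `Correct_answer` does not depend on those details.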


and adapt to a new linguistic context.

     # TASK: I'm asking you to solve a puzzle. The language of the puzzle is Italian.
     I will give you a list of sentences (numbered from 1 to 7) called the **Context**, and a set of sentences (identified by capital letters) called the **Answer Set**.
     Your task is to choose among the **Answer Set** the sentence that could be the next sentence following the **Context**.

     # FORMAT: You should **ONLY** output the letter corresponding to the best answer. Do not output other text before or after.

     # QUESTION
     **Context**
     {{Context_concatenated}}

     **Answer Set**
     {{Answer_concatenated}}

     **Your Choice**

Figure 8: Zero-Shot Prompt in English.

     # COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano.
     Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**.
     Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**.

     # FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.

     # ESEMPIO 1
     **Contesto**
     {{Context_concatenated}}

     **Risposte**
     {{Answer_concatenated}}

     **Scelta corretta**
     {Correct_option}

     # DOMANDA
     **Contesto**
     {{Context_concatenated}}

     **Risposte**
     {{Answer_concatenated}}

     **La tua scelta**

Figure 9: Few (One)-Shot Prompt in Italian.

4. Metrics

We perform zero-shot and one-shot evaluation on BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs) with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report averaged F1 scores over 3 runs in Table 2.

BLM-AgrI tasks Meta-Llama-3-70B-Instruct consistently outperforms the other models, particularly in zero-shot English prompts, while also competitive in
Model    English Prompt                    Italian Prompt
         Zero-Shot        One-Shot         Zero-Shot        One-Shot

BLM-AgrI type II
ML-70    44.1 ± 0.46      44.88 ± 4.63     39.46 ± 0.79     35.62 ± 2.36
ML-8     22.34 ± 0.33     17.84 ± 0.48     16.66 ± 1.56     19.30 ± 2.30
M-7      25.54 ± 0.58     30.66 ± 4.60     17.41 ± 1.37     21.1 ± 2.26
G-2      42.75 ± 1.01     43.64 ± 2.25     42.87 ± 0.62     40.62 ± 1.83

BLM-AgrI type III
ML-70    45.64 ± 0.05     41.35 ± 6.71     40.48 ± 0.52     34.89 ± 5.93
ML-8     26.65 ± 1.71     21.00 ± 2.07     22.68 ± 1.41     19.58 ± 5.68
M-7      31.26 ± 1.60     12.75 ± 6.28     33.21 ± 0.91     19.64 ± 6.02
G-2      38.48 ± 1.12     39.36 ± 3.27     36.54 ± 1.18     42.52 ± 6.83

BLM-CausI type II
ML-70    19.97 ± 0.65     36.81 ± 10.11    16.46 ± 0.36     31.95 ± 8.75
ML-8     5.85 ± 0.20      9.57 ± 5.20      6.72 ± 0.09      7.12 ± 3.00
M-7      8.45 ± 0.44      7.66 ± 1.87      5.94 ± 0.04      6.21 ± 1.02
G-2      18.06 ± 0.25     25.64 ± 4.30     14.23 ± 0.16     21.81 ± 3.93

BLM-CausI type III
ML-70    26.49 ± 0.85     24.14 ± 3.34     25.27 ± 0.72     23.78 ± 7.16
ML-8     18.03 ± 1.52     4.65 ± 0.38      16.59 ± 0.49     10.52 ± 2.21
M-7      20.08 ± 0.76     8.69 ± 3.12      14.91 ± 0.15     13.05 ± 2.05
G-2      29.12 ± 0.73     25.93 ± 4.98     28.8 ± 0.04      25.41 ± 2.94

BLM-OdI type II
ML-70    18.28 ± 2.18     32.51 ± 5.77     17.89 ± 1.06     24.61 ± 5.31
ML-8     8.55 ± 0.21      9.18 ± 1.62      9.1 ± 0.41       5.25 ± 2.92
M-7      1.92 ± 0.27      7.11 ± 3.59      2.79 ± 0.07      5.69 ± 1.31
G-2      14.07 ± 0.78     27.64 ± 4.63     14.43 ± 0.08     23.70 ± 2.42

BLM-OdI type III
ML-70    17.70 ± 0.32     20.05 ± 6.28     18.10 ± 0.44     23.01 ± 4.56
ML-8     9.50 ± 0.95      3.20 ± 0.57      10.78 ± 0.61     3.64 ± 0.85
M-7      11.60 ± 0.64     7.45 ± 4.27      9.74 ± 0.01      6.6 ± 2.19
G-2      14.74 ± 0.40     14.75 ± 3.55     15.49 ± 1.54     18.58 ± 1.60

[Results column: one bar chart per task block, showing F1 Macro (%) by Model (Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B-Instruct, Mistral-7B-Instruct-v0.3, gemma-2-9b-it), grouped by Prompt Language & Number of Shot(s): en | 0-shot, en | 1-shot, it | 0-shot, it | 1-shot.]

Table 2
Evaluation results on BLM-It tasks (AgrI, CausI, and OdI) using macro averaged F1 score (over 3 runs) and standard deviations
(±std). Each run was evaluated with 100 samples, one instance at a time, for Meta-Llama-3-70B-Instruct (ML-70), Meta-
Llama-3-8B-Instruct (ML-8), Mistral-7B-Instruct-v0.3 (M-7), Gemma-2-9b-It (G-2). Best performance is in bold, second best, if
overlapping intervals, in italics.
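
The reporting convention behind Table 2 (a macro-averaged F1 per run, then a mean and standard deviation over three runs) can be illustrated with a minimal sketch. It is not the evaluation code used for the challenge. The function names `macro_f1` and `summarise`, the choice of scoring over every label that appears in the gold or predicted answers, and the use of the sample standard deviation are all assumptions made for illustration; the toy predictions are not real results.

```python
from statistics import mean, stdev


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-label F1 scores."""
    if labels is None:
        # Assumption: score over every label seen in gold or predictions.
        labels = sorted(set(gold) | set(pred))
    per_label = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A label that is never gold and never predicted contributes 0.
        per_label.append(2 * tp / denom if denom else 0.0)
    return mean(per_label)


def summarise(run_scores):
    """Mean and sample standard deviation over runs, in percent."""
    scaled = [100 * s for s in run_scores]
    return mean(scaled), stdev(scaled)


# Toy data: three runs over the same four gold options (not real results).
gold = ["A", "B", "A", "B"]
runs = [
    ["A", "B", "B", "B"],
    ["A", "B", "A", "B"],
    ["B", "B", "A", "A"],
]
scores = [macro_f1(gold, pred) for pred in runs]
avg, std = summarise(scores)
print(f"{avg:.2f} ± {std:.2f}")
```

Passing an explicit `labels` list (for instance the eight option letters) changes the denominator of the macro average, so the label set should be fixed before comparing scores across systems.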


one-shot settings. Gemma-2-9b-it shows robust performance, especially with Italian prompts, performing similarly to the larger Meta-Llama model. In contrast, smaller models, such as Meta-Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.3, perform more weakly, especially with Italian prompts.

BLM-CausI tasks   Meta-Llama-3-70B-Instruct leads across both English and Italian prompts, with improvement in one-shot English for type II. Gemma-2-9b-it shows comparable performance across both languages, in both zero-shot and one-shot settings. Smaller models perform worse for this task, especially in one-shot Italian prompts.
BLM-OdI tasks   OdI tasks show the lowest overall performance across models. This indicates that the task is the most complex and challenging for the models. Meta-Llama-3-70B-Instruct performs best, particularly in one-shot English and Italian prompts. However, Mistral-7B-Instruct-v0.3 struggles the most, particularly in zero-shot settings, which suggests that the model has limited generalisation capabilities in complex linguistic tasks.

Key Observations   Larger models, such as Meta-Llama-3-70B-Instruct and Gemma-2-9b-it, consistently outperform smaller models, showing better generalisation and stability across tasks. English prompts generally result in higher F1 scores, though Italian prompts sometimes achieve comparable performance, particularly with Gemma-2-9b-it. One-shot prompting tends to improve performance, though the degree of improvement varies by model and task complexity. Smaller models, such as Mistral-7B-Instruct and Meta-Llama-3-8B-Instruct, show substantial variance, especially in one-shot scenarios, indicating instability in complex linguistic tasks.

Comparison with Multitask Learning Approaches
We compare our LLM prompting results with the work of [12, 11], which explored the properties of Italian sentence embeddings (the embeddings of the [CLS] token from a pretrained Electra model [18]; see footnote 3) through the agreement and the causative and object-drop BLM datasets, using a two-level Variational Encoder-Decoder architecture. This system learns to compress the sentence embeddings into representations relevant for the specific BLM tasks. The dataset statistics, and the results on the individual BLM tasks as averaged F1 scores over three runs and different amounts of lexical variation, are shown in Table 3.

 dataset               train:test    avg F1
                                     E-M              E-It
 BLM-AgrI type II      2400:4121     0.881 (0.003)    0.784 (0.007)
 BLM-AgrI type III     2400:4121     0.874 (0.006)    0.336 (0.005)
 BLM-CausI type II     2160:240      0.486 (0.005)    0.903 (0.010)
 BLM-CausI type III    2160:240      0.475 (0.010)    0.918 (0.010)
 BLM-OdI type II       2160:240      0.596 (0.010)    0.983 (0.003)
 BLM-OdI type III      2160:240      0.592 (0.024)    0.994 (0.004)

Table 3
Dataset statistics and evaluation results on a two-level variational encoder-decoder architecture using an Italian Electra (E-It) and a multilingual Electra (E-M) pretrained model to provide sentence embeddings.

While the results are not directly comparable, because the training process and the test data differ, systems that use pretrained transformer encoder architectures, like Electra, significantly outperform the zero- and one-shot prompting baselines. The performance gap suggests that while zero- or one-shot prompting is flexible, it may not capture the complex syntactic and semantic features required for the BLM task in Italian.

5. Limitations

While the data is very rich and richly structured, it shares all the limitations of artificial and synthetic data: stilted sentence structure, limited variability, and possibly sentences that are too short. This artificiality, though, might reduce, without eliminating, the risk of having sentences that were directly seen in the training data of the pretrained models that will be used, and that we use, for further experiments.

The initial seed sentences, although minimal, were crafted by experts. This approach is deliberate, as in the ARC dataset, to guarantee that the data are not algorithmically reproducible [19]. This expert-based approach, though, might not be easily scalable, especially given the complexity of the data. Exploring methods to leverage existing datasets for seed generation could mitigate this dependency.

The current dataset comprises three main tasks. More tasks and variants are needed to demonstrate the robustness and the wider appeal of the data.

6. Ethical issues

The data presented include an augmentation step that uses large language models (LLMs). LLMs are trained on extensive text data, which may unintentionally incorporate biases present in the training corpus.

7. Data license and copyright issues

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.

Acknowledgments

We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.

3
  Italian Electra (E-It) pretrained model: dbmdz/electra-base-italian-xxl-cased-discriminator, multi-lingual Electra (E-M) model: google/electra-base-discriminator
References

 [1] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA – Challenge the Abilities of LAnguage Models in ITAlian: Overview, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024.
 [2] P. Merlo, Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
 [3] J. C. Raven, Standardization of progressive matrices, British Journal of Medical Psychology 19 (1938) 137–150.
 [4] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (2023) 1526–1541. URL: https://doi.org/10.1038/s41562-023-01659-w. doi:10.1038/s41562-023-01659-w.
 [5] X. Hu, S. Storks, R. Lewis, J. Chai, In-context analogical reasoning with pre-trained language models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1953–1969. URL: https://aclanthology.org/2023.acl-long.109.
 [6] A. An, C. Jiang, M. A. Rodriguez, V. Nastase, P. Merlo, BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1363–1374. URL: https://aclanthology.org/2023.eacl-main.99.
 [7] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Toronto, Canada, 2023.
 [8] G. Samo, V. Nastase, C. Jiang, P. Merlo, BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023.
 [9] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, 2024. URL: https://aclanthology.org/2024.blackboxnlp-1.3. doi:10.18653/v1/2024.blackboxnlp-1.3.
[10] V. Nastase, P. Merlo, Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification, in: Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024), Bangkok, Thailand, 2024, pp. 203–214. URL: https://aclanthology.org/2024.repl4nlp-1.15.
[11] V. Nastase, G. Samo, C. Jiang, P. Merlo, Exploring Italian sentence embeddings properties through multi-tasking, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024), Pisa, Italy, 2024.
[12] V. Nastase, C. Jiang, G. Samo, P. Merlo, Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024), Pisa, Italy, 2024.
[13] B. Levin, English verb classes and alternations: A preliminary investigation, University of Chicago Press, 1993.
[14] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in French and English: The role of syntactic hierarchy, Language and cognitive processes 17 (2002) 371–404.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837.
[17] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in neural information processing systems 35 (2022) 22199–22213.
[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.
[19] F. Chollet, On the measure of intelligence, 2019. URL: https://arxiv.org/abs/1911.01547. arXiv:1911.01547.

A. Example Data Format
    [{
           "ID": 215,
           "Context": [
               "le pittrici possono disegnare delle forme in meno di due giorni",
               "le artiste possono disegnare delle rappresentazioni artistiche da un mese",
               "alcune coreografie sono disegnate dalle pittrici nel salone espositivo",
               "delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese",
               "alcune coreografie devono essere disegnate con pochi mezzi economici",
               "le scenografie devono essere disegnate da pochi mesi",
               "le pittrici devono disegnare nel salone espositivo"],
           "Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono
           disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel
           salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da
           un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere
           disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo",
           "Answer_set": [
               "delle rappresentazioni artistiche devono poter disegnare le sue allieve",
               "le scenografie devono essere disegnate dalle sue allieve",
               "le sue allieve devono essere disegnate da delle rappresentazioni artistiche",
               "le pittrici possono disegnare le scenografie",
               "le pittrici possono disegnare da un anno circa",
               "delle forme devono poter disegnare da pochi mesi",
               "le artiste devono poter disegnare da alcune coreografie",
               "delle rappresentazioni artistiche devono disegnare dalle artiste"],
           "Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie
           devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni
           artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle
           forme devono poter disegnare da pochi mesi\nG\tle artiste devono poter disegnare da alcune coreografie\nH\tdelle
           rappresentazioni artistiche devono disegnare dalle artiste",
           "Correct_option": "E",
           "Correct_answer": "le pittrici possono disegnare da un anno circa",
           "Answer_set_annotation": [
               {   "label": "IR-trans",
                   "value": false,
                   "option": "A" },
               {   "label": "IER-pass",
                   "value": false,
                   "option": "B" },
               {   "label": "ER-pass",
                   "value": false,
                   "option": "C" },
               {   "label": "R-trans",
                   "value": false,
                   "option": "D" },
               {   "label": "Correct",
                   "value": true,
                   "option": "E" },
               {   "label": "I-Int",
                   "value": false,
                   "option": "F" },
               {   "label": "E-WrBy",
                   "value": false,
                   "option": "G" },
               {   "label": "IE-WrBy",
                   "value": false,
                   "option": "H" }
           ],
           "Verb": "disegnare"
    },
    ....
    ]

Figure 10: Sample entry formatted for usage with the provided prompts.
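
As a hedged illustration of how an entry in this format might be consumed, the sketch below fills a prompt body from the `Context_concatenated` and `Answer_concatenated` fields and runs a few consistency checks on the answer fields. The `TEMPLATE` string is an assumption: it reuses only the field headers visible in the Italian prompt and omits the task instructions, and it uses Python `str.format` placeholders rather than the double-brace placeholders of the prompt figures. The helper names `render_prompt`, `parse_options` and `check_entry`, and the shortened two-option entry, are hypothetical and not part of the released data or code.

```python
# Hypothetical template: field headers only, task instructions omitted.
TEMPLATE = (
    "# DOMANDA\n"
    "**Contesto**\n{Context_concatenated}\n\n"
    "**Risposte**\n{Answer_concatenated}\n\n"
    "**La tua scelta**\n"
)


def render_prompt(entry):
    """Fill the template with the two pre-concatenated fields of an entry."""
    return TEMPLATE.format(
        Context_concatenated=entry["Context_concatenated"],
        Answer_concatenated=entry["Answer_concatenated"],
    )


def parse_options(answer_concatenated):
    """Split 'A<TAB>sentence' lines into (letter, sentence) pairs."""
    pairs = []
    for line in answer_concatenated.split("\n"):
        letter, _, sentence = line.partition("\t")
        pairs.append((letter, sentence))
    return pairs


def check_entry(entry):
    """Return a list of inconsistencies found in one entry (empty if none)."""
    problems = []
    options = parse_options(entry["Answer_concatenated"])
    letters = [letter for letter, _ in options]
    if len(set(letters)) != len(letters):
        problems.append("duplicate option letters")
    if [sentence for _, sentence in options] != entry["Answer_set"]:
        problems.append("Answer_concatenated does not match Answer_set")
    if dict(options).get(entry["Correct_option"]) != entry["Correct_answer"]:
        problems.append("Correct_option does not point at Correct_answer")
    flagged = [a["option"] for a in entry["Answer_set_annotation"] if a["value"]]
    if flagged != [entry["Correct_option"]]:
        problems.append("annotation does not mark exactly the correct option")
    return problems


# Shortened toy entry with two options only (not an actual dataset item).
toy = {
    "Context_concatenated": "1\tle pittrici devono disegnare nel salone espositivo",
    "Answer_set": [
        "le pittrici possono disegnare le scenografie",
        "le pittrici possono disegnare da un anno circa",
    ],
    "Answer_concatenated": (
        "A\tle pittrici possono disegnare le scenografie\n"
        "B\tle pittrici possono disegnare da un anno circa"
    ),
    "Correct_option": "B",
    "Correct_answer": "le pittrici possono disegnare da un anno circa",
    "Answer_set_annotation": [
        {"label": "R-trans", "value": False, "option": "A"},
        {"label": "Correct", "value": True, "option": "B"},
    ],
}

print(render_prompt(toy))
print(check_entry(toy))
```

Running such a check over a file loaded with `json.load` before prompting is one way to catch mismatches between option letters, the answer list and the annotation early, since a mislabelled option would otherwise surface only as a scoring error.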