BLM-It – Blackbird Language Matrices for Italian: A CALAMITA Challenge

Chunyang Jiang1,2,∗, Giuseppe Samo1, Vivi Nastase1 and Paola Merlo1,2

1 Idiap Research Institute, Martigny, Switzerland
2 University of Geneva, Geneva, Switzerland

Abstract
In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles designed to learn language-related problems and to investigate deeper formal and semantic properties of language through a process of paradigm understanding. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced by following corrupted generating rules. We propose three subtasks, agreement concord (Agr), causative (Caus) and object-drop (Od) alternation detection, each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.

Keywords
Blackbird Language Matrices, Causative/inchoative alternation, Object-drop alternation, subject-verb number agreement, rule-based abstraction, disentanglement

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
chunyang.jiang@unige.ch (C. Jiang); giuseppe.samo@idiap.ch (G. Samo); vivi.a.nastase@gmail.com (V. Nastase); Paola.Merlo@unige.ch (P. Merlo)
https://www.idiap.ch/en/scientific-research/researchers (P. Merlo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction and Motivation

Current generative large language models (LLMs) translate across close languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in very non-human ways. As shown by their prohibitive needs in size of training data and expensive computational resources, large language models neither generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.

To reach systematic abilities in abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and help us train them to more complex skills.

In the CALAMITA challenge [1], we propose to find the solution to Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven Progressive Matrices tests [2]. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [3]. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure 1.

Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).

Unlike other attempts to create textual versions of RPMs, BLMs are not simplistic transcriptions of visual stimuli [4] (a technique that, in practice, might give away parts of the solution to the problem), nor are they auxiliary abstractions of stimuli in the visual domain [5]. Instead, BLMs are matrices developed specifically to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of linguistic paradigm understanding.

Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure 2.

BLM datasets are richly structured and support many different types of investigations, at both the sentence and matrix levels. The context-answer setup supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions [6, 7, 8]. The regular syntactic forms and the systematic semantic properties support investigations on systematicity and compositionality in neural networks. The predictable syntactic structure of individual sentences, and the structure within the sequence of a BLM context, also support investigations on sentence embeddings [9, 10].

Context
  NP-sg  PP1-sg          VP-sg
  NP-pl  PP1-sg          VP-pl
  NP-sg  PP1-pl          VP-sg
  NP-pl  PP1-pl          VP-pl
  NP-sg  PP1-sg  PP2-sg  VP-sg
  NP-pl  PP1-sg  PP2-sg  VP-pl
  NP-sg  PP1-pl  PP2-sg  VP-sg
Answer set
  NP-pl  PP1-pl  PP2-sg     VP-pl   Correct
  NP-pl  PP1-pl  et PP2-sg  VP-pl   Coord
  NP-pl  PP1-pl             VP-pl   WNA
  NP-pl  PP1-sg  PP1-sg     VP-pl   WN1
  NP-pl  PP1-pl  PP2-pl     VP-pl   WN2
  NP-pl  PP1-pl  PP2-pl     VP-sg   AEV
  NP-pl  PP1-sg  PP2-pl     VP-sg   AEN1
  NP-pl  PP1-pl  PP2-sg     VP-sg   AEN2
Figure 2: BLM-AgrI template for verb-subject agreement, with one or two intervening phrases. Three generative rules: (i) Subject matches in number with verb (singular or plural); (ii) material can intervene and is of unbounded length; (iii) singular and plural alternate in regular patterns. NP=Noun Phrase, PP=Prepositional Phrase, VP=Verb Phrase. Answers: WNA=wrong number of attractors; WN1=wrong nr. for 1st attractor noun (N1); WN2=wrong nr. for 2nd attractor noun (N2); AEV=agreement error on the verb; AEN1=agreement error on N1; AEN2=agreement error on N2.

BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies [11, 12]. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for linguistic investigation of specific phenomena.

2. The BLM-It Challenge

The BLM-It challenge consists of six sub-tasks.1 All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem defined (Agr, Caus, Od) and the lexical complexity of the data (II, III).2 While the agreement (Agr) task focuses on information about the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs, their ability to enter or not in a causative alternation and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.

1 Our datasets are available here:
https://www.idiap.ch/en/scientific-research/data/blm-agri-gen,
https://www.idiap.ch/en/scientific-research/data/blm-causi-gen,
https://www.idiap.ch/en/scientific-research/data/blm-odi-gen.
2 We choose names of tasks and lexical complexity levels that make it easier to cross-reference and compare the data described here with other papers published on BLMs.

BLM-AgrI The BLM problem for subject-verb agreement [6] consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects, e.g. number of intervening attractors between the subject and the verb, different grammatical numbers for these attractors, and different clause structures. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template can be seen in Figure 2.

BLM-CausI The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra/La finestra si è aperta 'The artist opened the window'/'The window opened'). The transitive form of the verb has a causative meaning [13].

The BLM-CausI template is shown in Figure 4. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by multifarious prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat).

The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents and the presence (or absence) of a prepositional phrase.

BLM-OdI The BLM-OdI template is minimally different from BLM-CausI. They also act as each other's controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L'artista dipingeva la finestra/L'artista dipingeva 'the artist painted the window'/'the artist painted') and the verb does not have a causative meaning [13].

The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element and one of the contrastive answers for Caus is, in fact, the correct answer here.

The template for BLM-OdI is in Figure 5. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP.

Lexical variants Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes for each matrix, compared to the other sentences, while in type III data, all words can change. Instances of the two variations are shown in Figure 3.

type II
Context
1 La zia mangia una bistecca nella sala grande
2 La presidente può mangiare una bistecca da programma
3 La specialità della casa deve essere mangiata dalla turista nella sala grande
4 Una bistecca fu mangiata dalla presidente da sola
5 La specialità della casa deve essere mangiata in un secondo
6 Una bistecca deve poter essere mangiata da sola
7 La turista deve mangiare con fame
8 ???
Answer set
1 La specialità della casa può mangiare da sola
2 La squadra di calcio deve mangiare da mezz'ora
3 Una bistecca è mangiata dalla turista
4 La squadra di calcio può essere mangiata da una carbonara
5 La pasta col pomodoro può mangiare la squadra di calcio
6 La squadra di calcio mangia una bistecca
7 La specialità della casa deve poter mangiare dalla turista
8 La presidente mangia da una bistecca

type III
Context
1 L'attore deve canticchiare un motivetto dopo il festival
2 L'amica di mia mamma deve cucire la tasca da qualche giorno
3 L'inno nazionale può essere cantato dal vincitore del festival con solo pianoforte
4 Una bistecca deve essere mangiata dalla turista da sola
5 Il manuale è insegnato nell'aula magna
6 Questi attrezzi devono essere intagliati da manuale
7 I due fratelli studiano con molta attenzione
8 ???
Answer set
1 La pasta frolla deve impastare da sola
2 L'autrice deve poter scrivere da qualche giorno
3 I libri di testo devono poter essere studiati dai candidati
4 Questi stilisti devono poter essere tessuti dai vestiti per la parata
5 Questi motivi greci possono tessere questi stilisti
6 L'idraulico saldò i cavi del lampadario
7 La stanza pulisce da una delle propretarie dell'albergo
8 Le sommozzatrici pescarono da delle trote

Figure 3: Two instances of BLM-OdI data: with little (type II) and maximal (type III) lexical variation.

Context                        Answer set
1 Ag  Akt  Pat    p-NP         1 Pat Akt  by-NP    Correct
2 Ag  Akt  Pat    by-NP        2 Ag  Akt  by-NP    I-Int
3 Pat Pass by-Ag  p-NP         3 Pat Pass by-Ag    ER-Pass
4 Pat Pass by-Ag  by-NP        4 Ag  Pass by-Pat   IER-Pass
5 Pat Pass p-NP                5 Pat Akt  Ag       R-Trans
6 Pat Pass by-NP               6 Ag  Akt  Pat      R-Trans
7 Pat Akt  p-NP                7 Pat Akt  by-Ag    E-WrBy
8 ???                          8 Ag  Akt  by-Pat   IE-WrBy

Figure 4: BLM-CausI Template. Three generative rules: (i) the presence of either one or two arguments and their attributes (agents, Ag; patients, Pat); (ii) the active (Akt) and passive (Pass) voice of the verb; (iii) the number and quality of nominal phrases (NP) following the verb. Answers: I-Int=wrong subject semantic role; ER-Pass=wrong verb mood; IER-Pass=wrong mood and wrong subject semantic role; R-trans=wrong sequence reasoning (transitive sentence with the second NP not preceded by a preposition); IE-WrBy=ungrammatical sentence (NP following the preposition da).

Context                        Answer set
1 Ag  Akt  Pat    p-NP         1 Pat Akt  by-NP    I-Int
2 Ag  Akt  Pat    by-NP        2 Ag  Akt  by-NP    Correct
3 Pat Pass by-Ag  p-NP         3 Pat Pass by-Ag    IER-Pass
4 Pat Pass by-Ag  by-NP        4 Ag  Pass by-Pat   ER-Pass
5 Pat Pass p-NP                5 Pat Akt  Ag       IR-Trans
6 Pat Pass by-NP               6 Ag  Akt  Pat      R-Trans
7 Ag  Akt  p-NP                7 Pat Akt  by-Ag    IE-WrBy
8 ???                          8 Ag  Akt  by-Pat   E-WrBy

Figure 5: BLM-OdI Template. Same generative rules as BLM-CausI, with the difference that here the passive/active voice is confounding, and the correct answer is an erroneous answer for BLM-CausI.

3. Data description

The data is generated by the process described in Figure 6: (i) start from identifying a linguistic phenomenon of interest, its forms of expression and factors influencing it within a context, (ii) produce a set of seed examples from natural or synthetic data, (iii) automatically augment the seeds using a fill-mask strategy, (iv) produce BLM instances following the designed templates and generative rules. Two instances of Od verb alternations are shown in Figure 3.

Figure 6: BLM data generation process, from seed examples of a linguistic problem to the complete dataset

3.1. Origin of data

BLM-AgrI To instantiate the templates, our starting point is the set of examples in Franck et al. [14, Appendix 1]. They provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced based on these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentence [6]. Each of these sentences is used to produce a seed.

BLM-CausI and BLM-OdI Thirty verbs from each of the causative and object-drop classes in English in Levin [13] were selected and translated by a native speaker into Italian, where translations maintain the same alternation structure.

The seeds were augmented using masked modeling on bert-base-uncased [15]. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences, and to ensure variability in gender and number.

3.2. Data format

The structured BLM data is provided in a json file, each instance as one element with specific fields described in Figure 7.

3.3. Detailed data statistics

For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances. The rest will be used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.

dataset               (few-shot) train   test
BLM-AgrI (II/III)     10                 2000
BLM-CausI (II/III)    80                 2080
BLM-OdI (II/III)      80                 2080

Table 1: Data statistics for the three datasets, in terms of few-shot training and testing. There are the same number of examples in the type II (small lexical variation within an instance) and type III (maximal lexical variation within an instance) variations of the three datasets.

3.4. Example of prompts

We design prompts in English and Italian in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs' ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and focus on the core task of selecting the best sentence to follow the given context.

Zero-Shot Prompt Example in English The prompt in Figure 8 is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, like chain-of-thought or step-by-step reasoning [16, 17]. This ensures that the model's performance reflects its intrinsic capabilities for linguistic understanding and reasoning without prior in-context learning or guided reasoning steps.

We format the prompt in Markdown and explicitly label the sections for Context and Answer Set. The task is framed as a simple "puzzle" with the instruction to "choose […] the sentence that could […] follow the context". This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and simplify the evaluation by fixing its output format.
Few-Shot (One-Shot) Prompt Example in Italian For the one-shot prediction setup (as is shown in Figure 9), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model's ability to use prior examples and adapt to a new linguistic context.

{
  "ID": ,
  "Context": [],
  "Context_concatenated": ,
  "Answer_set": [],
  "Answer_concatenated": ,
  "Correct_option": ,
  "Correct_answer": ,
  "Answer_set_annotation": [{"label": , "value": , "option": }],
  "Verb":
},

Figure 7: Data format

A data instance is shown in Figure 10 in the appendix.

# TASK: I'm asking you to solve a puzzle. The language of the puzzle is Italian.
I will give you a list of sentences (numbered from 1 to 7) called the **Context**, and a set of sentences (identified by capital letters) called the **Answer Set**.
Your task is to choose among the **Answer Set** the sentence that could be the next sentence following the **Context**.
# FORMAT: You should **ONLY** output the letter corresponding to the best answer. Do not output other text before or after.
# QUESTION
**Context**
{{Context_concatenated}}
**Answer Set**
{{Answer_concatenated}}
**Your Choice**

Figure 8: Zero-Shot Prompt in English.

# COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano.
Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**.
Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**.
# FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.
# ESEMPIO 1
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**Scelta corretta**
{Correct_option}
# DOMANDA
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**La tua scelta**

Figure 9: Few (One)-Shot Prompt in Italian.

4. Metrics

We perform zero-shot and one-shot evaluation on BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs) with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report averaged F1 scores over 3 runs in Table 2.

Results

           English Prompt                    Italian Prompt
Model      Zero-Shot       One-Shot          Zero-Shot       One-Shot

BLM-AgrI type II
ML-70      44.1 ± 0.46     44.88 ± 4.63      39.46 ± 0.79    35.62 ± 2.36
ML-8       22.34 ± 0.33    17.84 ± 0.48      16.66 ± 1.56    19.30 ± 2.30
M-7        25.54 ± 0.58    30.66 ± 4.60      17.41 ± 1.37    21.1 ± 2.26
G-2        42.75 ± 1.01    43.64 ± 2.25      42.87 ± 0.62    40.62 ± 1.83

BLM-AgrI type III
ML-70      45.64 ± 0.05    41.35 ± 6.71      40.48 ± 0.52    34.89 ± 5.93
ML-8       26.65 ± 1.71    21.00 ± 2.07      22.68 ± 1.41    19.58 ± 5.68
M-7        31.26 ± 1.60    12.75 ± 6.28      33.21 ± 0.91    19.64 ± 6.02
G-2        38.48 ± 1.12    39.36 ± 3.27      36.54 ± 1.18    42.52 ± 6.83

BLM-CausI type II
ML-70      19.97 ± 0.65    36.81 ± 10.11     16.46 ± 0.36    31.95 ± 8.75
ML-8       5.85 ± 0.20     9.57 ± 5.20       6.72 ± 0.09     7.12 ± 3.00
M-7        8.45 ± 0.44     7.66 ± 1.87       5.94 ± 0.04     6.21 ± 1.02
G-2        18.06 ± 0.25    25.64 ± 4.30      14.23 ± 0.16    21.81 ± 3.93

BLM-CausI type III
ML-70      26.49 ± 0.85    24.14 ± 3.34      25.27 ± 0.72    23.78 ± 7.16
ML-8       18.03 ± 1.52    4.65 ± 0.38       16.59 ± 0.49    10.52 ± 2.21
M-7        20.08 ± 0.76    8.69 ± 3.12       14.91 ± 0.15    13.05 ± 2.05
G-2        29.12 ± 0.73    25.93 ± 4.98      28.8 ± 0.04     25.41 ± 2.94

BLM-OdI type II
ML-70      18.28 ± 2.18    32.51 ± 5.77      17.89 ± 1.06    24.61 ± 5.31
ML-8       8.55 ± 0.21     9.18 ± 1.62       9.1 ± 0.41      5.25 ± 2.92
M-7        1.92 ± 0.27     7.11 ± 3.59       2.79 ± 0.07     5.69 ± 1.31
G-2        14.07 ± 0.78    27.64 ± 4.63      14.43 ± 0.08    23.70 ± 2.42

BLM-OdI type III
ML-70      17.70 ± 0.32    20.05 ± 6.28      18.10 ± 0.44    23.01 ± 4.56
ML-8       9.50 ± 0.95     3.20 ± 0.57       10.78 ± 0.61    3.64 ± 0.85
M-7        11.60 ± 0.64    7.45 ± 4.27       9.74 ± 0.01     6.6 ± 2.19
G-2        14.74 ± 0.40    14.75 ± 3.55      15.49 ± 1.54    18.58 ± 1.60

[Bar charts, one per task block: F1 Macro (%) per model (Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B-Instruct, Mistral-7B-Instruct-v0.3, gemma-2-9b-it), grouped by prompt language and number of shots (en | 0-shot, en | 1-shot, it | 0-shot, it | 1-shot).]

Table 2: Evaluation results on BLM-It tasks (AgrI, CausI, and OdI) using macro averaged F1 score (over 3 runs) and standard deviations (±std). Each run was evaluated with 100 samples, one instance at a time, for Meta-Llama-3-70B-Instruct (ML-70), Meta-Llama-3-8B-Instruct (ML-8), Mistral-7B-Instruct-v0.3 (M-7), Gemma-2-9b-It (G-2). Best performance is in bold, second best, if overlapping intervals, in italics.

BLM-AgrI tasks Meta-Llama-3-70B-Instruct consistently outperforms the other models, particularly in zero-shot English prompts, while also competitive in one-shot settings.
Gemma-2-9b-it shows robust performance, especially with Italian prompts, performing similarly to the larger Meta-Llama model. In contrast, smaller models, such as Meta-Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.3, perform more weakly, especially with Italian prompts.

BLM-CausI tasks Meta-Llama-3-70B-Instruct leads across both English and Italian prompts, with improvement in one-shot English for type II. Gemma-2-9b-it shows comparable performance across both languages, in both zero-shot and one-shot settings. Smaller models perform worse for this task, especially in one-shot Italian prompts.

BLM-OdI tasks OdI tasks show the lowest overall performance across models. This indicates that the task is the most complex and challenging for the models. Meta-Llama-3-70B-Instruct performs best, particularly in one-shot English and Italian prompts. However, Mistral-7B-Instruct-v0.3 struggles the most, particularly in zero-shot settings, which suggests that the model has limited generalisation capabilities in complex linguistic tasks.

Key Observations Larger models, such as Meta-Llama-3-70B-Instruct and Gemma-2-9b-it, consistently outperform smaller models, showing better generalisation and stability across tasks. English prompts generally result in higher F1 scores, though Italian prompts sometimes achieve comparable performance, particularly with Gemma-2-9b-it. One-shot prompting tends to improve performance, though the degree of improvement varies by model and task complexity. Smaller models, such as Mistral-7B-Instruct and Meta-Llama-3-8B-Instruct, show substantial variance, especially in one-shot scenarios, indicating instability in complex linguistic tasks.

Comparison with Multitask Learning Approaches We compare our LLM prompting results with the work of [12, 11], which explored the properties of Italian sentence embeddings (the embeddings of the [CLS] token from a pretrained Electra model [18]3) through the agreement and the causative and object-drop BLM datasets, using a two-level Variational Encoder-Decoder architecture. This system learns to compress the sentence embeddings into representations relevant for the specific BLM tasks. The dataset statistics, and results on the individual BLM tasks as averaged F1 score over three runs and different amounts of lexical variation, are shown in Table 3.

3 Italian Electra (E-It) pretrained model: dbmdz/electra-base-italian-xxl-cased-discriminator, multi-lingual Electra (E-M) model: google/electra-base-discriminator

                                     avg F1
dataset               train:test     E-M              E-It
BLM-AgrI type II      2400:4121      0.881 (0.003)    0.784 (0.007)
BLM-AgrI type III     2400:4121      0.874 (0.006)    0.336 (0.005)
BLM-CausI type II     2160:240       0.486 (0.005)    0.903 (0.010)
BLM-CausI type III    2160:240       0.475 (0.010)    0.918 (0.010)
BLM-OdI type II       2160:240       0.596 (0.010)    0.983 (0.003)
BLM-OdI type III      2160:240       0.592 (0.024)    0.994 (0.004)

Table 3: Dataset statistics and evaluation results on a two-level variational encoder-decoder architecture using an Italian Electra (E-It) and a multilingual Electra (E-M) pretrained model to provide sentence embeddings.

While not directly comparable due to the different training process and the different test data, systems using pretrained transformer encoder architectures, like Electra, significantly outperform the zero- and one-shot prompting baselines. The performance gap suggests that while zero- or one-shot prompting is flexible, it may not capture the complex syntactic and semantic features required for the BLM task in Italian.

5. Limitations

While the data is very rich and richly structured, it shares all the limitations of artificial and synthetic data: stilted sentence structure, limited variability, possibly sentences that are too short. This artificiality, though, might reduce, without eliminating, the risk of having sentences that were directly seen in the training data of the pretrained models that will be used, and that we use, for further experiments.

The initial seed sentences, although minimal, were crafted by experts. This approach is deliberate, like in the ARC dataset, to guarantee that the data are not algorithmically reproducible [19]. This expert-based approach, though, might not be easily scalable, especially given the complexity of the data. Exploring methods to leverage existing datasets for seed generation could mitigate this dependency.

The current dataset comprises three main tasks. More tasks and variants are needed to demonstrate the robustness and the wider appeal of the data.

6. Ethical issues

The data presented include an augmentation step that uses large language models (LLMs). LLMs are trained on extensive text data, which may unintentionally incorporate biases present in the training corpus.

7. Data license and copyright issues

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.

Acknowledgments

We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.

References

[1] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA – Challenge the Abilities of LAnguage Models in ITAlian: Overview, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024.
[2] P. Merlo, Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
[3] J. C. Raven, Standardization of progressive matrices, British Journal of Medical Psychology 19 (1938) 137–150.
[4] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (2023) 1526–1541. URL: https://doi.org/10.1038/s41562-023-01659-w. doi:10.1038/s41562-023-01659-w.
[5] X. Hu, S. Storks, R. Lewis, J. Chai, In-context analogical reasoning with pre-trained language models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1953–1969. URL: https://aclanthology.org/2023.acl-long.109.
[6] A. An, C. Jiang, M. A. Rodriguez, V. Nastase, P. Merlo, BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1363–1374. URL: https://aclanthology.org/2023.eacl-main.99.
[7] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), Toronto, Canada, 2023.
[8] G. Samo, V. Nastase, C. Jiang, P. Merlo, BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023.
[9] V. Nastase, P. Merlo, Are there identifiable structural parts in the sentence embedding whole?, 2024. URL: https://aclanthology.org/2024.blackboxnlp-1.3. doi:10.18653/v1/2024.blackboxnlp-1.3.
[10] V. Nastase, P. Merlo, Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification, in: Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024), Bangkok, Thailand, 2024, pp. 203–214. URL: https://aclanthology.org/2024.repl4nlp-1.15.
[11] V. Nastase, G. Samo, C. Jiang, P. Merlo, Exploring Italian sentence embeddings properties through multi-tasking, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024), Pisa, Italy, 2024.
[12] V. Nastase, C. Jiang, G. Samo, P. Merlo, Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024), Pisa, Italy, 2024.
[13] B. Levin, English verb classes and alternations: A preliminary investigation, University of Chicago Press, 1993.
[14] J. Franck, G. Vigliocco, J. Nicol, Subject-verb agreement errors in french and english: The role of syntactic hierarchy, Language and cognitive processes 17 (2002) 371–404.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837.
[17] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in neural information processing systems 35 (2022) 22199–22213.
[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.
[19] F. Chollet, On the measure of intelligence, 2019. URL: https://arxiv.org/abs/1911.01547. arXiv:1911.01547.

A.
Example Data Format [{ "ID": 215, "Context": [ "le pittrici possono disegnare delle forme in meno di due giorni", "le artiste possono disegnare delle rappresentazioni artistiche da un mese", "alcune coreografie sono disegnate dalle pittrici nel salone espositivo", "delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese", "alcune coreografie devono essere disegnate con pochi mezzi economici", "le scenografie devono essere disegnate da pochi mesi", "le pittrici devono disegnare nel salone espositivo"], "Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo", "Answer_set": [ "delle rappresentazioni artistiche devono poter disegnare le sue allieve", "le scenografie devono essere disegnate dalle sue allieve", "le sue allieve devono essere disegnate da delle rappresentazioni artistiche", "le pittrici possono disegnare le scenografie", "le pittrici possono disegnare da un anno circa", "delle forme devono poter disegnare da pochi mesi", "le artiste devono poter disegnare da alcune coreografie", "delle rappresentazioni artistiche devono disegnare dalle artiste"], "Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle forme devono poter disegnare da pochi mesi\nG\tle artiste 
devono poter disegnare da alcune coreografie\nE\tdelle rappresentazioni artistiche devono disegnare dalle artiste", "Correct_option": "E", "Correct_answer": "le pittrici possono disegnare da un anno circa", "Answer_set_annotation": [ { "label": "IR-trans", "value": false, "option": "A" }, { "label": "IER-pass", "value": false, "option": "B" }, { "label": "ER-pass", "value": false, "option": "C" }, { "label": "R-trans", "value": false, "option": "D" }, { "label": "Correct", "value": true, "option": "E" }, { "label": "I-Int", "value": false, "option": "F" }, { "label": "E-WrBy", "value": false, "option": "G" }, { "label": "IE-WrBy", "value": false, "option": "H" } ], "Verb": "disegnare" }, .... ] Figure 10: Sample entry formatted for usage with the provided prompts.
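The relations among the fields of the sample entry in Figure 10 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the official challenge loader: the helper names `option_letters`, `concatenate`, `check_entry` and `error_label` are hypothetical, and the assumption that option letters follow the order of `Answer_set` is inferred from the sample entry rather than stated in the task description. The entry below is a reduced copy of the sample (the context fields are omitted for brevity).

```python
import string

# Reduced copy of the sample entry in Figure 10 (context fields omitted).
entry = {
    "ID": 215,
    "Answer_set": [
        "delle rappresentazioni artistiche devono poter disegnare le sue allieve",
        "le scenografie devono essere disegnate dalle sue allieve",
        "le sue allieve devono essere disegnate da delle rappresentazioni artistiche",
        "le pittrici possono disegnare le scenografie",
        "le pittrici possono disegnare da un anno circa",
        "delle forme devono poter disegnare da pochi mesi",
        "le artiste devono poter disegnare da alcune coreografie",
        "delle rappresentazioni artistiche devono disegnare dalle artiste",
    ],
    "Correct_option": "E",
    "Correct_answer": "le pittrici possono disegnare da un anno circa",
    "Answer_set_annotation": [
        {"label": "IR-trans", "value": False, "option": "A"},
        {"label": "IER-pass", "value": False, "option": "B"},
        {"label": "ER-pass", "value": False, "option": "C"},
        {"label": "R-trans", "value": False, "option": "D"},
        {"label": "Correct", "value": True, "option": "E"},
        {"label": "I-Int", "value": False, "option": "F"},
        {"label": "E-WrBy", "value": False, "option": "G"},
        {"label": "IE-WrBy", "value": False, "option": "H"},
    ],
    "Verb": "disegnare",
}


def option_letters(n):
    """Return the first n option letters: A, B, C, ... (assumed convention)."""
    return list(string.ascii_uppercase[:n])


def concatenate(items, labels):
    """Join items as 'label<TAB>item' lines, mirroring the *_concatenated fields."""
    return "\n".join(f"{lab}\t{item}" for lab, item in zip(labels, items))


def check_entry(e):
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    letters = option_letters(len(e["Answer_set"]))
    by_letter = dict(zip(letters, e["Answer_set"]))
    if by_letter.get(e["Correct_option"]) != e["Correct_answer"]:
        problems.append("Correct_option does not point at Correct_answer")
    marked_true = [a["option"] for a in e["Answer_set_annotation"] if a["value"]]
    if marked_true != [e["Correct_option"]]:
        problems.append("annotation does not mark exactly the correct option")
    if [a["option"] for a in e["Answer_set_annotation"]] != letters:
        problems.append("annotation options are not A, B, C, ... in order")
    return problems


def error_label(e, predicted_option):
    """Map a predicted option letter to its annotation label, or None."""
    for a in e["Answer_set_annotation"]:
        if a["option"] == predicted_option:
            return a["label"]
    return None


if __name__ == "__main__":
    print(check_entry(entry))
    print(error_label(entry, "D"))
    print(concatenate(entry["Answer_set"], option_letters(8)))
```

A model's predicted letter can then be scored by comparing it with `Correct_option`, while `error_label` gives the error type of a wrong choice (for instance, choosing option D corresponds to the label `R-trans`), which supports an error analysis beyond plain accuracy.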