<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chunyang</forename><surname>Jiang</surname></persName>
							<email>chunyang.jiang@unige.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Samo</surname></persName>
							<email>giuseppe.samo@idiap.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vivi</forename><surname>Nastase</surname></persName>
							<email>vivi.a.nastase@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paola</forename><surname>Merlo</surname></persName>
							<email>paola.merlo@unige.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<meeting>Tenth Italian Conference on Computational Linguistics
							<address>
								<settlement>Pisa</settlement>
								<country key="IT">Italy</country>
							</address>
						</meeting>
						<imprint>
							<date type="published" when="2024">Dec 04-06, 2024</date>
						</imprint>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">63307A04A2533E51CD03A074AC76DA3F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Blackbird Language Matrices</term>
					<term>Causative/inchoative alternation</term>
					<term>Object-drop alternation</term>
					<term>subject-verb number agreement</term>
					<term>rule-based abstraction</term>
					<term>disentanglement</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles designed to probe language-related problems and investigate deeper formal and semantic properties of language through a process of paradigm understanding. A BLM instance consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced by corrupting the generating rules. We propose three subtasks, agreement concord (Agr), causative (Caus) and object-drop (Od) alternation detection, each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Motivation</head><p>Current generative large language models (LLMs) translate across closely related languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in distinctly non-human ways. As evidenced by their need for prohibitively large training data and expensive computational resources, large language models neither generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.</p><p>To reach systematic abilities of abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and help us train them towards more complex skills.</p><p>In the CALAMITA challenge <ref type="bibr" target="#b0">[1]</ref>, we propose the task of solving Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven Progressive Matrices tests <ref type="bibr" target="#b1">[2]</ref>. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules <ref type="bibr">[3]</ref>. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure <ref type="figure" target="#fig_0">1</ref>. Unlike other attempts to create textual versions of RPMs, BLMs are neither simplistic transcriptions of visual stimuli <ref type="bibr" target="#b3">[4]</ref>, a technique that, in practice, might give away parts of the solution to the problem, nor auxiliary abstractions of stimuli in the visual domain <ref type="bibr" target="#b4">[5]</ref>. Instead, BLMs are matrices developed specifically to probe language-related problems and delve into deeper formal and semantic properties of language, through a process of linguistic paradigm understanding.</p><p>Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure <ref type="figure">2</ref>.</p><p>BLM datasets are richly structured and support many different types of investigations, at both the sentence and matrix levels. The context-answer setup supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>. The regular syntactic forms and the systematic semantic properties support investigations of systematicity and compositionality in neural networks. The predictable syntactic structure of individual sentences, and the structure within the sequence of a BLM context, also support investigations of sentence embeddings <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b10">10]</ref>. BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref>. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for the linguistic investigation of specific phenomena.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The BLM-It Challenge</head><p>The BLM-It challenge consists of six sub-tasks. All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem addressed (Agr, Caus, Od) and the lexical complexity of the data (II, III). While the agreement (Agr) task focuses on the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs: their ability to enter (or not) into a causative alternation, and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-AgrI</head><p>The BLM problem for subject-verb agreement <ref type="bibr" target="#b5">[6]</ref> consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other respects, e.g. the number of intervening attractors between the subject and the verb, the grammatical numbers of these attractors, and the clause structures. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template is shown in Figure <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-CausI</head><p>The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra/La finestra si è aperta 'The artist opened the window'/'The window opened'). The transitive form of the verb has a causative meaning <ref type="bibr" target="#b13">[13]</ref>.</p><p>The BLM-CausI template is shown in Figure <ref type="figure">4</ref>. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and on the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by various prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat).</p><p>The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents, and the presence (or absence) of a prepositional phrase.</p><p>We choose names of tasks and lexical complexity levels that make it easy to cross-reference and compare the data described here with other papers published on BLMs. Our datasets are available at: https://www.idiap.ch/en/scientific-research/data/blm-agri-gen, https://www.idiap.ch/en/scientific-research/data/blm-causi-gen, https://www.idiap.ch/en/scientific-research/data/blm-odi-gen.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-OdI</head><p>The BLM-OdI template is minimally different from BLM-CausI, and the two templates act as each other's controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L'artista dipingeva la finestra/L'artista dipingeva 'the artist painted the window'/'the artist painted'), and the verb does not have a causative meaning <ref type="bibr" target="#b13">[13]</ref>.</p><p>The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here.</p><p>The template for BLM-OdI is in Figure <ref type="figure">5</ref>. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the two BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP.</p><p>Lexical variants Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes within a matrix, compared to the other sentences, while in type III data all words can change. Instances of the two variants are shown in Figure <ref type="figure">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>The data is generated by the process described in Figure <ref type="figure" target="#fig_3">6</ref>: (i) identify a linguistic phenomenon of interest, its forms of expression, and the factors influencing it within a context; (ii) produce a set of seed examples from </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data: BLM-AgrI</head><p>To instantiate the templates, our starting point is the set of examples in Franck et al. <ref type="bibr">[14, appendix 1]</ref>. These provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced from these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentence <ref type="bibr" target="#b5">[6]</ref>. Each of these sentences is used to produce a seed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-CausI and BLM-OdI</head><p>Thirty verbs from each of the causative and object-drop classes for English in Levin <ref type="bibr" target="#b13">[13]</ref> were selected and translated into Italian by a native speaker, such that the translations maintain the same alternation structure.</p><p>The seeds were augmented using masked language modeling with bert-base-uncased <ref type="bibr" target="#b15">[15]</ref>. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences and to ensure variability in gender and number.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data format</head><p>The structured BLM data is provided in a JSON file, with each instance as one element with the specific fields described in Figure <ref type="figure">7</ref>. A data instance is shown in Figure <ref type="figure">10</ref>.</p></div>
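As a minimal sketch in Python of reading and sanity-checking an instance in this format: the instance below is invented for illustration, and only a subset of the fields described in Figure 7 is shown.

```python
import json

# A hypothetical BLM instance with a subset of the fields described in
# Figure 7 (sentences shortened for illustration).
raw = json.dumps([{
    "ID": 1,
    "Context": ["frase 1", "frase 2", "frase 3", "frase 4",
                "frase 5", "frase 6", "frase 7"],
    "Answer_set": ["risposta A", "risposta B", "risposta C"],
    "Correct_option": "B",
    "Correct_answer": "risposta B",
}])

def check_instance(inst):
    """Verify that Correct_option points at Correct_answer in Answer_set."""
    idx = ord(inst["Correct_option"]) - ord("A")
    return inst["Answer_set"][idx] == inst["Correct_answer"]

instances = json.loads(raw)
assert all(check_instance(i) for i in instances)
```

Such a consistency check is cheap and catches mislabelled options before any evaluation is run.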
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Data statistics for the three datasets, in terms of few-shot training and testing. There are the same number of examples in the type II (small lexical variation within an instance) and type III (maximal lexical variation within an instance) variations of the three datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Detailed data statistics</head><p>For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances. The rest will be used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.</p></div>
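The two splitting strategies described above can be sketched as follows; this is a sketch under the stated statistics, the function names are illustrative, and the toy instances carry only the fields needed here.

```python
import random

def split_agr(instances, n_fewshot=10, seed=0):
    """BLM-AgrI: randomly sample n_fewshot instances; the rest is the test set."""
    rng = random.Random(seed)
    fewshot = rng.sample(instances, n_fewshot)
    test = [i for i in instances if i not in fewshot]
    return fewshot, test

def split_by_verb(instances, heldout_verb):
    """BLM-CausI / BLM-OdI: all instances of one verb (based on the correct
    answer's verb) go to few-shot training; the rest is the test set."""
    fewshot = [i for i in instances if i["Verb"] == heldout_verb]
    test = [i for i in instances if i["Verb"] != heldout_verb]
    return fewshot, test

# Toy data: 3 verbs x 4 instances each
toy = [{"ID": n, "Verb": v}
       for n, v in enumerate(["aprire", "mangiare", "dipingere"] * 4)]
fs, te = split_by_verb(toy, "aprire")
```

Holding out a whole verb for few-shot training (rather than a random sample) keeps the test verbs unseen in the prompt examples.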
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Example of prompts</head><p>We design prompts in English and Italian in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs' ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and focus on the core task of selecting the best sentence to follow the given context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Zero-Shot Prompt Example in English</head><p>The prompt in Figure <ref type="figure">8</ref> is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, such as chain-of-thought or step-by-step reasoning <ref type="bibr" target="#b16">[16,</ref><ref type="bibr" target="#b17">17]</ref>. This ensures that the model's performance reflects its intrinsic capabilities for linguistic understanding and reasoning, without prior in-context learning or guided reasoning steps.</p><p>We format the prompt in Markdown and explicitly label the Context and Answer Set sections. The task is framed as a simple "puzzle" with the instruction to "choose […] the sentence that could […] follow the context". This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and to simplify the evaluation by fixing the output format.</p></div>
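A sketch of how a prompt with these properties might be assembled; the wording below is illustrative and not the exact prompt used in the challenge.

```python
def build_zero_shot_prompt(context, answer_set):
    """Assemble a Markdown-formatted zero-shot prompt with labelled sections.

    context: list of 7 sentences; answer_set: list of candidate sentences.
    The model is instructed to output only the letter of its choice.
    """
    ctx = "\n".join(f"{n}\t{s}" for n, s in enumerate(context, start=1))
    ans = "\n".join(f"{chr(ord('A') + n)}\t{s}"
                    for n, s in enumerate(answer_set))
    return (
        "# TASK: Solve the following puzzle.\n"
        "Choose from the **Answer Set** the sentence that could follow "
        "the **Context**.\n\n"
        f"## Context\n{ctx}\n\n## Answer Set\n{ans}\n\n"
        "# FORMAT: Output ONLY the letter of the best answer."
    )

prompt = build_zero_shot_prompt(["s1"] * 7, ["a", "b", "c"])
```

Fixing the output to a single letter makes the evaluation a string comparison against the gold option, with no answer parsing needed.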
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Few-Shot (One-Shot) Prompt Example in Italian</head><p>For the one-shot prediction setup (shown in Figure <ref type="figure" target="#fig_7">9</ref>), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model's ability to use prior examples.</p><p>Each data instance has the following fields: { "ID": &lt;ID number&gt;, "Context": [&lt;list of comma-separated, double-quoted sentences&gt;], "Context_concatenated": &lt;double-quoted concatenation of the context sentences, each prefixed by a numeral (1 to 7) followed by a tab, separated by newlines&gt;, "Answer_set": [&lt;list of comma-separated, double-quoted sentences&gt;], "Answer_concatenated": &lt;double-quoted concatenation of the answer sentences, each prefixed by a letter (A, B, C, ...) followed by a tab, separated by newlines&gt;, "Correct_option": &lt;double-quoted single letter label&gt;, "Correct_answer": &lt;double-quoted single correct answer sentence&gt;, "Answer_set_annotation": [&lt;list of comma-separated triplets {"label": &lt;error type&gt;, "value": &lt;truth value&gt;, "option": &lt;single letter label&gt;}&gt;], "Verb": &lt;double-quoted single verb&gt; }</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>We perform zero-shot and one-shot evaluation on the BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs), with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report F1 scores averaged over the three runs in Table <ref type="table">2</ref>.</p><p>BLM-OdI tasks The OdI tasks show the lowest overall performance across models, indicating that this task is the most complex and challenging for the models. Meta-Llama-3-70B-Instruct performs best, particularly with one-shot English and Italian prompts. Mistral-7B-Instruct-v0.3 struggles the most, particularly in zero-shot settings, reflecting limited generalisation capabilities in complex linguistic tasks.</p></div>
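The scoring can be sketched as follows. This is a sketch under an assumption: the paper does not specify the F1 averaging, and with exactly one gold and one predicted option per instance, micro-averaged F1 reduces to plain accuracy, which is what we compute here; the run data are invented.

```python
def micro_f1(gold, pred):
    """With exactly one gold and one predicted label per instance,
    micro-averaged F1 reduces to plain accuracy."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def averaged_f1(runs):
    """Average the per-run scores over independent runs."""
    scores = [micro_f1(g, p) for g, p in runs]
    return sum(scores) / len(scores)

# Toy example: three independent runs over four instances each
runs = [
    (["A", "B", "C", "D"], ["A", "B", "C", "A"]),   # 3/4 correct
    (["A", "B", "C", "D"], ["A", "B", "D", "D"]),   # 3/4 correct
    (["A", "B", "C", "D"], ["A", "B", "C", "D"]),   # 4/4 correct
]
```

Averaging over independent runs, rather than pooling predictions, also exposes the run-to-run variance noted for the smaller models.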
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Key Observations</head><p>Larger models, such as Meta-Llama-3-70B-Instruct and Gemma-2-9b-it, consistently outperform smaller models, showing better generalisation and stability across tasks. English prompts generally result in higher F1 scores, though Italian prompts sometimes achieve comparable performance, particularly with Gemma-2-9b-it. One-shot prompting tends to improve performance, though the degree of improvement varies by model and task complexity. Smaller models, such as Mistral-7B-Instruct and Meta-Llama-3-8B-Instruct, show substantial variance, especially in one-shot scenarios, indicating instability on complex linguistic tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison with Multitask Learning Approaches</head><p>We compare our LLM prompting results with the work of <ref type="bibr" target="#b12">[12,</ref><ref type="bibr" target="#b11">11]</ref>, which explored the properties of Italian sentence embeddings (the embeddings of the [CLS] token from a pretrained Electra model <ref type="bibr" target="#b18">[18]</ref><ref type="foot" target="#foot_0">3</ref>) through the agreement, causative and object-drop BLM datasets, using a two-level variational encoder-decoder architecture. This system learns to compress the sentence embeddings into representations relevant to the specific BLM tasks. The dataset statistics, and the results on the individual BLM tasks as F1 scores averaged over three runs and over different amounts of lexical variation, are shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>While the results are not directly comparable, due to the different training process and different test data, the approach using pretrained transformer encoder architectures, such as Electra, significantly outperforms the zero- and one-shot prompting baselines. The performance gap suggests that while zero- or one-shot prompting is flexible, it may not capture the complex syntactic and semantic features required for the BLM task in Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>While the data is very rich and richly structured, it shares all the limitations of artificial and synthetic data: stilted sentence structure, limited variability, and possibly overly short sentences. This artificiality, though, might reduce, without eliminating, the risk that sentences were directly seen in the training data of the pretrained models that we use, and that others will use, for further experiments.</p><p>The initial seed sentences, although minimal, were crafted by experts. This approach is deliberate, as in the ARC dataset, to guarantee that the data are not algorithmically reproducible <ref type="bibr" target="#b19">[19]</ref>. This expert-based approach, though, might not scale easily, especially given the complexity of the data. Exploring methods to leverage existing datasets for seed generation could mitigate this dependency.</p><p>The current dataset comprises three main tasks. More tasks and variants are needed to demonstrate the robustness and the wider appeal of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>The data presented include an augmentation step that uses large language models (LLMs). LLMs are trained on extensive text data, which may unintentionally incorporate biases present in the training corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).</figDesc><graphic coords="1,302.62,310.48,203.35,70.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :Figure 4 :</head><label>34</label><figDesc>Figure 3: Two instances of BLM-OdI data: with little (type II) and maximal (type III) lexical variation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Context 1 AgFigure 5 :</head><label>15</label><figDesc>Figure 5: BLM-OdI Template. Same generative rules as BLM-CausI, with the difference that here the passive/active voice is confounding, and the correct answer is an erroneous answer for BLM-CausI.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: BLM data generation process, from seed examples of a linguistic problem to the complete dataset</figDesc><graphic coords="4,109.63,84.19,162.69,119.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 :Figure 8 :</head><label>78</label><figDesc>Figure 7: Data format</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>#</head><label></label><figDesc>COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano. Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**. Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**. # FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Few (One)-Shot Prompt in Italian.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc></figDesc><table><row><cell>Dataset statistics and evaluation results on a two-level varia-</cell></row><row><cell>tional encoder-decoder architecture using an Italian Electra</cell></row><row><cell>(E-It) and a multilingual Electra (E-M) pretrained model to</cell></row><row><cell>provide sentence embeddings.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Italian Electra (E-It) pretrained model: dbmdz/electra-baseitalian-xxl-cased-discriminator, multi-lingual Electra (E-M) model: google/electra-base-discriminator</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through grant SNF Advanced grant TMAG-1_209426 to PM.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(P. Merlo) GLOBE https://www.idiap.ch/en/scientific-research/researchers</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Example Data Format</head><p>[{ "ID": 215, "Context": [ "le pittrici possono disegnare delle forme in meno di due giorni", "le artiste possono disegnare delle rappresentazioni artistiche da un mese", "alcune coreografie sono disegnate dalle pittrici nel salone espositivo", "delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese", "alcune coreografie devono essere disegnate con pochi mezzi economici", "le scenografie devono essere disegnate da pochi mesi", "le pittrici devono disegnare nel salone espositivo"], "Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo", "Answer_set": [ "delle rappresentazioni artistiche devono poter disegnare le sue allieve", "le scenografie devono essere disegnate dalle sue allieve", "le sue allieve devono essere disegnate da delle rappresentazioni artistiche", "le pittrici possono disegnare le scenografie", "le pittrici possono disegnare da un anno circa", "delle forme devono poter disegnare da pochi mesi", "le artiste devono poter disegnare da alcune coreografie", "delle rappresentazioni artistiche devono disegnare dalle artiste"], "Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle 
forme devono poter disegnare da pochi mesi\nG\tle artiste devono poter disegnare da alcune coreografie\nH\tdelle rappresentazioni artistiche devono disegnare dalle artiste", "Correct_option": "E", "Correct_answer": "le pittrici possono disegnare da un anno circa", "Answer_set_annotation": [ { "label": "IR-trans", "value": false, "option": "A" }, { "label": "IER-pass", "value": false, "option": "B" }, { "label": "ER-pass", "value": false, "option": "C" }, { "label": "R-trans", "value": false, "option": "D" }, { "label": "Correct", "value": true, "option": "E" }, { "label": "I-Int", "value": false, "option": "F" }, { "label": "E-WrBy", "value": false, "option": "G" }, { "label": "IE-WrBy", "value": false, "option": "H" } ], "Verb": "disegnare" }, .... ] </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">CALAMITA - Challenge the Abilities of LAnguage Models in ITAlian: Overview</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
				<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.11444</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.11444" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Standardization of progressive matrices</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Raven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">British Journal of Medical Psychology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="137" to="150" />
			<date type="published" when="1938">1938</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Emergent analogical reasoning in large language models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Webb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Holyoak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41562-023-01659-w</idno>
		<ptr target="https://doi.org/10.1038/s41562-023-01659-w" />
	</analytic>
	<monogr>
		<title level="j">Nature Human Behaviour</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1526" to="1541" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">In-context analogical reasoning with pre-trained language models</title>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Storks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.acl-long.109" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1953" to="1969" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.eacl-main.99" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1363" to="1374" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Grammatical information in BERT sentence embeddings as two-dimensional arrays</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)</title>
				<meeting>the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Are there identifiable structural parts in the sentence embedding whole?</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.blackboxnlp-1.3</idno>
		<ptr target="https://aclanthology.org/2024.blackboxnlp-1.3" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.repl4nlp-1.15" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)</title>
				<meeting>the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)<address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="203" to="214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Exploring Italian sentence embeddings properties through multi-tasking</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)</title>
				<meeting>the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)</title>
				<meeting>the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Levin</surname></persName>
		</author>
		<title level="m">English verb classes and alternations: A preliminary investigation</title>
				<imprint>
			<publisher>University of Chicago Press</publisher>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Subject-verb agreement errors in French and English: The role of syntactic hierarchy</title>
		<author>
			<persName><forename type="first">J</forename><surname>Franck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vigliocco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nicol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language and Cognitive Processes</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="371" to="404" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Large language models are zero-shot reasoners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Iwasawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="22199" to="22213" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ELECTRA: Pre-training text encoders as discriminators rather than generators</title>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-T</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="18" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">On the measure of intelligence</title>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1911.01547" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
