<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Matrices for Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <email>giuseppe.samo@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <email>vivi.a.nastase@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Blackbird Language Matrices, Causative/inchoative alternation</institution>
          ,
          <addr-line>Object-drop alternation, subject-verb number agreement</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles to learn language-related problems and investigate deeper formal and semantic properties of language, through a process of paradigm understanding. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced following corrupted generating rules. We propose three subtasks, agreement concord (Agr), causative (Caus) and object-drop (Od) alternation detection, each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Blackbird Language Matrices</kwd>
        <kwd>Causative/inchoative alternation</kwd>
        <kwd>Object-drop alternation</kwd>
        <kwd>subject-verb number agreement</kwd>
        <kwd>rule-based abstraction</kwd>
        <kwd>disentanglement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction and Motivation</title>
      <p>Current generative large language models (LLMs) translate across close languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in very non-human ways. As shown by their prohibitive needs in size of training data and expensive computational resources, large language models do not generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.</p>
      <p>To reach systematic abilities in abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and that help us train them towards more complex skills.</p>
      <p>In the CALAMITA challenge [1], we propose to find the solution to Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven's Progressive Matrices tests [2]. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [3]. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure 1.</p>
      <p>[Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).]</p>
      <p>Unlike other attempts to create textual versions of RPMs, BLMs are not simplistic transcriptions of visual stimuli [4], a technique that, in practice, might give away parts of the solution to the problem, nor are they auxiliary abstractions of stimuli in the visual domain [5]. Instead, BLMs are matrices developed specifically to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of paradigm understanding.</p>
      <p>Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure 2.</p>
      <p>[Figure 2: BLM-AgrI template for verb-subject agreement, with one to two intervening phrases, built from three generative rules: (i) the subject matches the verb in number (singular or plural); (ii) material can intervene between them and is of unbounded length; (iii) singular and plural alternate in regular patterns. NP = Noun Phrase, PP = Prepositional Phrase, VP = Verb Phrase. Answer labels: Correct; Coord; WNA = wrong number of attractors; WN1 = wrong number for the first attractor noun (N1); WN2 = wrong number for the second attractor noun (N2); AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.]</p>
      <p>BLM datasets are richly structured and support many different types of investigation, at both the sentence and the matrix level. The context-answer set-up supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions [6, 7, 8]. The regular syntactic forms and the systematic semantic properties support investigations of systematicity and compositionality in neural networks. The predictable syntactic structure of the individual sentences, and the structure within the sequence of a BLM context, also support investigations of sentence embeddings [9, 10]. BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies [11, 12]. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for the linguistic investigation of specific phenomena.</p>
    </sec>
    <sec id="sec-2-1">
      <title>2. The BLM-It Challenge</title>
      <p>The BLM-It challenge consists of six sub-tasks. All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem (Agr, Caus, Od) and the lexical complexity of the data (II, III). The task names and lexical complexity levels are chosen to ease cross-reference and comparison with other papers published on BLMs. Our datasets are available at https://www.idiap.ch/en/scientific-research/data/blm-agri-gen, https://www.idiap.ch/en/scientific-research/data/blm-causi-gen, and https://www.idiap.ch/en/scientific-research/data/blm-odi-gen. While the agreement (Agr) task focuses on information about the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs: their ability to enter or not into a causative alternation, and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.</p>
      <p>BLM-AgrI. The BLM problem for subject-verb agreement [6] consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects, e.g. the number of intervening attractors between the subject and the verb, the grammatical number of these attractors, and the clause structure. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template can be seen in Figure 2.</p>
      <p>BLM-CausI. The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra / La finestra si è aperta, 'The artist opened the window' / 'The window opened'). The transitive form of the verb has a causative meaning [13].</p>
      <p>The BLM-CausI template is shown in Figure 4. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by multifarious prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat).</p>
      <p>The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents and the presence (or absence) of a prepositional phrase.</p>
      <p>BLM-OdI. The BLM-OdI template is minimally different from BLM-CausI; the two also act as each other's controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and the intransitive form (L'artista dipingeva la finestra / L'artista dipingeva, 'the artist painted the window' / 'the artist painted'), and the verb does not have a causative meaning [13].</p>
      <p>The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here. The template for BLM-OdI is in Figure 5. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with da-NP.</p>
      <p>[Figures 4 and 5: BLM-CausI and BLM-OdI templates. The answer labels include, among others: wrong mood and wrong subject semantic role; R-trans = wrong sequence reasoning (transitive sentence with the second NP not preceded by a preposition); IE-WrBy = ungrammatical sentence (NP following the preposition da).]</p>
      <p>Lexical variants. Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes for each matrix, compared to the other sentences, while in type III data all words can change. Instances of the two variations are shown in Figure 3.</p>
    </sec>
    <sec id="sec-2-2">
      <title>3. Data description</title>
      <p>The data is generated by the process described in Figure 6: (i) start from identifying a linguistic phenomenon of interest, its forms of expression and the factors influencing it within a context; (ii) produce a set of seed examples from natural or synthetic data; (iii) automatically augment the seeds using a fill-mask strategy; (iv) produce BLM instances following the designed templates and generative rules. Two instances of Od verb alternations are shown in Figure 3.</p>
      <sec id="sec-2-2-1">
        <title>3.1. Origin of data</title>
        <p>BLM-AgrI. To instantiate the templates, our starting point are the examples in Franck et al. [14, appendix 1]. They provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced based on these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentences [6]. Each of these sentences is used to produce a seed.</p>
        <p>BLM-CausI and BLM-OdI. Thirty verbs from each of the causative and object-drop classes in English in Levin [13] were selected and translated by a native speaker into Italian, where the translations maintain the same alternation structure.</p>
        <p>The seeds were augmented using masked modeling with bert-base-uncased [15]. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences, and to ensure variability in gender and number.</p>
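        <p>The fill-mask augmentation in step (iii) can be sketched as follows. This is a minimal illustration with the Hugging Face transformers pipeline and an invented seed sentence, not the exact augmentation script used to build the datasets.</p>
        <preformat>
from transformers import pipeline

# bert-base-uncased is the masked language model named above.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(seed, position, top_k=5):
    """Mask the token at `position` and return the top_k single-token rewrites."""
    tokens = seed.split()
    tokens[position] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    return [out["sequence"] for out in fill_mask(" ".join(tokens), top_k=top_k)]

# Hypothetical seed: vary the subject noun of an English seed sentence.
variants = augment("the artist opened the window in a few seconds", position=1)
        </preformat>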
      </sec>
      <sec id="sec-2-2-2">
        <title>3.2. Data format</title>
        <p>The structured BLM data is provided in a json file, with each instance as one element with the specific fields described in Figure 7. A data instance is shown in Figure 10 in the appendix.</p>
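        <p>As an illustration, the fields of one instance (cf. Figure 10 in the appendix) can be read as follows; the file name is hypothetical, and the field names are those of the released json.</p>
        <preformat>
import json

# Load the json file holding the BLM instances
# (illustrative file name; assuming the file holds a list of instances).
with open("blm_odi_typeII.json", encoding="utf-8") as f:
    instances = json.load(f)

inst = instances[0]
context = inst["Context"]             # the seven context sentences
answers = inst["Answer_set"]          # the candidate answers
gold_letter = inst["Correct_option"]  # e.g. "E"
gold_answer = inst["Correct_answer"]

# Each annotation records which generating rule a candidate corrupts.
for ann in inst["Answer_set_annotation"]:
    print(ann["option"], ann["label"], ann["value"])
        </preformat>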
      </sec>
      <sec id="sec-2-2-3">
        <title>3.3. Few-shot and test splits</title>
        <p>For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances; the rest is used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.</p>
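        <p>A minimal sketch of the two split strategies described above (random sampling of few-shot instances for BLM-AgrI, and holding out all instances of one verb for BLM-CausI/BLM-OdI); the function names are illustrative.</p>
        <preformat>
import random

def split_agr(instances, n_shots=10, seed=0):
    """BLM-AgrI: sample a few instances for few-shot prompting, test on the rest."""
    rng = random.Random(seed)
    shot_idx = set(rng.sample(range(len(instances)), n_shots))
    shots = [x for i, x in enumerate(instances) if i in shot_idx]
    test = [x for i, x in enumerate(instances) if i not in shot_idx]
    return shots, test

def split_caus_od(instances, shot_verb):
    """BLM-CausI / BLM-OdI: use all 80 instances of one verb for few-shot prompting."""
    shots = [x for x in instances if x["Verb"] == shot_verb]
    test = [x for x in instances if x["Verb"] != shot_verb]
    return shots, test
        </preformat>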
      </sec>
      <sec id="sec-2-2-4">
        <title>3.4. Example of prompts</title>
        <p>We design prompts in English and Italian, in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs' ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and to focus on the core task of selecting the best sentence to follow the given context.</p>
        <p>Zero-Shot Prompt Example in English. The prompt in Figure 8 is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, like chain-of-thought or step-by-step reasoning [16, 17]. This ensures that the model's performance reflects its intrinsic capabilities for linguistic understanding and reasoning, without prior in-context learning or guided reasoning steps.</p>
        <p>We format the prompt in Markdown and explicitly label sections for Context and Answer Set. The task is framed as a simple “puzzle” with the instruction to “choose […] the sentence that could […] follow the context”. This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and simplify the evaluation by fixing its output format.</p>
        <p>Few-Shot (One-Shot) Prompt Example in Italian. For the one-shot prediction setup (shown in Figure 9), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model's ability to use prior examples.</p>
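        <p>A sketch of how the one-shot prompt of Figure 9 can be filled from two instances (one demonstration and one query). The template below is abbreviated to the example and question blocks, and the values are taken from the Context_concatenated, Answer_concatenated and Correct_option fields of the json instances.</p>
        <preformat>
ONE_SHOT_TEMPLATE = (
    "# ESEMPIO 1\n**Contesto**\n{demo_context}\n**Risposte**\n{demo_answers}\n"
    "**Scelta corretta**\n{demo_gold}\n"
    "# DOMANDA\n**Contesto**\n{context}\n**Risposte**\n{answers}\n**La tua scelta**\n"
)

def build_one_shot_prompt(demo, query):
    """Fill the (abbreviated) Italian one-shot template with two BLM instances."""
    return ONE_SHOT_TEMPLATE.format(
        demo_context=demo["Context_concatenated"],
        demo_answers=demo["Answer_concatenated"],
        demo_gold=demo["Correct_option"],
        context=query["Context_concatenated"],
        answers=query["Answer_concatenated"],
    )
        </preformat>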
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Metrics</title>
      <p>[Figure 9: Few (One)-Shot Prompt in Italian. The prompt instructs the model to choose, from the **Risposte** (answers), the sentence that could follow the **Contesto** (context), and to output only the letter of the best option:]</p>
      <p># COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano.</p>
      <p>Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**.</p>
      <p>Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**.</p>
      <p># FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.</p>
      <p># ESEMPIO 1
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**Scelta corretta**
{Correct_option}
# DOMANDA
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**La tua scelta**</p>
      <p>We perform zero-shot and one-shot evaluation on the BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs), with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report averaged F1 scores over 3 runs in Table 2.</p>
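      <p>A minimal sketch of this evaluation loop; ask_model is a placeholder for whichever chat model is prompted, build_prompt stands for the prompting helpers of Section 3.4, and macro-averaged F1 over gold and predicted letters is one possible instantiation of the reported F1 score.</p>
      <preformat>
from statistics import mean, stdev
from sklearn.metrics import f1_score

def evaluate(instances, build_prompt, ask_model, n_runs=3):
    """Prompt the model instance by instance and average F1 over independent runs."""
    run_scores = []
    for _ in range(n_runs):
        gold, pred = [], []
        for inst in instances:
            reply = ask_model(build_prompt(inst))   # the model returns e.g. "E"
            pred.append(reply.strip()[:1].upper())  # keep only the answer letter
            gold.append(inst["Correct_option"])
        run_scores.append(f1_score(gold, pred, average="macro"))
    return mean(run_scores), stdev(run_scores)
      </preformat>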
      <sec id="sec-3-1">
        <title>Zero-Shot</title>
      </sec>
      <sec id="sec-3-2">
        <title>One-Shot</title>
      </sec>
      <sec id="sec-3-3">
        <title>Zero-Shot</title>
      </sec>
      <sec id="sec-3-4">
        <title>One-Shot</title>
      </sec>
      <sec id="sec-3-5">
        <title>Results</title>
      </sec>
      <sec id="sec-3-6">
        <title>BLM-AgrI type II</title>
        <p>ML-70 44.1 ± 0.46
ML-8 22.34 ± 0.33
M-7 25.54 ± 0.58
G-2 42.75 ± 1.01</p>
      </sec>
      <sec id="sec-3-7">
        <title>BLM-AgrI type III</title>
        <p>ML-70 45.64 ± 0.05
ML-8 26.65 ± 1.71
M-7 31.26 ± 1.60
G-2 38.48 ± 1.12</p>
      </sec>
      <sec id="sec-3-8">
        <title>BLM-CausI type II</title>
        <p>ML-70 19.97 ± 0.65
ML-8 5.85 ± 0.20
M-7 8.45 ± 0.44
G-2 18.06 ± 0.25</p>
      </sec>
      <sec id="sec-3-9">
        <title>BLM-CausI type III</title>
        <p>ML-70 26.49 ± 0.85
ML-8 18.03 ± 1.52
M-7 20.08 ± 0.76
G-2 29.12 ± 0.73</p>
      </sec>
      <sec id="sec-3-10">
        <title>BLM-OdI type II</title>
        <p>ML-70 18.28 ± 2.18
ML-8 8.55 ± 0.21
M-7 1.92 ± 0.27
G-2 14.07 ± 0.78</p>
      </sec>
      <sec id="sec-3-11">
        <title>BLM-OdI type III</title>
        <p>ML-70 17.70 ± 0.32
ML-8 9.50 ± 0.95
M-7 11.60 ± 0.64
G-2 14.74 ± 0.40
one-shot settingsG.emma-2-9b-it shows robust per- BLM-CausI tasks Meta-Llama-3-70B-Instruct
formance, especially with Italian prompts, performilnegads across both English and Italian prompts, with
similarly to the larger Meta-Llama model. In contriamsptr,ovement in one-shot English for
typeGIeI.mmasmaller models, such asMeta-Llama-3-8B-Instruct 2-9b-it shows comparable performance across both
andMistral-7B-Instruct-v0.3, perform more weakly, languages, in both zero-shot and one-shot settings.
especially with Italian prompts. Smaller models perform worse for this task, especially in
one-shot Italian prompts.
dataset
train:test
BLM-AgrI type II 2400:4121 0.881 (0.003) 0.784 (0.007)
BLM-AgrI type III 2400:4121 0.874 (0.006) 0.336 (0.005)
BLM-CausI type II 2160:240 0.486 (0.005) 0.903 (0.010)
BLM-CausI type III 2160:240 0.475 (0.010) 0.918 (0.010)
BLM-OdI type II 2160:240 0.596 (0.010) 0.983 (0.003)
BLM-OdI type III 2160:240 0.592 (0.024) 0.994 (0.004)</p>
        <p>While not directly comparable due to the diferent
training process and the diferent test data, using
pretrained transformer encoder architectures, like Electra,
significantly outperform the zero and one-shot
prompting baseline. The performance gap suggests that while
zero or one-shot prompting is flexible, it may not capture
the complex syntactic and semantic features required for
the BLM task in Italian.</p>
        <p>While the data is very rich and richly structured, it shares
all the limitations of artificial and synthetic data: stilted
sentence structure, limited variability, possibly sentences
BLM-OdI tasks OdI tasks show the lowest overaltl hat are too short. This artificiality, though, might reduce,
performance across models. This indicates that twhiethout eliminating, the risk of having sentences that
task is the most complex and challenging for the mowde-re directly seen in the training data of the pretrained
els. Meta-Llama-3-70B-Instruct performs best, partic-models that will be used, and that we use, for further
ularly in one-shot English and Italian prompts. Howeveexrp,eriments.</p>
        <p>Mistral-7B-Instruct-v0.3 struggles the most, partic- The initial seed sentences, although minimal, were
ularly in zero-shot settings, which reflects that the modcrealfted by experts. This approach is deliberate, like in the
has limited generalisation capabilities in complex linguAisR-C dataset, to guarantee that the data are not
algorithtic tasks. mically reproducible1[9]. This expert-based approach,
though, might not be easily scalable, especially given the
Key Observations Larger models, such asMeta- complexity of the data. Exploring methods to leverage
Llama-3-70B-Instruct and Gemma-2-9b-it, consis- existing datasets for seed generation could mitigate this
tently outperform smaller models, showing better gendeerp-endency.</p>
        <p>The current dataset comprises three main tasks. More
alisation and stability across tasks. English prompts
generally result in higher F1 scores, though Italian prompttassks and variants are needed to demonstrate the
robustsometimes achieve comparable performance, particulanrelyss and the wider appeal of the data.
withGemma-2-9b-it. One-shot prompting tends to
improve performance, though the degree of improveme6nt. Ethical issues
varies by model and task complexity. Smaller models,
such asMistral-7B-Instruct andMeta-Llama-3-8B- The data presented include an augmentation step that
Instruct, show substantial variance, especially in onues-es large language models (LLMs). LLMs are trained on
shot scenarios, indicating instability in complex linguisetxictensive text data, which may unintentionally
incorpotasks. rate biases present in the training corpus.</p>
      </sec>
      <sec id="sec-3-12">
        <title>Comparison with Multitask Learning Approaches</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Data license and copyright issues</title>
      <p>This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.</p>
"ID": 215,
"Context": [
"le pittrici possono disegnare delle forme in meno di due giorni",
"le artiste possono disegnare delle rappresentazioni artistiche da un mese",
"alcune coreografie sono disegnate dalle pittrici nel salone espositivo",
"delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese",
"alcune coreografie devono essere disegnate con pochi mezzi economici",
"le scenografie devono essere disegnate da pochi mesi",
"le pittrici devono disegnare nel salone espositivo"],
"Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono
disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel
salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da
un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere
disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo",
"Answer_set": [
"delle rappresentazioni artistiche devono poter disegnare le sue allieve",
"le scenografie devono essere disegnate dalle sue allieve",
"le sue allieve devono essere disegnate da delle rappresentazioni artistiche",
"le pittrici possono disegnare le scenografie",
"le pittrici possono disegnare da un anno circa",
"delle forme devono poter disegnare da pochi mesi",
"le artiste devono poter disegnare da alcune coreografie",
"delle rappresentazioni artistiche devono disegnare dalle artiste"],
"Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie
devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni
artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle
forme devono poter disegnare da pochi mesi\nG\tle artiste devono poter disegnare da alcune coreografie\nE\tdelle
rappresentazioni artistiche devono disegnare dalle artiste",
"Correct_option": "E",
"Correct_answer": "le pittrici possono disegnare da un anno circa",
"Answer_set_annotation": [
{ "label": "IR-trans",
"value": false,
"option": "A" },
{ "label": "IER-pass",
"value": false,
"option": "B" },
{ "label": "ER-pass",
"value": false,
"option": "C" },
{ "label": "R-trans",
"value": false,
"option": "D" },
{ "label": "Correct",
"value": true,
"option": "E" },
{ "label": "I-Int",
"value": false,
"option": "F" },
{ "label": "E-WrBy",
"value": false,
"option": "G" },
{ "label": "IE-WrBy",
"value": false,
"option": "H" }
},
....
]
],
"Verb": "disegnare"</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>