<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chunyang</forename><surname>Jiang</surname></persName>
							<email>chunyang.jiang@unige.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Samo</surname></persName>
							<email>giuseppe.samo@idiap.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vivi</forename><surname>Nastase</surname></persName>
							<email>vivi.a.nastase@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paola</forename><surname>Merlo</surname></persName>
							<email>paola.merlo@unige.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Idiap Research Institute</orgName>
								<address>
									<settlement>Martigny</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Geneva</orgName>
								<address>
									<settlement>Geneva</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<meeting>Tenth Italian Conference on Computational Linguistics
							<address>
								<settlement>Pisa</settlement>
								<country key="IT">Italy</country>
							</address>
						</meeting>
						<imprint>
							<date type="published" when="2024">Dec 04-06, 2024</date>
						</imprint>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">63307A04A2533E51CD03A074AC76DA3F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Blackbird Language Matrices</term>
					<term>Causative/inchoative alternation</term>
					<term>Object-drop alternation</term>
					<term>subject-verb number agreement</term>
					<term>rule-based abstraction</term>
					<term>disentanglement</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles designed to probe language-related problems and investigate deeper formal and semantic properties of language through a process of paradigm understanding. A BLM instance consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced by corrupting the generating rules. We propose three subtasks, agreement concord (Agr), causative (Caus) and object-drop (Od) alternation detection, each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Motivation</head><p>Current generative large language models (LLMs) translate across closely related languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in distinctly non-human ways. As evidenced by their need for prohibitively large training data and expensive computational resources, large language models neither generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.</p><p>To reach systematic abilities of abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and help us train them towards more complex skills.</p><p>In the CALAMITA challenge <ref type="bibr" target="#b0">[1]</ref>, we propose the task of solving Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven Progressive Matrices tests <ref type="bibr" target="#b1">[2]</ref>. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules <ref type="bibr">[3]</ref>. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure <ref type="figure" target="#fig_0">1</ref>. Unlike other attempts to create textual versions of RPMs, BLMs are neither simplistic transcriptions of visual stimuli <ref type="bibr" target="#b3">[4]</ref>, a technique that, in practice, might give away parts of the solution to the problem, nor auxiliary abstractions of stimuli in the visual domain <ref type="bibr" target="#b4">[5]</ref>. Instead, BLMs are matrices developed specifically to probe language-related problems and delve into deeper formal and semantic properties of language, through a process of linguistic paradigm understanding.</p><p>Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure <ref type="figure">2</ref>.</p><p>BLM datasets are richly structured and support many different types of investigations, at both the sentence and matrix levels. The context-answer setup supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>. The regular syntactic forms and the systematic semantic properties support investigations of systematicity and compositionality in neural networks. The predictable syntactic structure of individual sentences, and the structure within the sequence of a BLM context, also support investigations of sentence embeddings <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b10">10]</ref>. BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref>. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for the linguistic investigation of specific phenomena.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The BLM-It Challenge</head><p>The BLM-It challenge consists of six sub-tasks. All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem addressed (Agr, Caus, Od) and the lexical complexity of the data (II, III). While the agreement (Agr) task focuses on the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs: their ability to enter (or not) into a causative alternation, and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-AgrI</head><p>The BLM problem for subject-verb agreement <ref type="bibr" target="#b5">[6]</ref> consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other respects, e.g. the number of intervening attractors between the subject and the verb, the grammatical numbers of these attractors, and the clause structures. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template is shown in Figure <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-CausI</head><p>The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra/La finestra si è aperta 'The artist opened the window'/'The window opened'). The transitive form of the verb has a causative meaning <ref type="bibr" target="#b13">[13]</ref>.</p><p>The BLM-CausI template is shown in Figure <ref type="figure">4</ref>. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and on the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by various prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat).</p><p>The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents, and the presence (or absence) of a prepositional phrase.</p><p>We choose names of tasks and lexical complexity levels that make it easy to cross-reference and compare the data described here with other papers published on BLMs. Our datasets are available at: https://www.idiap.ch/en/scientific-research/data/blm-agri-gen, https://www.idiap.ch/en/scientific-research/data/blm-causi-gen, https://www.idiap.ch/en/scientific-research/data/blm-odi-gen.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-OdI</head><p>The BLM-OdI template is minimally different from BLM-CausI, and the two templates act as each other's controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L'artista dipingeva la finestra/L'artista dipingeva 'the artist painted the window'/'the artist painted'), and the verb does not have a causative meaning <ref type="bibr" target="#b13">[13]</ref>.</p><p>The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here.</p><p>The template for BLM-OdI is in Figure <ref type="figure">5</ref>. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the two BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP.</p><p>Lexical variants Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes within a matrix, compared to the other sentences, while in type III data all words can change. Instances of the two variants are shown in Figure <ref type="figure">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>The data is generated by the process described in Figure <ref type="figure" target="#fig_3">6</ref>: (i) identify a linguistic phenomenon of interest, its forms of expression, and the factors influencing it within a context; (ii) produce a set of seed examples from </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data: BLM-AgrI</head><p>To instantiate the templates, our starting point is the set of examples in Franck et al. <ref type="bibr">[14, appendix 1]</ref>. These provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced from these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentence <ref type="bibr" target="#b5">[6]</ref>. Each of these sentences is used to produce a seed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BLM-CausI and BLM-OdI</head><p>Thirty verbs from each of the causative and object-drop classes for English in Levin <ref type="bibr" target="#b13">[13]</ref> were selected and translated into Italian by a native speaker, such that the translations maintain the same alternation structure.</p><p>The seeds were augmented using masked language modeling with bert-base-uncased <ref type="bibr" target="#b15">[15]</ref>. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences and to ensure variability in gender and number.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data format</head><p>The structured BLM data is provided in a JSON file, with each instance as one element with the specific fields described in Figure <ref type="figure">7</ref>. A data instance is shown in Figure <ref type="figure">10</ref>.</p></div>
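As a minimal sketch in Python of reading and sanity-checking an instance in this format: the instance below is invented for illustration, and only a subset of the fields described in Figure 7 is shown.

```python
import json

# A hypothetical BLM instance with a subset of the fields described in
# Figure 7 (sentences shortened for illustration).
raw = json.dumps([{
    "ID": 1,
    "Context": ["frase 1", "frase 2", "frase 3", "frase 4",
                "frase 5", "frase 6", "frase 7"],
    "Answer_set": ["risposta A", "risposta B", "risposta C"],
    "Correct_option": "B",
    "Correct_answer": "risposta B",
}])

def check_instance(inst):
    """Verify that Correct_option points at Correct_answer in Answer_set."""
    idx = ord(inst["Correct_option"]) - ord("A")
    return inst["Answer_set"][idx] == inst["Correct_answer"]

instances = json.loads(raw)
assert all(check_instance(i) for i in instances)
```

Such a consistency check is cheap and catches mislabelled options before any evaluation is run.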
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Data statistics for the three datasets, in terms of few-shot training and testing. There are the same number of examples in the type II (small lexical variation within an instance) and type III (maximal lexical variation within an instance) variations of the three datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Detailed data statistics</head><p>For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances. The rest will be used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.</p></div>
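The two splitting strategies described above can be sketched as follows; this is a sketch under the stated statistics, the function names are illustrative, and the toy instances carry only the fields needed here.

```python
import random

def split_agr(instances, n_fewshot=10, seed=0):
    """BLM-AgrI: randomly sample n_fewshot instances; the rest is the test set."""
    rng = random.Random(seed)
    fewshot = rng.sample(instances, n_fewshot)
    test = [i for i in instances if i not in fewshot]
    return fewshot, test

def split_by_verb(instances, heldout_verb):
    """BLM-CausI / BLM-OdI: all instances of one verb (based on the correct
    answer's verb) go to few-shot training; the rest is the test set."""
    fewshot = [i for i in instances if i["Verb"] == heldout_verb]
    test = [i for i in instances if i["Verb"] != heldout_verb]
    return fewshot, test

# Toy data: 3 verbs x 4 instances each
toy = [{"ID": n, "Verb": v}
       for n, v in enumerate(["aprire", "mangiare", "dipingere"] * 4)]
fs, te = split_by_verb(toy, "aprire")
```

Holding out a whole verb for few-shot training (rather than a random sample) keeps the test verbs unseen in the prompt examples.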
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Example of prompts</head><p>We design prompts in English and Italian in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs' ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and focus on the core task of selecting the best sentence to follow the given context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Zero-Shot Prompt Example in English</head><p>The prompt in Figure <ref type="figure">8</ref> is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, such as chain-of-thought or step-by-step reasoning <ref type="bibr" target="#b16">[16,</ref><ref type="bibr" target="#b17">17]</ref>. This ensures that the model's performance reflects its intrinsic capabilities for linguistic understanding and reasoning, without prior in-context learning or guided reasoning steps.</p><p>We format the prompt in Markdown and explicitly label the Context and Answer Set sections. The task is framed as a simple "puzzle" with the instruction to "choose […] the sentence that could […] follow the context". This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and to simplify the evaluation by fixing the output format.</p></div>
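A sketch of how a prompt with these properties might be assembled; the wording below is illustrative and not the exact prompt used in the challenge.

```python
def build_zero_shot_prompt(context, answer_set):
    """Assemble a Markdown-formatted zero-shot prompt with labelled sections.

    context: list of 7 sentences; answer_set: list of candidate sentences.
    The model is instructed to output only the letter of its choice.
    """
    ctx = "\n".join(f"{n}\t{s}" for n, s in enumerate(context, start=1))
    ans = "\n".join(f"{chr(ord('A') + n)}\t{s}"
                    for n, s in enumerate(answer_set))
    return (
        "# TASK: Solve the following puzzle.\n"
        "Choose from the **Answer Set** the sentence that could follow "
        "the **Context**.\n\n"
        f"## Context\n{ctx}\n\n## Answer Set\n{ans}\n\n"
        "# FORMAT: Output ONLY the letter of the best answer."
    )

prompt = build_zero_shot_prompt(["s1"] * 7, ["a", "b", "c"])
```

Fixing the output to a single letter makes the evaluation a string comparison against the gold option, with no answer parsing needed.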
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Few-Shot (One-Shot) Prompt Example in Italian</head><p>For the one-shot prediction setup (shown in Figure <ref type="figure" target="#fig_7">9</ref>), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model's ability to use prior examples.</p><p>Each data instance has the following fields: { "ID": &lt;ID number&gt;, "Context": [&lt;list of comma-separated, double-quoted sentences&gt;], "Context_concatenated": &lt;double-quoted concatenation of the context sentences, each prefixed by a numeral (1 to 7) followed by a tab, separated by newlines&gt;, "Answer_set": [&lt;list of comma-separated, double-quoted sentences&gt;], "Answer_concatenated": &lt;double-quoted concatenation of the answer sentences, each prefixed by a letter (A, B, C, ...) followed by a tab, separated by newlines&gt;, "Correct_option": &lt;double-quoted single letter label&gt;, "Correct_answer": &lt;double-quoted single correct answer sentence&gt;, "Answer_set_annotation": [&lt;list of comma-separated triplets {"label": &lt;error type&gt;, "value": &lt;truth value&gt;, "option": &lt;single letter label&gt;}&gt;], "Verb": &lt;double-quoted single verb&gt; }</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>We perform zero-shot and one-shot evaluation on the BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs), with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report F1 scores averaged over the three runs in Table <ref type="table">2</ref>.</p><p>BLM-OdI tasks The OdI tasks show the lowest overall performance across models, indicating that this task is the most complex and challenging for the models. Meta-Llama-3-70B-Instruct performs best, particularly with one-shot English and Italian prompts. Mistral-7B-Instruct-v0.3 struggles the most, particularly in zero-shot settings, reflecting limited generalisation capabilities in complex linguistic tasks.</p></div>
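The scoring can be sketched as follows. This is a sketch under an assumption: the paper does not specify the F1 averaging, and with exactly one gold and one predicted option per instance, micro-averaged F1 reduces to plain accuracy, which is what we compute here; the run data are invented.

```python
def micro_f1(gold, pred):
    """With exactly one gold and one predicted label per instance,
    micro-averaged F1 reduces to plain accuracy."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def averaged_f1(runs):
    """Average the per-run scores over independent runs."""
    scores = [micro_f1(g, p) for g, p in runs]
    return sum(scores) / len(scores)

# Toy example: three independent runs over four instances each
runs = [
    (["A", "B", "C", "D"], ["A", "B", "C", "A"]),   # 3/4 correct
    (["A", "B", "C", "D"], ["A", "B", "D", "D"]),   # 3/4 correct
    (["A", "B", "C", "D"], ["A", "B", "C", "D"]),   # 4/4 correct
]
```

Averaging over independent runs, rather than pooling predictions, also exposes the run-to-run variance noted for the smaller models.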
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Key Observations</head><p>Larger models, such as Meta-Llama-3-70B-Instruct and Gemma-2-9b-it, consistently outperform smaller models, showing better generalisation and stability across tasks. English prompts generally result in higher F1 scores, though Italian prompts sometimes achieve comparable performance, particularly with Gemma-2-9b-it. One-shot prompting tends to improve performance, though the degree of improvement varies by model and task complexity. Smaller models, such as Mistral-7B-Instruct and Meta-Llama-3-8B-Instruct, show substantial variance, especially in one-shot scenarios, indicating instability on complex linguistic tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison with Multitask Learning Approaches</head><p>We compare our LLM prompting results with the work of <ref type="bibr" target="#b12">[12,</ref><ref type="bibr" target="#b11">11]</ref>, which explored the properties of Italian sentence embeddings (the embeddings of the [CLS] token from a pretrained Electra model <ref type="bibr" target="#b18">[18]</ref><ref type="foot" target="#foot_0">3</ref>) through the agreement, causative and object-drop BLM datasets, using a two-level variational encoder-decoder architecture. This system learns to compress the sentence embeddings into representations relevant to the specific BLM tasks. The dataset statistics, and the results on the individual BLM tasks as F1 scores averaged over three runs and over different amounts of lexical variation, are shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>While the results are not directly comparable, due to the different training process and different test data, the approach using pretrained transformer encoder architectures, such as Electra, significantly outperforms the zero- and one-shot prompting baselines. The performance gap suggests that while zero- or one-shot prompting is flexible, it may not capture the complex syntactic and semantic features required for the BLM task in Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>While the data is very rich and richly structured, it shares all the limitations of artificial and synthetic data: stilted sentence structure, limited variability, and possibly overly short sentences. This artificiality, though, might reduce, without eliminating, the risk that sentences were directly seen in the training data of the pretrained models that we use, and that others will use, for further experiments.</p><p>The initial seed sentences, although minimal, were crafted by experts. This approach is deliberate, as in the ARC dataset, to guarantee that the data are not algorithmically reproducible <ref type="bibr" target="#b19">[19]</ref>. This expert-based approach, though, might not scale easily, especially given the complexity of the data. Exploring methods to leverage existing datasets for seed generation could mitigate this dependency.</p><p>The current dataset comprises three main tasks. More tasks and variants are needed to demonstrate the robustness and the wider appeal of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>The data presented include an augmentation step that uses large language models (LLMs). LLMs are trained on extensive text data, which may unintentionally incorporate biases present in the training corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).</figDesc><graphic coords="1,302.62,310.48,203.35,70.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :Figure 4 :</head><label>34</label><figDesc>Figure 3: Two instances of BLM-OdI data: with little (type II) and maximal (type III) lexical variation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Context 1 AgFigure 5 :</head><label>15</label><figDesc>Figure 5: BLM-OdI Template. Same generative rules as BLM-CausI, with the difference that here the passive/active voice is confounding, and the correct answer is an erroneous answer for BLM-CausI.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: BLM data generation process, from seed examples of a linguistic problem to the complete dataset</figDesc><graphic coords="4,109.63,84.19,162.69,119.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 :Figure 8 :</head><label>78</label><figDesc>Figure 7: Data format</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>#</head><label></label><figDesc>COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano. Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**. Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**. # FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Few (One)-Shot Prompt in Italian.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc></figDesc><table><row><cell>Dataset statistics and evaluation results on a two-level varia-</cell></row><row><cell>tional encoder-decoder architecture using an Italian Electra</cell></row><row><cell>(E-It) and a multilingual Electra (E-M) pretrained model to</cell></row><row><cell>provide sentence embeddings.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Italian Electra (E-It) pretrained model: dbmdz/electra-baseitalian-xxl-cased-discriminator, multi-lingual Electra (E-M) model: google/electra-base-discriminator</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through grant SNF Advanced grant TMAG-1_209426 to PM.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(P. Merlo) GLOBE https://www.idiap.ch/en/scientific-research/researchers</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Example Data Format</head><p>[{ "ID": 215, "Context": [ "le pittrici possono disegnare delle forme in meno di due giorni", "le artiste possono disegnare delle rappresentazioni artistiche da un mese", "alcune coreografie sono disegnate dalle pittrici nel salone espositivo", "delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese", "alcune coreografie devono essere disegnate con pochi mezzi economici", "le scenografie devono essere disegnate da pochi mesi", "le pittrici devono disegnare nel salone espositivo"], "Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo", "Answer_set": [ "delle rappresentazioni artistiche devono poter disegnare le sue allieve", "le scenografie devono essere disegnate dalle sue allieve", "le sue allieve devono essere disegnate da delle rappresentazioni artistiche", "le pittrici possono disegnare le scenografie", "le pittrici possono disegnare da un anno circa", "delle forme devono poter disegnare da pochi mesi", "le artiste devono poter disegnare da alcune coreografie", "delle rappresentazioni artistiche devono disegnare dalle artiste"], "Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle 
forme devono poter disegnare da pochi mesi\nG\tle artiste devono poter disegnare da alcune coreografie\nH\tdelle rappresentazioni artistiche devono disegnare dalle artiste", "Correct_option": "E", "Correct_answer": "le pittrici possono disegnare da un anno circa", "Answer_set_annotation": [ { "label": "IR-trans", "value": false, "option": "A" }, { "label": "IER-pass", "value": false, "option": "B" }, { "label": "ER-pass", "value": false, "option": "C" }, { "label": "R-trans", "value": false, "option": "D" }, { "label": "Correct", "value": true, "option": "E" }, { "label": "I-Int", "value": false, "option": "F" }, { "label": "E-WrBy", "value": false, "option": "G" }, { "label": "IE-WrBy", "value": false, "option": "H" } ], "Verb": "disegnare" }, .... ] </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">CALAMITA - Challenge the Abilities of LAnguage Models in ITAlian: Overview</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
				<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.11444</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.11444" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Standardization of progressive matrices</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Raven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">British Journal of Medical Psychology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="137" to="150" />
			<date type="published" when="1938">1938</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Emergent analogical reasoning in large language models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Webb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Holyoak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41562-023-01659-w</idno>
		<ptr target="https://doi.org/10.1038/s41562-023-01659-w" />
	</analytic>
	<monogr>
		<title level="j">Nature Human Behaviour</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1526" to="1541" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">In-context analogical reasoning with pre-trained language models</title>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Storks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.acl-long.109" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1953" to="1969" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.eacl-main.99" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1363" to="1374" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Grammatical information in BERT sentence embeddings as two-dimensional arrays</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)</title>
				<meeting>the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Are there identifiable structural parts in the sentence embedding whole?</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.blackboxnlp-1.3</idno>
		<ptr target="https://aclanthology.org/2024.blackboxnlp-1.3" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.repl4nlp-1.15" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)</title>
				<meeting>the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)<address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="203" to="214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Exploring Italian sentence embeddings properties through multi-tasking</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)</title>
				<meeting>the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nastase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Samo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)</title>
				<meeting>the Tenth Italian Conference on Computational Linguistics (CLiC-It 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Levin</surname></persName>
		</author>
		<title level="m">English verb classes and alternations: A preliminary investigation</title>
				<imprint>
			<publisher>University of Chicago Press</publisher>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Subject-verb agreement errors in French and English: The role of syntactic hierarchy</title>
		<author>
			<persName><forename type="first">J</forename><surname>Franck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vigliocco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nicol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language and Cognitive Processes</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="371" to="404" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Large language models are zero-shot reasoners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Iwasawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="22199" to="22213" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ELECTRA: Pre-training text encoders as discriminators rather than generators</title>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-T</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="18" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">On the measure of intelligence</title>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1911.01547" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
