<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Modular Design Patterns for Generative Neuro-Symbolic Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maaike H. T. de Boer</string-name>
          <email>maaike.deboer@tno.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quirine S. Smit</string-name>
          <email>quirine.smit@tno.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael van Bekkum</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>André Meyer-Vitali</string-name>
          <email>andre.meyer-vitali@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Schmid</string-name>
          <email>thomas.schmid@medizin.uni-halle.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GeNeSy'24: First International Workshop on Generative Neuro-Symbolic AI</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lancaster University in Leipzig</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Martin Luther University Halle-Wittenberg</institution>
          ,
          <addr-line>Halle (Saale)</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>TNO, dep. Data Science</institution>
          ,
          <addr-line>The Hague</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Developing systems that are able to generate novel outputs is one of the dominating trends in current Artificial Intelligence (AI) research. Both the capabilities and the availability of such generative systems, in particular of so-called Large Language Models (LLMs), have been exploding in recent years. While Neuro-Symbolic generative models offer advantages over purely statistical generative models, it is currently difficult to compare the different ways in which the training, fine-tuning and usage of the growing variety of such approaches is carried out. In this work, we use the modular design patterns and Boxology language of van Bekkum et al. for this purpose and extend them to enable the representation of generative models, specifically LLMs. These patterns provide a general language to describe, compare and understand the different architectures and methods used. Our main aim is to support a better understanding of generative models as well as the engineering of LLM-based systems. In order to demonstrate the usefulness of this approach, we explore generative Neuro-Symbolic architectures and approaches as use cases for these generative design patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>design patterns</kwd>
        <kwd>neuro-symbolic AI</kwd>
        <kwd>generative models</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Recently, Artificial Intelligence (AI) has taken a leap in the form of generative models.
Prominently, multimodal statistical models such as DALL-E [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Stable Diffusion [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have changed
the world of image generation, and with the release of OpenAI’s ChatGPT system, the world
of text generation has changed forever. Targeting text generation tasks in particular, both the
development and the number of Large Language Models (LLMs) have increased enormously.
Currently, many different generative models are appearing, both open-source and proprietary
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Moreover, due to open challenges of LLMs, such as hallucination [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], explainability [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
trustworthiness, novel Neuro-Symbolic generative approaches have emerged [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
∗Corresponding author; both authors contributed equally.
      </p>
      <p>
        Not only several LLMs, but also a large number of so-called foundation models dealing with
various input and output modalities have entered the scene in recent years. Due to the quantity
and diversity of emerging generative techniques, it becomes more and more challenging to keep
track of the ever-growing variety of models with different architectures and capabilities. One
way to tackle this issue is to create a high-level conceptual framework to discuss,
compare, configure and combine different models: a Boxology. The Boxology originated in
the field of Neuro-Symbolic systems, introduced by Van Harmelen and Ten Teije [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in 2019. This work was
extended in 2021 by van Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who provide a taxonomically organised vocabulary
to describe both processes and data structures used in hybrid systems.
      </p>
      <p>Here, we propose to use and extend the Boxology to gain insight into a variety of generative
models, specifically LLMs. To this end, we test the validity and usefulness of the Boxology in
this field on example architectures and applications, such as ChatGPT, KnowGL, GENOME and
Logic-LM. Our modular approach supports new architectures and engineering approaches for
systems based on generative AI models. Our pattern extensions promote transparency and
trustworthiness in system design by providing interpretable, high-level component descriptions
of generative AI models.</p>
      <p>The rest of the paper is organized as follows. In the next section, we give a more detailed
overview of the Boxology. In the third section, we propose to extend the Boxology by three
novel patterns in order to be able to handle generative models. In section 4, we dive into specific
applications and tasks in which generative models, specifically in Neuro-Symbolic systems, are
used. We conclude with summarizing our key findings and outlining future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work on the Boxology</title>
      <p>
        We base our work on the paper by van Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], in which the authors provide
a taxonomically organised vocabulary to describe both processes and data structures used in
hybrid systems. The highest level of this taxonomy contains instances, models, processes and
actors, which may be described as follows.
      </p>
      <p>Instances: The two main classes of instances are data and symbols. Symbols are defined
as having a designation to an object, class or relation in the world; they can be
either atomic or complex, and when a new symbol is created from another symbol and
a system of operations, it should also have a designation. Examples of symbols are labels
(short descriptions), relations (connections between data items, such as triples) and traces
(records of data and events). Data is defined as anything not symbolic; examples are numbers,
texts, tensors or streams.</p>
      <p>Models: Models are descriptions of entities and their relationships, which can be statistical
or semantic. Statistical models represent dependencies between statistical variables,
such as LLMs or Bayesian Networks. Semantic models specify concepts, attributes and
relationships to represent the implicit meaning of symbols, such as ontologies, taxonomies,
knowledge graphs or rule bases.</p>
      <p>Processes: Processes are operations on instances and models. Three types of processes are defined:
generation, transformation and inference. Generation can be done using, for example, the
training of a model or by knowledge engineering. Transformation is the transformation
of data, for example from knowledge graph to vector space. Inference can be inductive or
deductive, in which induction generalises instances and deduction reaches conclusions
on specific instances, such as with classification.</p>
      <p>Actors: Actors can be humans, (software) agents or robots (physically embedded agents).</p>
      <p>
        Meyer-Vitali et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] extended the original paper with a definition of teams of actors in
the Boxology.
      </p>
      <p>
        Besides the vocabulary, the visual language is defined in van Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as an extension
of Van Harmelen and Ten Teije [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The visual language consists of rectangular boxes (instances),
hexagonal boxes (models), ovals (processes) and triangles (actors), with unspecified arrows
between them. Within each box, the concept is denoted by its levels in the vocabulary,
colon-separated from most generic to most specific; for example, a neural network
is written as model:stat:NN.
      </p>
      <p>
        van Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present elementary patterns, which can then be combined into more
complex patterns. Patterns 1a and 2a from Figure 1, for example, can be combined into a pattern
which is named 3a in the paper (depicted in Figure 2). Whereas 1a describes the pattern of
training a model based on data (data generates a model), 2a describes the usage of the model in
deducing a symbol (data and model deduce a symbol), such as a prediction. The combination
in 3a describes a basic structure for a (statistical) Machine Learning (ML) model depicting the
training (creating the model) and testing or application phase (applying the model on new data).
      </p>
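As an illustrative sketch of this compositional idea (the Python classes and names below are our own, not part of the Boxology specification), elementary patterns can be encoded as labelled boxes connected by arrows, with pattern 3a arising as the union of 1a and 2a:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Box:
    # Colon-separated Boxology label, most generic to most specific,
    # e.g. "model:stat:NN" for a neural network.
    label: str

    @property
    def kind(self):  # top level of the taxonomy
        return self.label.split(":")[0]

@dataclass
class Pattern:
    name: str
    boxes: set = field(default_factory=set)
    arrows: set = field(default_factory=set)  # (source_label, target_label)

    def compose(self, other, name):
        # Combining two elementary patterns (e.g. 1a and 2a) merges their
        # boxes and arrows; shared boxes (same label) are unified.
        return Pattern(name, self.boxes | other.boxes, self.arrows | other.arrows)

# Pattern 1a: data generates a model (training)
p1a = Pattern("1a",
              {Box("instance:data"), Box("process:generate:train"), Box("model:stat:NN")},
              {("instance:data", "process:generate:train"),
               ("process:generate:train", "model:stat:NN")})

# Pattern 2a: data and model deduce a symbol (e.g. a prediction)
p2a = Pattern("2a",
              {Box("instance:data"), Box("model:stat:NN"),
               Box("process:infer:deduce"), Box("instance:symbol")},
              {("instance:data", "process:infer:deduce"),
               ("model:stat:NN", "process:infer:deduce"),
               ("process:infer:deduce", "instance:symbol")})

# Pattern 3a: train and apply phases share the data and model boxes.
p3a = p1a.compose(p2a, "3a")
```

In this encoding the shared boxes (the data instance and the statistical model) are unified when the patterns are composed, mirroring how 3a reuses the model produced in the training phase during the application phase.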
      <p>
        In recent years, the Boxology has been used and extended in different ways. Three of the
most influential papers are the formalisation of the notions from the Boxology and their
implementation in the heterogeneous tool set (Hets) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the extension of the Boxology for (teams of)
actors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the systematic study of nearly 500 papers published in the past decade in the
area of Semantic Web Machine Learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Design Patterns for Generative Models</title>
      <p>
        While Generative AI originates in the realm of data-driven AI, it has demonstrated capabilities
that exceed classical machine learning tasks like classification and regression by far. In particular,
such generative systems specialise in the generation of content, such as images [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], videos
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or text [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ]. In the original, purely statistical setting, these capabilities are acquired
during a so-called (pre-)training phase [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] in which a representation of a large body of data is
learned; in a second phase (the application phase), this representation is used to map input to
output that has not been explicitly specified but follows the characteristics of that body of data.
      </p>
      <p>
        However, specific arrangements for both (pre-)training and representation usage in
downstream tasks vary for different approaches and systems [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In order to allow for a coherent
description of the generative paradigm, we propose to extend the elementary patterns of van
Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that describe the generic pattern for instances, models, processes and actors
(Figure 1, 1a-1d and 2a-2d). Please note that while patterns 1e and 1f are required for certain
aspects of the generative paradigm, their usage is not limited to it. Data generation and
labelling by humans may also be employed with any statistical approach.
      </p>
      <p>In particular, when describing classical machine learning systems, mostly pattern 2a is used,
where the output is a symbol, such as a classification or a label. However, the key concept in
generative models is that the output is not a symbol, but data; this can be an image, video or
text, depending on the model. Additionally, actors play an important role in Generative AI, by
creating prompts or label data. To this end, we here propose three new elementary patterns:
pattern 1e, in which an actor can generate data, pattern 1f, in which an actor labels data, and 2e,
in which a model can deduce data from data. In the remainder of this section we mainly focus
on Large Language Models (LLMs). Please note, however, that the patterns proposed in this
section are transferable to other data types, for example to vision transformers, which follow a
similar architectural paradigm but operate on image data.</p>
      <sec id="sec-4-1">
        <title>3.1. Transformer Models</title>
        <p>
          The key technology behind basically all current LLMs is the so-called transformer architecture.
The original transformer paper by Vaswani et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposed to use two interacting models,
an encoder and a decoder. In the transformer family, some models, however, only use the
encoder or the decoder part [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Figure 3A shows the architecture of a transformer model as a
design pattern. The two parts of a transformer, the encoder and the decoder, are
usually trained end-to-end (such as flan-T5 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]), but can also be used separately as
encoder-only (Figure 3B) or decoder-only (Figure 3C) models. In the following sections, we focus on an
encoder-only and a decoder-only family. Other sections focus on instructions and prompting of
different models and the interaction with actors.
        </p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Encoder only: BERT (base)</title>
          <p>
            Some systems are encoder-only. These systems are specialised in contextual encoding and are often
called base models. They can ‘understand’ and encode input sentences. An encoder model is
trained using data, pattern 1a. It is often connected to other systems, such as a classification
system, pattern 3a (see Figure 3B), to be useful for tasks other than encoding input sentences.
An example of this is BERT [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. Encoders are transformer models, but not generative models.
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Decoder only: GPT</title>
          <p>
            Other transformer-based systems have decoder-only architectures. This approach is
complementary to the encoder-only paradigm, but structurally different [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]: an encoder processes the
input data (in this case text) and transforms it into a different, machine-interpretable
representation, often a vector representation. A decoder-only system, on the other hand, decodes the
input data directly into the desired representation (text or images), without first transforming
it into a higher, more abstract representation. Examples of this are generative models from the
GPT family [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
          <p>In the Boxology, both encoders and decoders have a similar representation. For generative
models from the GPT family, we suggest pattern 3c (see Figure 2), which is a combination of 1a
and 2e, as presented in Figure 1: data is used to train a decoder model, which, unlike other
transformers, does not also take an encoder as input. This decoder model can be used to
deduce output data from input data directly.</p>
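As a toy sketch of what deducing output data from input data means for a decoder-only model (our own illustration, not an actual GPT implementation), the decoder repeatedly predicts the next token until an end-of-sequence token appears:

```python
def greedy_decode(model, tokens, max_new=10, eos=0):
    # Pattern 2e sketch: the decoder maps input data (a token sequence)
    # directly to output data by repeatedly predicting the next token.
    # `model` is assumed to return one score per vocabulary token.
    for _ in range(max_new):
        scores = model(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens = tokens + [next_token]
        if next_token == eos:
            break
    return tokens

# Toy 'model' over a 4-token vocabulary: always favours (last token + 1) mod 4.
toy = lambda toks: [1 if i == (toks[-1] + 1) % 4 else 0 for i in range(4)]
out = greedy_decode(toy, [1], max_new=5)
# out == [1, 2, 3, 0]: the model continues the sequence until it emits eos (0).
```

The point of the sketch is that no encoder stage or intermediate symbolic representation is involved: data goes in, data comes out, which is exactly what pattern 2e captures.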
          <p>
            Decoder-only architectures may be further divided into causal decoder architectures and
prefix decoder architectures. Causal decoder architectures, such as GPT [
            <xref ref-type="bibr" rid="ref14 ref22">22, 14</xref>
            ] and BLOOMZ
[
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], use only unidirectional attention to the input sequence by using a specific mask. Prefix
decoder architectures, such as PaLM [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], use bidirectional attention for tokens in the prefix
while maintaining unidirectional attention for generating subsequent tokens. Both architectures
follow the elementary pattern 2e.
          </p>
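The difference between the two mask types can be illustrated with a small NumPy sketch (a simplified illustration of attention masking, not code from any of the cited systems): a causal mask lets each token attend only to itself and earlier tokens, while a prefix mask additionally allows full bidirectional attention within the prefix.

```python
import numpy as np

def causal_mask(seq_len):
    # 1 = attention allowed; token i may attend to tokens j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

def prefix_mask(seq_len, prefix_len):
    # Bidirectional attention inside the prefix, causal afterwards.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = 1
    return mask

m = prefix_mask(5, 3)
# Tokens 0-2 (the prefix) see each other in both directions;
# tokens 3-4 remain strictly causal.
```

In both cases the generation step is still pattern 2e; the mask only changes which parts of the input sequence each position may condition on.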
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.3. Prompts and Instructions</title>
          <p>
            One of the main differences between current LLMs and earlier BERT or other transformer
models is that the model is fine-tuned on instructions [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. Multi-task fine-tuning, or instruction
tuning, is currently often done using a collection of datasets phrased as instructions, to improve
model performance and generalisation to unseen tasks [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. The original model is often referred
to as a foundation model [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ], whereas the fine-tuned model is an adjusted model. In the Boxology,
we represent this adjusted model as a separate model, as we did with the encoder and decoder
models in Figure 3, but now stacking two decoder models. This instruction tuning also follows
pattern 1a, but the data is different as it also contains instructions.
          </p>
          <p>Next to instruction tuning, LLMs can also be adapted by in-context learning. Here, examples
are used as part of the prompt to give context for the answers to the instructions. In this case
the model weights are not changed. This optimises the performance of models on different
tasks [26], but does not need as much training data as training a model from scratch. These
prompts can include a few (training) examples of the input and output (few-shot) or no examples
(zero-shot). These few-shot examples do not train the foundation or instruction model, and
therefore we model them as input data that is used to deduce data (text), which is pattern 2e.
Assistants or GPTs could, however, be seen as a new model, especially if they perform other
tasks, such as Retrieval Augmented Generation (RAG).</p>
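Because few-shot examples enter the system only through the prompt, they are input data flowing into pattern 2e rather than training data. A minimal sketch of such prompt assembly (the format and wording are our own and purely illustrative):

```python
def build_few_shot_prompt(instruction, examples, query):
    # Few-shot examples are prepended as plain text; the model's weights
    # are untouched, so in Boxology terms everything here is input data
    # for pattern 2e (data and model deduce data).
    parts = [instruction]
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great movie!", "positive"), ("Terrible plot.", "negative")],
    "I loved the soundtrack.")
```

Passing an empty example list yields the zero-shot variant of the same pattern; either way, only the prompt changes, never the model.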
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Actor Interaction</title>
        <p>
          Actors play a large role in the current generative models. In the original paper by van Bekkum
et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], patterns using actors are underspecified. On the one hand, actors often create data,
not only in the interaction with an agent that uses generative models, but also in common
Machine Learning approaches. Many of the created textual datasets are written, pre-processed
and labelled by actors. A first proposed pattern is pattern 1e, in which an actor creates data.
The second proposed pattern is pattern 1f, in which an actor generates a label, or annotates
data. Both patterns are depicted in Figure 1.
        </p>
        <p>Generative models are often not used only once. With the current chat functions, actors
interact with the model multiple times. The main difference with other Machine Learning
models, in which data is also inputted and symbols outputted, is that there the input data is
usually not dependent on the output for the previous data point. With conversational
generative models, however, prompts can relate to the previous response. Currently, recurrent or
iterative behaviour is not yet part of the pattern concepts.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Design Patterns for Generative Neuro-Symbolic AI</title>
      <p>In this section, we describe and explore several papers that use generative models in a
Neuro-Symbolic system. The papers are selected because they represent a diverse set of possibilities
for using a generative model: at the start of the system, in the middle and at the end, but also
acting as a fluent language interface or a formal language interface. We also include ChatGPT,
which is the most prominent generative AI system and, although mainly data-driven, includes a
symbolic component in the reward-modelling part of the training phase.</p>
      <sec id="sec-5-1">
        <title>4.1. (Training of) ChatGPT</title>
        <p>
          ChatGPT is an application of the foundational model GPT3 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and later GPT4 [27]. It is
further trained to serve as an assistant. The architecture of the training phases
is represented in Figure 4. The foundation model GPT3 is used as a basis for further training
(1a). Instructions and answers are used to train what will become ChatGPT. Then, based on
new prompts, the model generates a response (3c).
        </p>
        <p>To further train ChatGPT to give the desired responses, a reward model is added. The reward
model is a separate model which can judge whether a response is a good one, given the instructions.
The reward model is trained on human annotations of multiple answers to instructions. To train
the reward model, the model trained on instructions is asked to output multiple answers. These
answers are then ranked by annotators to generate a training set for the reward model (1f). The
reward model is trained to compare answers of ChatGPT and return their score (3a). This is
then used in a loop with ChatGPT to improve the instruction answering process. As can be
seen, we have adapted the Boxology patterns to accept multiple inputs.</p>
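The annotation step (1f) can be sketched as follows: the annotators' ranking of candidate answers is turned into pairwise preference examples on which the reward model is trained. This is a schematic illustration; the actual data format used for ChatGPT is not public.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_answers):
    # ranked_answers is ordered best-first by the human annotator
    # (pattern 1f: an actor labels data). Each ordered pair
    # (better, worse) becomes one training example for the reward model.
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_answers, 2)]

pairs = ranking_to_pairs("Explain gravity.",
                         ["clear answer", "okay answer", "poor answer"])
# 3 ranked answers yield 3 preference pairs.
```

A single ranking of n answers thus yields n(n-1)/2 pairwise comparisons, which is why ranking is a data-efficient form of annotation for reward modelling.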
        <p>When applying ChatGPT in a pipeline, it suffices to show only pattern 3c, the block containing
ChatGPT, and 1e to show the user writing the prompt.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. KnowGL</title>
        <p>Figure 5 shows KnowGL Parser [28], a NeSy system combining a generative module and symbolic
methods. The KnowGL Parser can be used to automatically extract knowledge graphs from
collections of documents. It is based on BART-large, which has an encoder-decoder architecture.
The encoder receives a sentence (1a) and the decoder generates a list of ‘subject, relation, object’
(3c). These are then parsed (transformed) in preparation for the next step, fact ranking (1d). Here
a ranked list is created of distinct facts and their scores (2b). In the final step the generated
facts are linked to Wikidata. This is done using a mapping of labels to Wikidata IDs (2b). In the
case that the generative model has created a new entity, type or relation label that is not in
Wikidata, it returns ‘null’.</p>
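The parsing and linking steps can be sketched as follows (a schematic illustration of the behaviour described above, not the real KnowGL implementation; in particular, the triple separator format is our assumption): generated 'subject, relation, object' strings are parsed into triples, and each label is mapped to a Wikidata ID, with 'null' returned for labels that have no Wikidata entry.

```python
def parse_triples(decoder_output):
    # Assumption for this sketch: the decoder emits facts as
    # "(subject|relation|object)" strings.
    triples = []
    for chunk in decoder_output.split(")"):
        chunk = chunk.strip(" (,")
        if chunk:
            subj, rel, obj = (part.strip() for part in chunk.split("|"))
            triples.append((subj, rel, obj))
    return triples

def link_to_wikidata(triples, label_to_id):
    # Labels unknown to Wikidata (e.g. invented by the generative
    # model) map to 'null', as described for the KnowGL Parser.
    return [tuple(label_to_id.get(label, "null") for label in triple)
            for triple in triples]

# Tiny hypothetical label-to-ID mapping for illustration.
wikidata = {"Leipzig": "Q2079", "Germany": "Q183", "country": "P17"}
triples = parse_triples("(Leipzig|country|Germany)")
linked = link_to_wikidata(triples, wikidata)
```

The parse step corresponds to the transformation (1d) and the linking step to the deduction over the semantic model (2b) in the pattern.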
      </sec>
      <sec id="sec-5-3">
        <title>4.3. KnowBERT</title>
        <p>
          While knowledge is mostly injected to statistical generative models either during the input
or during the output stage, also approaches to inject knowledge inside the model have been
proposed. A prominent example is KnowBERT, a modified variant of the transformer
architecture BERT [29]. Although not a generative model, it stands out for its fusion of contextual and
graph representations, attention-enhanced entity spanned knowledge infusion, and flexibility in
injecting multiple Knowledge Graphs at various model levels. By integrating so-called
Knowledge Attention and Recontextualization (KAR) layers [30], graph entity embeddings are utilized
that are processed through an attention mechanism to enhance entity span embeddings. This
happens in later layers of the model to stabilize training but may potentially also used to inject
knowledge at earlier stages [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The Boxology pattern for KnowBERT is depicted in Figure 6.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Mathematical Conjecturing and LLMs</title>
        <p>The system proposed by Johansson and Smallbone [31] assigns the generative task of discovering
mathematical conjectures to an LLM (3c), while the results can be checked afterwards using
a symbolic theorem prover or counter-example finder (2b). The system is prompted with a
formal theory (e.g. a sort function), and has the LLM generate lemmas from the theory. These
generated lemmas are transformed from data to symbol and can then be used by the semantic
model(s). The pattern is depicted in Figure 7. The approach taken in Yang et al. [32] is also
captured by this pattern. The proposed system uses an LLM component to produce Prolog code
(3c) and a symbolic inference engine to produce answers and reasoning traces by executing the
aforementioned code (1d, 2b).</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. GENOME</title>
        <p>Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules (GENOME)
[33] focuses on the task of generative software module learning: an LLM generates
signatures (input/output) and reasoning steps, then an LLM creates the software module
based on those and evaluates the module on test cases.</p>
        <p>The system consists of three stages: module initialization, module generation, and module
execution. The design pattern is depicted in Figure 8. First, an LLM assesses a visual-language
question and outputs new module signatures and operation steps as a response to the query
(3c), if current modules cannot provide an adequate response. In the next step, the LLM creates
a module (software code) based on the signature/test case (3c). Finally, the module is executed
by passing it a visual query (2a).</p>
      </sec>
      <sec id="sec-5-6">
        <title>4.6. Logic-LM</title>
        <p>Logic-LM [34] integrates LLMs with symbolic solvers to improve logical problem-solving. The
pattern is depicted in Figure 9: the system utilizes LLMs to translate a problem stated in natural
language into a symbolic formulation (3c). In the next step, a symbolic reasoner
performs logical inference on the formulated problem (1d, 2b, 1d). Finally, an LLM interprets
the results and outputs natural language (3c). The LLM thus functions as a fluent language
interface (both on input and output) to a symbolic reasoner component.</p>
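The flow described above can be summarised as three composed stages. A minimal sketch with stubbed components follows; the function names and the stub logic are our own illustration, not code from the Logic-LM paper — in the real system, the first and last stages would call an LLM and the middle stage a symbolic solver.

```python
def nl_to_symbolic(question):
    # Stage 1 (pattern 3c): an LLM would translate the natural-language
    # question into a symbolic formulation; stubbed with a fixed example.
    return ("mortal(X) :- human(X).", "human(socrates).", "?- mortal(socrates).")

def symbolic_reasoner(program):
    # Stage 2 (1d, 2b): a symbolic solver performs logical inference;
    # stubbed to return the entailed answer for the example above.
    return {"mortal(socrates)": True}

def symbolic_to_nl(result):
    # Stage 3 (3c): an LLM would verbalise the solver output.
    fact, holds = next(iter(result.items()))
    return f"Yes, {fact} follows." if holds else f"No, {fact} does not follow."

def logic_lm_pipeline(question):
    # The LLM acts as a fluent language interface on both sides of the
    # symbolic reasoner, exactly as in the pattern of Figure 9.
    return symbolic_to_nl(symbolic_reasoner(nl_to_symbolic(question)))

answer = logic_lm_pipeline("Is Socrates mortal?")
```

The composition makes the division of labour explicit: the statistical components handle only translation between natural and formal language, while all logical inference happens in the symbolic component.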
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Future Work</title>
      <p>
        Generative AI is currently a major technology with many applications, and combining
data-driven approaches with knowledge-based techniques is a promising development. In
this paper, we propose new design patterns for modular generative Neuro-Symbolic systems to
be included into the design pattern approach for Neuro-Symbolic systems as proposed by van
Bekkum et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We show how the composition of elementary patterns can be used to describe
generative models, and we explore several specific generative models, such as ChatGPT, as well
as several generative NeSy papers, such as KnowGL, GENOME and Logic-LM.
      </p>
      <p>We acknowledge that this is only the first step in a more elaborate exploration on generative
design patterns and the description of generative Neuro-Symbolic architectures. In future work,
we would like to validate our proposals for extending the Boxology by applying them to more
examples from additional papers. In addition, we expect to further extend and deepen the
Boxology itself. In this paper, it became clear that the temporal or iterative aspect is not yet
visualised well; the naming and formalisation of the Boxology also deserve attention, including the do’s
and don’ts: which pattern combinations are allowed and which are not? The importance of
modelling datasets for generative AI may be taken into account in future specifications of
particular subtypes of Instances and Models in the taxonomy. Additionally, the use of graphical
tools for software development is well-known from the Unified Modelling Language (UML) and
visual programming tools, such as LabView or Scratch. We are mostly concerned with graphical
representations of design patterns for system design and documentation, but the promise of
templates, low-code or no-code development is appealing for the future.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We would like to thank the TNO project GRAIL for their financial support, as well as Frank van
Harmelen and Annette ten Teije for their feedback. We would also like to thank Daan Di Scala
for his contribution to the KnowGL pattern.</p>
      <p>J. Bohg, A. Bosselut, E. Brunskill, et al., On the opportunities and risks of foundation models, arXiv:2108.07258 (2021).</p>
      <p>[26] Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong, et al., Understanding LLMs: A comprehensive overview from training to inference, arXiv:2401.02038 (2024).</p>
      <p>[27] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, Y. Tang, A brief overview of ChatGPT: The history, status quo and potential future development, 2023.</p>
      <p>[28] G. Rossiello, M. F. M. Chowdhury, N. Mihindukulasooriya, O. Cornec, A. M. Gliozzo, KnowGL: Knowledge generation and linking from text, in: AAAI, 2023, pp. 16476–16478.</p>
      <p>[29] M. E. Peters, M. Neumann, R. L. Logan IV, R. Schwartz, V. Joshi, S. Singh, N. A. Smith, Knowledge enhanced contextual word representations, arXiv:1909.04164 (2019).</p>
      <p>[30] I. Balažević, C. Allen, T. M. Hospedales, TuckER: Tensor factorization for knowledge graph completion, arXiv:1901.09590 (2019).</p>
      <p>[31] M. Johansson, N. Smallbone, Exploring mathematical conjecturing with large language models, Proceedings of NeSy (2023).</p>
      <p>[32] S. Yang, X. Li, L. Cui, L. Bing, W. Lam, Neuro-symbolic integration brings causal and reliable reasoning proofs, 2023. arXiv:2311.09802.</p>
      <p>[33] Z. Chen, R. Sun, W. Liu, Y. Hong, C. Gan, GENOME: Generative neuro-symbolic visual reasoning by growing and reusing modules, 2023. arXiv:2311.04901.</p>
      <p>[34] L. Pan, A. Albalak, X. Wang, W. Y. Wang, Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning, arXiv:2305.12295 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Betker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          , et al.,
          <article-title>Improving image generation with better captions</article-title>
          ,
          <source>Computer Science</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.10752.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ravaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <article-title>ChatGPT's one-year anniversary: Are open-source large language models catching up?</article-title>
          ,
          <source>arXiv:2311.16989</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Explainability for large language models: A survey</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Colon-Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huggins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Breazeal</surname>
          </string-name>
          ,
          <article-title>Combining pre-trained language models and structured knowledge</article-title>
          ,
          <source>arXiv preprint arXiv:2101.12294</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <article-title>Knowledge enhanced pretrained language models: A comprehensive survey</article-title>
          ,
          <source>arXiv:2110.08455</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Van Harmelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>ten Teije</surname>
          </string-name>
          ,
          <article-title>A boxology of design patterns for hybrid learning and reasoning systems</article-title>
          ,
          <source>Journal of Web Engineering</source>
          <volume>18</volume>
          (
          <year>2019</year>
          )
          <fpage>97</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>van Bekkum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Boer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>van Harmelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meyer-Vitali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>ten Teije</surname>
          </string-name>
          ,
          <article-title>Modular design patterns for hybrid learning and reasoning systems: a taxonomy, patterns and use cases</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>51</volume>
          (
          <year>2021</year>
          )
          <fpage>6528</fpage>
          -
          <lpage>6546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Meyer-Vitali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Mulder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H. T.</given-names>
            <surname>de Boer</surname>
          </string-name>
          ,
          <article-title>Modular design patterns for hybrid actors</article-title>
          ,
          <year>2021</year>
          . arXiv:2109.09331.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mossakowski</surname>
          </string-name>
          ,
          <article-title>Modular design patterns for neural-symbolic integration: refinement and combination</article-title>
          ,
          <source>arXiv:2206.04724</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Breit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Waltersdorfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ekaputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Portisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Revenko</surname>
          </string-name>
          , A. t. Teije, et al.,
          <article-title>Combining machine learning and semantic web: A systematic mapping study</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , et al.,
          <article-title>Sora: A review on background, technology, limitations, and opportunities of large vision models</article-title>
          ,
          <source>arXiv:2402.17177</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pichai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Introducing Gemini: our largest and most capable AI model</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Why does unsupervised pre-training help deep learning?</article-title>
          ,
          <source>in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Recent advances in natural language processing via large pre-trained language models: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-X.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          , et al.,
          <article-title>Crosslingual generalization through multitask finetuning</article-title>
          ,
          <source>arXiv:2211.01786</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>Palm: Scaling language modeling with pathways</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          , S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong, et al.,
          <article-title>Understanding LLMs: A comprehensive overview from training to inference</article-title>
          ,
          <source>arXiv:2401.02038</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, Y. Tang,
          <article-title>A brief overview of ChatGPT: The history, status quo and potential future development</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] G. Rossiello, M. F. M. Chowdhury, N. Mihindukulasooriya, O. Cornec, A. M. Gliozzo,
          <article-title>KnowGL: Knowledge generation and linking from text</article-title>
          , in:
          <source>AAAI</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>16476</fpage>
          -
          <lpage>16478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29] M. E. Peters, M. Neumann, R. L. Logan IV, R. Schwartz, V. Joshi, S. Singh, N. A. Smith,
          <article-title>Knowledge enhanced contextual word representations</article-title>
          ,
          <source>arXiv:1909.04164</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] I. Balažević, C. Allen, T. M. Hospedales,
          <article-title>TuckER: Tensor factorization for knowledge graph completion</article-title>
          ,
          <source>arXiv:1901.09590</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31] M. Johansson, N. Smallbone,
          <article-title>Exploring mathematical conjecturing with large language models</article-title>
          ,
          <source>Proceedings of NeSy</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32] S. Yang, X. Li, L. Cui, L. Bing, W. Lam,
          <article-title>Neuro-symbolic integration brings causal and reliable reasoning proofs</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.09802.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33] Z. Chen, R. Sun, W. Liu, Y. Hong, C. Gan,
          <article-title>GENOME: Generative neuro-symbolic visual reasoning by growing and reusing modules</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.04901.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34] L. Pan, A. Albalak, X. Wang, W. Y. Wang,
          <article-title>Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning</article-title>
          ,
          <source>arXiv:2305.12295</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>