Reusing, Recycling and Reducing Large Models for Developing Green and Responsible Language Technology

Ainhoa Vivel-Couso

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain

Abstract

Natural Language (NL) is the most common and efficient tool for humans to transmit information. Natural Language Processing (NLP), which includes NL Understanding (NLU) and NL Generation (NLG), is one of the main challenges in Artificial Intelligence (AI) and has a growing economic impact on the current digital transformation. Despite their impressive capabilities, large pre-trained language models present serious drawbacks from research, environmental, and ethical perspectives. The primary research objective of this doctoral thesis is to advance the state of the art in NL technology by (i) developing efficient methods to extend existing models to new domains, genres, and languages for the official languages of Spain (Spanish, Catalan, Basque, and Galician) and English; (ii) exploring new ways to pre-train and fine-tune language models efficiently in terms of parameters, thereby reducing the carbon footprint required to train such models; (iii) addressing the explainability of large pre-trained language models for NLG tasks; (iv) developing a series of advanced domain-based content applications across multiple languages, sectors and domains (e.g., Meteorology and Health) with an emphasis on explainability and evaluation tasks; and (v) defining (and overseeing compliance with) guidelines and requirements for the development of Responsible NLP with an Ethical, Legal, Social, Economic, and Cultural (ELSEC) perspective.

Keywords

natural language generation, multilingualism, artificial intelligence, energy efficiency

1. Justification of the Proposed Research

The NLP community has contributed to the emergence of new disruptive techniques and tools that are revolutionizing research on AI.
Thus, the NLP community is currently undergoing a paradigm shift with the production and exploitation of large, pre-trained language models based on transformers [1, 2]. Despite their impressive capabilities, large pre-trained language models present drawbacks from an environmental and ethical perspective. For example, computing large pre-trained models from scratch is highly demanding and has a significant carbon footprint [3]. Additionally, these models are black boxes, meaning we do not have a clear understanding of how they function, when they fail, what emergent properties they might present, or how they can be exploited efficiently. Fortunately, research on Explainable AI can help better understand and utilize these models [4, 5, 6]. Some authors refer to these models as foundation models to highlight their central yet incomplete nature [7].

Furthermore, these models are costly to train and develop, both financially (due to the cost of hardware, electricity, and cloud computing time) and environmentally (due to the carbon footprint required to power modern servers with multi-GPU hardware). This also means that only a limited number of organizations with ample resources in terms of funding, computing capabilities, NLP experts, and corpora can currently afford to develop and implement such models. A growing concern is that, due to unequal access to computing power, only certain companies and elite research groups can afford modern AI research [8].

Doctoral Symposium on Natural Language Processing, 26 September 2024, Valladolid, Spain.
ainhoa.vivel.couso@usc.es (A. Vivel-Couso)
citius.gal/es/team/ainhoa-vivel-couso/ (A. Vivel-Couso)
ORCID: 0000-0002-5860-4849 (A. Vivel-Couso)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Origin and Related Work

For an AI to be considered reliable, at least four aspects should be taken into account: fairness, robustness, explainability and traceability [9]. Explainable AI (XAI) refers to the ease with which a human being can understand the decisions made by an AI system and the actions underlying them [10, 11]. In the case of data-driven machine learning (ML) methods, depending on the level of transparency of the AI method considered, we speak of white-box, gray-box or black-box methods. An example of a white-box, or fully transparent, method is the decision tree [12], since the tree generated from data makes explicit in a readable form (equivalent to rules) the knowledge at stake (variables, values) when making a prediction. Methods based on fuzzy rule systems are an example of a gray box because they are interpretable to a certain degree [13]. Finally, neural networks are an example of an opaque or black-box method, since it is not possible, especially in complex models such as Deep Learning (DL) architectures, to interpret in a readable or understandable way the relevant elements that the model considers when making a prediction. There are different ways of explaining such methods, both in terms of their inner workings and through the justification of their output (post-hoc methods) [14, 5, 15].

In the field of application of this dissertation (i.e., Reusing, Recycling and Reducing Pre-trained Models for Developing and Evaluating Green Data-to-text Systems) there are a few publications in the literature where the environmental impact of large language models (LLMs) is analyzed [16, 17, 18]. However, our proposal in this dissertation provides a new approach to the problem, since it addresses the comparison of knowledge transfer methods to reduce environmental impact.
To this end, we will study whether fine-tuning can be replaced by another type of knowledge transfer method with a lower environmental impact while still generating narratives that are equally understandable, natural and fluent. We devote the remainder of this section to presenting the definitions and basic concepts of the methodologies we will use in our research: NLG, transfer learning and automatic evaluation of narratives.

2.1. Natural Language Generation

Natural language generation (NLG) is a process that consists of the automatic construction of narratives. This process is traditionally [19] divided into a series of stages:

1. Text planning. The first stage consists of the strategic generation of narratives. Specifically, it determines what to say.
2. Sentence planning. The second stage determines how to organize the information to be conveyed.
3. Linguistic realization. The last stage is based on the application of syntactic and morphological rules to generate a correct text.

Numerous technologies exist for NLG [20]. We will focus on the use of pre-trained models [21], which have been trained on large datasets for specific tasks. They learn general patterns and features from extensive data, often using unsupervised ML techniques. After pre-training, language models can be further adapted or fine-tuned [22] for specific tasks, offering a head start in performance for tasks like language understanding, image recognition, or other applications in AI. Pre-trained models have become popular in various domains due to their ability to capture and transfer knowledge from diverse datasets, improving efficiency and performance for specific applications [23]. We will use them as a basis because a lot of resources have already been spent on their training. By reusing what they have learned, we avoid the environmental impact of training a new model from scratch.
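As a rough illustration of the three classical NLG stages listed in this subsection, the following minimal Python sketch chains text planning, sentence planning and linguistic realization over a single weather record. It is not the system developed in this thesis: the record fields (`temp_max`, `rain_mm`), the `Message` class, the function names and the fixed sentence template are all hypothetical, and the template stands in for the syntactic and morphological rules a real realizer would apply.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Message:
    """One fact selected for the narrative (hypothetical structure)."""
    topic: str
    value: float
    unit: str


def plan_text(record: dict) -> list[Message]:
    """Text planning: decide WHAT to say by selecting facts from the data."""
    messages = []
    if "temp_max" in record:
        messages.append(
            Message("maximum temperature", record["temp_max"], "degrees Celsius")
        )
    # Rainfall is only worth mentioning when some rain is expected.
    if record.get("rain_mm", 0.0) > 0.0:
        messages.append(Message("rainfall", record["rain_mm"], "mm"))
    return messages


def plan_sentences(messages: list[Message]) -> list[Message]:
    """Sentence planning: decide HOW to organize the facts.

    Here: one sentence per fact, ordered alphabetically by topic.
    """
    return sorted(messages, key=lambda m: m.topic)


def realize(ordered: list[Message]) -> str:
    """Linguistic realization: a fixed template replaces real grammar rules."""
    return " ".join(
        f"The {m.topic} will be {m.value:g} {m.unit}." for m in ordered
    )


def generate(record: dict) -> str:
    """Run the three stages in sequence."""
    return realize(plan_sentences(plan_text(record)))


if __name__ == "__main__":
    print(generate({"temp_max": 21.5, "rain_mm": 3.0}))
```

A pre-trained seq2seq model would replace these hand-written stages with a single learned mapping from input data to text, which is the setting studied in the rest of this proposal.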
As a preliminary analysis (see footnote 1), we studied the use of LLMs embedded in the following conversational assistants:

• Bing AI (https://www.microsoft.com/en-us/bing). Tests performed on 13/11/2023 at 11:30. Bing AI uses Copilot with GPT-4.0.
• ChatGPT (https://chat.openai.com). Tests performed on 13/12/2023 at 11:40. The free version of ChatGPT uses GPT-3.5.
• Google BARD (https://bard.google.com/). Tests performed on 13/12/2023 at 11:45. The specific language model that Google BARD uses is not publicly available, but it is known to be based on the Transformer architecture developed by Google AI. Currently, BARD has been completely replaced by Gemini (https://gemini.google.com/).

Footnote 1: I finished my master’s studies in February 2024 with the defense of my Master’s Thesis. I am now starting my PhD thesis on this topic.

The interaction with these systems is based on prompting. After several tests, we observed empirically that ChatGPT generated the best narratives: it fits the data, follows the instructions properly and does not tend to hallucinate.

While ChatGPT and similar language models have proven to be valuable and versatile, there are some challenges and limitations associated with their usage. Conversational assistants can generate incorrect or unreliable information due to their reliance on learned data patterns, making independent verification crucial. They are sensitive to input phrasing, leading to different responses with slight changes in wording, and can reflect biases present in their training data. There is a risk of hallucination, where the model generates credible-sounding but inaccurate content. In addition, handling complex or technical queries can be challenging for these models, as they are designed for general use. Prompt engineering, or the specific framing of queries, significantly impacts the model’s responses. Ambiguity in queries may lead the model to make incorrect assumptions rather than seeking clarification. Additionally, updates or changes to the model can alter its response patterns, and accessing the most advanced models often requires subscription plans. When using this type of conversational assistant, it is crucial to be aware of these limitations. For these reasons, we will look for models that are freely available worldwide and may produce more controlled narratives.

Finally, it is worth noting that we will pay attention only to Text-to-Text (T2T) systems because they are the most appropriate for the use case with meteorological data under consideration. To be more specific, we will use Sequence-to-Sequence (seq2seq) language models [24], a type of neural network architecture designed for tasks that involve mapping input sequences to output sequences. These models are particularly popular in NLP tasks such as machine translation, text summarization, and chatbot development. We will reuse pre-trained T2T systems to generate weather descriptions. These descriptions will be based on the meteorological characteristics of each geographical area throughout the year.

2.2. Transfer Learning

Transfer learning (TL) [25] is an ML technique where a model trained on one task is adapted for a second related task. Instead of training a model from scratch, transfer learning leverages the knowledge gained from solving a different but related task. This approach is particularly useful when we have a limited amount of labeled data for the target task and when it is especially important to reduce the energy cost and carbon footprint. Training DL models from scratch can be computationally intensive and resource-consuming. Leveraging pre-trained models and fine-tuning them for specific tasks can be more efficient in terms of computational resources and time. The process of TL typically involves two steps:

1. Pre-training. Train a model on a large dataset and a related task. This model is often referred to as the pre-trained model or base model. For example, a model might be pre-trained on a massive dataset for text generation.
2. Knowledge transfer. Use the pre-trained model as a starting point and apply a technique that adjusts the model so that the learned information can be reused.

TL is especially valuable in DL, where models have many layers and parameters. Popular pre-trained models, such as those based on convolutional neural networks (CNNs) for image-related tasks or models like BERT for NLP, have shown great success in TL scenarios [26]. TL speeds up training and allows the model to capture domain-specific features. However, what interests us most for this Thesis is to study the computational cost, specifically the energy cost, of the different ways of transferring knowledge. The rest of this section discusses the alternatives for knowledge transfer that exist in the current state of the art.

Fine-tuning (FT) is the process of making small adjustments to achieve the desired output or performance [22]. It is well known and widely used in ML, especially in DL. In the context of DL, it involves using the weights of a trained neural network to initialize another DL model from the same domain. Thus, FT consists in taking a pre-trained model and training at least one internal model parameter (i.e., weights). In the context of LLMs, FT typically transforms a general-purpose base model (e.g., GPT-3) into a specialized model for a particular use case (e.g., summarization). While FT is a common and effective approach for adapting pre-trained language models to specific downstream tasks, in its standard form it retrains the entire model, which is usually computationally expensive. Therefore, there are several alternative methods and strategies that we can consider, depending on the use case and data availability.
The choice of method also depends on factors such as the complexity of the task, the amount of task-specific data available, and the computational resources at our disposal. Accordingly, we have to try different methods empirically in the search for the most suitable one for the particular NLP task under study. There are many well-known methodologies for TL such as Prompt Engineering [27], Prompt-free Methods [28], Meta-Learning [29], and Reinforcement Learning from Human Feedback [30]. Below, we look more closely at the methods that are most pertinent to the scope of this Thesis:

• Zero-shot Learning (ZSL) is a problem setup in DL where, at test time, a learner observes samples from classes which were not observed during training [31, 32]. Zero-shot methods generally work by associating observed and non-observed classes through some form of auxiliary information, which encodes observable distinguishing properties of objects. If we want to perform tasks that the model has never seen during pre-training, we can explore ZSL techniques. These methods aim to make predictions without task-specific FT. ZSL relies on models’ ability to generalize to new tasks by understanding textual descriptions or examples of those tasks. ZSL is a valuable technique that showcases the generalization and adaptability of pre-trained language models to a wide array of NLP tasks. Nevertheless, even if ZSL offers significant advantages, it may have limitations in cases where the task descriptions are ambiguous or the model’s pre-trained knowledge does not align well with the target tasks. In such cases, few-shot learning or FT on a small amount of task-specific data may be necessary to achieve optimal performance.
• Few-shot Learning (FSL), also referred to as low-shot learning (LSL) by some researchers, is an ML method able to exploit a training dataset which contains limited information [33, 34]. Thanks to FSL, we can train models to perform tasks with only a few examples. This approach leverages the ability of models to generalize from limited examples, and it is particularly useful when, due to the lack of data, FT on a large dataset is not feasible. FSL is an alternative to fine-tuning language models that aims to perform tasks with very limited labeled data, often as few as one or a few examples per class or category. Instead of extensively fine-tuning a model, FSL focuses on enabling models to generalize effectively from a small amount of task-specific data.
• Adapters (ADT) add new modules between layers of a pre-trained model [35, 36]. This means that parameters are copied over from pre-training (meaning they remain fixed) and only a few additional task-specific parameters are added for each new task, all without affecting previous ones. In standard FT, the new top-layer and the original weights are co-trained. In contrast, in adapter-tuning, the parameters of the original network are frozen and therefore may be shared by many tasks. ADT modules have two main features: a small number of parameters, and a near-identity initialization. This TL technique provides a flexible and efficient way to extend pre-trained language models for a variety of NLP tasks.

2.3. Metrics for Text Evaluation

Text evaluation metrics are used to assess the quality of generated text. There are numerous techniques for evaluating language models [37, 38]. The choice of metrics depends on the specific task or goal, since different metrics capture various aspects of text quality. Text evaluation metrics can be broadly categorized into human evaluation and automatic evaluation metrics:

• Human evaluation entails the application of human judgment to assess the quality of generated text. Human evaluators contribute subjective ratings or judgments on various aspects of text quality (e.g., fluency, coherence, relevance, or overall quality).
While human evaluation offers rich and subjective insights, it is time-consuming, expensive, and susceptible to individual evaluator bias [39].
• Automatic evaluation employs computational methods to assess the quality of generated text, offering valuable insights into the strengths and weaknesses of the model’s outputs. This analytical approach provides detailed quantitative feedback, facilitating a comprehensive understanding of the context in which the generated text exists. A key distinction from human metrics lies in the fact that automatic metrics can be computed by a machine, streamlining the evaluation process.

Given the large number of model checkpoints slated for evaluation, human assessment was ruled out for this Thesis. Consequently, the assessment of texts generated by diverse models employing various TL techniques will be conducted through automatic evaluation only. Although a multitude of automatic evaluation metrics exists, it is important to note that, given the seq2seq nature of our problem, not all metrics are applicable. For example, Perplexity [40] stands out as one of the most prevalent metrics for appraising language models. However, this metric is tailored specifically to autoregressive language models, often referred to as causal language models, and it is not commonly used to evaluate seq2seq models such as mT5 (multilingual Text-to-Text Transfer Transformer) [41], where the task involves transforming an input sequence into an output sequence. For evaluating seq2seq models like mT5, metrics such as BLEU (Bilingual Evaluation Understudy) [42] or ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [43] are the most commonly used. These metrics compare the generated sequence against a reference or target sequence and are better suited to capture the overall quality and fluency of the generated text.

3. Description of the Proposed Research

The main objective of this doctoral thesis is to develop and validate the necessary tools to pave the way for Responsible NLP technology. To achieve this primary objective, the following specific objectives will be addressed: (I) develop efficient methods for extending existing models to new domains, genres, and languages for the official languages of Spain and English; (II) explore new ways to pre-train and fine-tune language models efficiently; (III) address the explainability of large pre-trained language models; (IV) develop advanced applications in multiple sectors and domains (e.g., Meteorology and Health) with an emphasis on explainability and evaluation tasks; and (V) define and oversee the compliance with guidelines and requirements for the development of Responsible NLP Technology from an ELSEC perspective.

4. Methodology and Proposed Experiments

To address all the scientific and technical challenges in this doctoral thesis, we will follow the principles of agile software development. This involves engaging users in the design process, seeking continuous improvement, encouraging a quick and flexible response to changes, and supporting the frequent delivery of functional software along with related technical and user manuals. A thorough review of the state of the art will be conducted at the beginning of each sprint. Thus, requirements analysis tasks are not intended to be completed before subsequent tasks begin. The entire process is a dynamic cycle around increasingly complex prototypes, covering the following three steps:

1. Research and Design Period. Define the research objective to be achieved in this iteration and investigate related work. The objective will be adjusted according to potential improvements in state-of-the-art methods or to address gaps in the research line. A simple proposal will be written to introduce previous studies and highlight the novelty in this thesis.
2. Development and Validation Period. Develop new algorithms and interfaces customized for the target users.
3. Optimization and Data Collection Period. Test the new developments with experimental data and optimize them by adjusting the hyperparameters.

The experimental process comprises the following stages:

• Standby Power Measurement involves quantifying the energy consumption of the GPUs when no processes are running. This measurement is expressed in watt-hours (Wh).
• Data Preprocessing is done to generate the datasets used for defining the Baseline and training the models.
• Baseline Definition involves training a reference model to serve as a baseline. This model will undergo traditional, resource-intensive training.
• Knowledge Transfer involves creating alternative language models through less resource-intensive training, employing diverse knowledge transfer techniques.
• Automatic Text Generation involves automatically generating narratives from test data for each trained model, including the Baseline.
• Evaluation involves assessing the texts generated by all models, including the Baseline, using automatic metrics.

In our doctoral thesis, both public and private data (confidential and/or personal) will be handled in use cases exclusively for research purposes. All data will be appropriately managed and anonymized before analysis, and the data protection officer of USC will be contacted in case of any conflict. Additionally, authorization from the USC Research Ethics Committee will be sought before conducting any experiments that may involve ethical issues. Furthermore, data collection and usage will comply with the General Data Protection Regulation (GDPR) and the new European AI regulation (AI Act).
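The energy accounting implied by the experimental stages listed in this section (standby measurement, Baseline, knowledge transfer) could be sketched as follows. This is only an illustrative sketch under stated assumptions, not the measurement code used in the thesis: it assumes that the standby measurement is available as an average idle power in watts, that each training run reports its total energy in watt-hours and its duration in seconds, and all function names and numeric values are hypothetical.

```python
def standby_energy_wh(standby_power_w: float, duration_s: float) -> float:
    """Energy (Wh) the idle GPUs would have drawn over the same duration."""
    return standby_power_w * duration_s / 3600.0


def net_energy_wh(total_energy_wh: float,
                  standby_power_w: float,
                  duration_s: float) -> float:
    """Energy (Wh) attributable to the workload: total minus idle baseline.

    Clamped at zero so that measurement noise cannot yield a negative value.
    """
    net = total_energy_wh - standby_energy_wh(standby_power_w, duration_s)
    return max(net, 0.0)


def relative_saving(baseline_wh: float, candidate_wh: float) -> float:
    """Fraction of the Baseline energy saved by a knowledge transfer method."""
    if baseline_wh <= 0.0:
        raise ValueError("baseline energy must be positive")
    return 1.0 - candidate_wh / baseline_wh


if __name__ == "__main__":
    # Hypothetical figures: a 2-hour Baseline run and a 1-hour adapter run,
    # both measured on hardware that idles at 50 W.
    baseline = net_energy_wh(500.0, standby_power_w=50.0, duration_s=7200.0)
    adapter = net_energy_wh(150.0, standby_power_w=50.0, duration_s=3600.0)
    print(f"Baseline: {baseline:.1f} Wh, adapter: {adapter:.1f} Wh")
    print(f"Relative saving: {relative_saving(baseline, adapter):.0%}")
```

Subtracting the idle draw keeps the comparison between knowledge transfer methods focused on the energy each training run adds, rather than on how long the hardware happened to be switched on.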
Finally, we will apply the Assessment List for Trustworthy AI, developed by the High-Level Expert Group on AI established by the European Commission (EU HLEG), to evaluate the compliance of the newly developed AI systems with the requirements (Human Agency and Oversight; Technical Robustness and Safety; Privacy and Data Governance; Transparency; Diversity, Non-discrimination, and Fairness; Societal and Environmental Well-being; Accountability) in their Ethics Guidelines for Trustworthy AI. Additionally, the doctoral thesis supervisor leads the Trustworthy AI laboratory at CiTIUS, which is affiliated with the Z-inspection® initiative. This is a bottom-up holistic inspection process for ethical AI that can be applied to a variety of domains such as business, healthcare, the public sector, and many others. Z-inspection® uses the EU HLEG guidelines for Trustworthy AI and is listed in the OECD AI tools and metrics catalog.

5. Specific Research Elements Proposed for Discussion

Our doctoral thesis is focused on advancing NLP technology, particularly in the development of responsible NLP, efficient model training, and application across various domains. The key areas are:

1. Technological Innovation in NLP: (I) review the methodologies for extending language models to new domains, genres, and languages; (II) evaluate the efficiency of new pre-training and fine-tuning techniques; and (III) assess the advances in explainability of large pre-trained models.
2. Responsible AI and Ethical Considerations: (I) discuss the compliance with GDPR, the AI Act, and the FAIR principles; (II) examine the use of the Assessment List for Trustworthy AI and the Z-inspection® initiative; and (III) consider the environmental impact and the measures taken to reduce the carbon footprint.
3. Application and Impact: (I) review the practical applications in sectors like meteorology and healthcare; (II) analyze the socio-economic impact of recycling pre-trained models for various domains; and (III) discuss the open-source distribution of software and models and their potential for broader industry adoption.
4. Data Handling and Ethics: (I) review the protocols for handling public and private data, including anonymization and ethical approvals; and (II) examine the strategies for ensuring data privacy and security in compliance with legal standards.

With this, we want to ensure that all critical aspects of our thesis are thoroughly evaluated. The elements proposed for discussion in this Symposium are:

1. What methods for extending language models do you recommend?
2. How should efficiency in pre-training and fine-tuning be measured? Right now, I’m using CodeCarbon to take measurements.
3. How can we guarantee compliance with legal standards (GDPR, AI Act, etc.) and with ethical guidelines (FAIR, Z-inspection®)?
4. Any suggestions for domain-specific applications? I’ve been working mainly with meteorological data.

Acknowledgments

This research work is supported under Grant TED2021-130295B-C33 funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”, and also under Grants PID2020-112623GBI00 and PID2021-123152OB-C21 funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future”. We also acknowledge the support of the Galician Ministry of Culture, Education, Professional Training and University (Grants ED481A-2024-059, Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04 and ED431C2022/19 co-funded by the European Regional Development Fund, ERDF/FEDER program).

References

[1] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J.-R. Wen, J. Yuan, W. X. Zhao, J.
Zhu, Pre-trained models: Past, present and future, AI Open 2 (2021) 225–250. URL: https://www.sciencedirect.com/science/article/pii/S2666651021000231. doi:10.1016/j.aiopen.2021.08.002.
[2] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Comput. Surv. 56 (2023). URL: https://doi.org/10.1145/3605943. doi:10.1145/3605943.
[3] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning in NLP, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3645–3650. URL: https://aclanthology.org/P19-1355. doi:10.18653/v1/P19-1355.
[4] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, Information Fusion 58 (2020) 82–115. URL: https://www.sciencedirect.com/science/article/pii/S1566253519308103. doi:10.1016/j.inffus.2019.12.012.
[5] S. Ali et al., Explainable artificial intelligence (XAI): What we know and what is left to attain trustworthy artificial intelligence, Information Fusion 99 (2023) 101805. doi:10.1016/j.inffus.2023.101805.
[6] S. Ali, T. Abuhmed, S. El-Sappagh, K. Muhammad, J. M. Alonso-Moral, R. Confalonieri, R. Guidotti, J. Del Ser, N. Díaz-Rodríguez, F. Herrera, Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence, Information Fusion 99 (2023) 101805. URL: https://www.sciencedirect.com/science/article/pii/S1566253523001148. doi:10.1016/j.inffus.2023.101805.
[7] R. B.
et al., On the opportunities and risks of foundation models, ArXiv preprint (2021). URL: https://crfm.stanford.edu/assets/report.pdf.
[8] N. Ahmed, M. Wahed, The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research, 2020. arXiv:2010.15581.
[9] S. Barro, A. Bugarín, J. Alonso, La confianza en las máquinas inteligentes, Thomson Reuters Aranzadi, 2020.
[10] A. Arrieta et al., Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020) 82–115. doi:10.1016/j.inffus.2019.12.012.
[11] D. Gunning et al., DARPA’s explainable AI (XAI) program: A retrospective, Applied AI Letters 2 (2021) e61. doi:10.1002/ail2.61.
[12] J. Quinlan, Induction of decision trees, Machine learning 1 (1986) 81–106.
[13] J. Alonso, C. Castiello, L. Magdalena, C. Mencar, Explainable Fuzzy Systems - Paving the Way from Interpretable Fuzzy Systems to Explainable AI Systems, volume 970, Springer International Publishing, 2021. doi:10.1007/978-3-030-71098-9.
[14] R. Guidotti et al., A Survey of Methods for Explaining Black Box Models, ACM Computing Surveys 51 (2019) 1–42. URL: https://dl.acm.org/doi/10.1145/3236009. doi:10.1145/3236009.
[15] G. Ras, N. Xie, M. V. Gerven, D. Doran, Explainable deep learning: A field guide for the uninitiated, Journal of Artificial Intelligence Research 73 (2022) 329–396.
[16] L. Weidinger et al., Taxonomy of risks posed by language models, in: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 214–229.
[17] A. Luccioni, S. Viguier, A.-L. Ligozat, Estimating the carbon footprint of bloom, a 176b parameter language model, Journal of Machine Learning Research 24 (2023) 1–15.
[18] M. Rillig et al., Risks and benefits of large language models for the environment, Environmental Science & Technology 57 (2023) 3464–3466.
doi:10.1021/acs.est.3c01106.
[19] E. Reiter, R. Dale, Building Natural Language Generation Systems, Studies in Natural Language Processing, Cambridge University Press, 2000. doi:10.1017/CBO9780511519857.
[20] A. Gatt, E. Krahmer, Survey of the state of the art in natural language generation: Core tasks, applications and evaluation, Journal of Artificial Intelligence Research 61 (2018) 65–170.
[21] H. Wang et al., Pre-trained language models and their applications, Engineering (2022). doi:10.1016/j.eng.2022.04.024.
[22] J. Dodge et al., Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, arXiv:2002.06305 (2020). doi:10.48550/arXiv.2002.06305.
[23] R. Tinn et al., Fine-tuning large neural language models for biomedical natural language processing, Patterns 4 (2023). doi:10.1016/j.patter.2023.100729.
[24] G. Neubig, Neural machine translation and sequence-to-sequence models: A tutorial, arXiv:1703.01619 (2017). doi:10.48550/arXiv.1703.01619.
[25] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, IGI Global, 2010, pp. 242–264.
[26] M. Mozafari, R. Farahbakhsh, N. Crespi, A BERT-based transfer learning approach for hate speech detection in online social media, in: Proceedings of the Eighth International Conference on Complex Networks and Their Applications, Springer, 2020, pp. 928–940.
[27] H. Strobelt et al., Interactive and visual prompt engineering for ad-hoc task adaptation with large language models, IEEE Transactions on Visualization and Computer Graphics 29 (2022) 1146–1156. doi:10.1109/TVCG.2022.3209479.
[28] R.
Mahabadi et al., Prompt-free and efficient few-shot learning with language models, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, 2022, pp. 3638–3652. URL: https://aclanthology.org/2022.acl-long.254. doi:10.18653/v1/2022.acl-long.254.
[29] X. Wu, L. Varshney, A meta-learning perspective on transformers for causal language modeling, arXiv:2310.05884 (2023).
[30] Y. Bai et al., Training a helpful and harmless assistant with reinforcement learning from human feedback, arXiv:2204.05862 (2022). doi:10.48550/arXiv.2204.05862.
[31] J. Lu et al., What makes pre-trained language models better zero-shot learners?, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, Toronto, Canada, 2023, pp. 2288–2303. URL: https://aclanthology.org/2023.acl-long.128. doi:10.18653/v1/2023.acl-long.128.
[32] Y. Meng, J. Huang, Y. Zhang, J. Han, Generating training data with language models: Towards zero-shot language understanding, Advances in Neural Information Processing Systems 35 (2022) 462–477.
[33] X. Lin et al., Few-shot learning with multilingual generative language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, ACL, Abu Dhabi, United Arab Emirates, 2022, pp. 9019–9052. URL: https://aclanthology.org/2022.emnlp-main.616. doi:10.18653/v1/2022.emnlp-main.616.
[34] T. Brown et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[35] N. Houlsby et al., Parameter-efficient transfer learning for NLP, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.
[36] R.
He et al., On the effectiveness of adapter-based tuning for pretrained language model adaptation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL, Online, 2021, pp. 2208–2222. URL: https://aclanthology.org/2021.acl-long.172. doi:10.18653/v1/2021.acl-long.172.
[37] G. Melis, C. Dyer, P. Blunsom, On the state of the art of evaluation in neural language models, arXiv:1707.05589 (2017).
[38] Y. Chang et al., A survey on evaluation of large language models, ACM Transactions on Intelligent Systems and Technology (2023). doi:10.1145/3641289.
[39] E. Reiter, R. Robertson, L. Osman, Lessons from a failure: Generating tailored smoking cessation letters, Artificial Intelligence 144 (2003) 41–58. URL: https://www.sciencedirect.com/science/article/pii/S0004370202003703. doi:10.1016/S0004-3702(02)00370-3.
[40] Y. Bengio, R. Ducharme, P. Vincent, A neural probabilistic language model, Advances in Neural Information Processing Systems 13 (2000).
[41] L. Xue et al., mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
[42] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, ACL, 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[43] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
URL: https://aclanthology.org/W04-1013.