
Are All Languages Equal? Curriculum Learning over Different Languages

Giulia Pucci

Leonardo Ranaldi (0, 1)

Fabio Massimo Zanzotto (1)

(0) Idiap Research Institute, Switzerland
(1) University of Rome Tor Vergata

December 2023

Curriculum Learning (CL) is emerging as a relevant technique to reduce the cost of pre-training Large Language Models (LLMs). The idea, tested for the English language, is to train LLMs by organizing training examples from the simplest to the most complex. Complexity measures may depend on the specific language. Hence, this paper aims to investigate whether CL and the complexity measure can be easily exported to other languages. For this reason, we present a set of linguistically motivated measures to determine the complexity of examples, which have been used for English: these measures are based on text length, rarity, and comprehensibility. We then test the approach on two Romance languages: Italian and French. Our results show that the technique can be easily exported to languages other than English without adaptation.

Keywords: Efficient Pre-training, Multilingual LLMs, Natural Language Processing

1. Introduction

Transformer-based models have disrupted natural language understanding, outperforming previous methods and sometimes even humans in many tasks [1, 2, 3, 4]. Unsupervised learning on huge corpora, no matter the domain, seems to be the way to increase performance; however, besides the onerous costs, there are difficulties with the data.

These costs result in a significant carbon footprint [5], contrary to global sustainability goals. There are many approaches to address the AI carbon footprint problem, ranging from using more carbon-efficient energy sources to applying efficient AI models and training algorithms. Indeed, Transformers seem to be only huge memories [6, 7] and, thus, better ways to train these models are necessary. Bengio et al. [8] propose Curriculum Learning (CL), a specific class of efficient training strategies for deep learning models.

The naïve approach for training Large Language Models, which involves feeding textual batches randomly sampled from the training corpora, is revisited in CL, where the model is refined with a sequence of progressively more challenging examples [9]. This is motivated by, and emulates, how humans learn, starting with more straightforward concepts and gradually building up more complex ones. Soviany et al. [10] show that CL helps the model to perform better and converge faster.

In this paper, we deeply analyze the learning divergences that arise when training BERT [11] and GPT2 [12] from scratch on the same corpus in multiple languages. Furthermore, following our CL-LRC metrics [13] based on length, rarity, and comprehensibility, computational costs are reduced, and the divergences are narrowed. Hence, using the same small corpus in three different languages, English (original), Italian, and French (translated), experimental results show that loss values during training vary across languages. Moreover, this difference seems to be softened in terms of perplexity scores when the pre-training block-sizes increase incrementally.

2. Background

Optimizing the use of computational resources to increase the learning capabilities of Large Language Models (LLMs) is a widely studied problem. The main approaches are based on architecture, learning, and, finally, data. Although current optimization methods at the architectural level have demonstrated extensive functionality, including in further fine-tuning, gaps remain at the pre-training level.

Clark et al. [14] propose a method for reducing computational costs by modifying the Masked Language Models with a discriminator, but it may have limitations in tasks that require a deep understanding of long-term dependencies or complex relationships between words.

Sanh et al. [15] proposed parameter reduction techniques and obtained a lightweight version of BERT that is less compelling than the original in adapting parameters on specific tasks.

Finally, the last approach in vogue concerns the efficient adjustment of parameters. Parameter-Efficient Tuning (PEFT) is an efficient technique for tuning a small portion of model parameters while freezing the others. Standard PEFT techniques, namely LoRA [16], Prefix Tuning [17], and P-Tuning [18], reduce computational and storage costs while maintaining performance. However, these PEFT methods are applied to fine-tuning a model for a specific task and not to pre-training from scratch. While these topics have been extensively studied, the data-level approach has yet to be explored.

Many studies have found that the multi-headed self-attention mechanism requires tremendous computational effort. Since each head of this mechanism appears to be more attentive to local dependencies than global ones [19, 20, 21], training local self-attention in shorter blocks seems to be less complex than training global self-attention in more extended blocks. Nagatsuka et al. [9] proposed a Curriculum Learning (CL) strategy concentrating on hands-on self-attention mechanism training to enhance this aspect. They applied the strategy directly to BERT pre-training, manipulating the size of the input text block in the self-attention mechanism as a measure of difficulty.

Beyond the world of transformer-based models, many CL studies have used sentence length, external resources, or input sequences to measure difficulty in various NLP tasks, such as parsing [22], reading comprehension [23], and concept masking for pre-training of knowledge graph-related models [24].

In this paper, to address the gap of LLMs in learning English, Italian, and French, we studied the difficulties faced in learning more languages. We propose text complexity techniques combined with input text block-size in the context of the self-attention mechanism. The two approaches measure the difficulty of pre-training two language models: BERT [11] and GPT2 [12]. Our proposal adds to the incremental CL introduced in [9] an additional light step for calculating the pre-training text complexity. Our model performs better than the baselines and the methods proposed in [9] regarding loss and perplexity.

3. Our Methods

Starting from the fact that language has a structure that varies between different languages, we searched for a strategy to alleviate these divergences [25, 26]. Organizing the examples during pre-training could therefore improve the model’s performance. Starting from the concept of Curriculum Learning (CL) shown by Bengio et al. [8], according to which learning algorithms perform better when the data are presented following the current competencies of the model, we used the methodology proposed in [9], applying an incremental learning technique on increasing block-sizes. We propose to use these techniques in different languages and extend the work to a generative model. Finally, we study the impact of language complexity by introducing LRC, a measure used to determine the complexity of examples during pre-training before standard CL.

The application of the CL-LRC method consists of three steps (Figure 1): (i) sorting the corpus according to our complexity measure starting from the least complex sentences to the most complex ones; (ii) partitioning the corpus according to input blocks of predefined sizes; (iii) stepwise pre-training by increasing the block size.

3.1. Complexity

The increasing block-size techniques and complexity measures are our method’s core. While the dynamic resizing technique is fixed and does not change in different scenarios, the complexity of a text example is challenging to define. Since the tasks used in pre-training should aim to learn language from context, precisely as humans do, organizing the complexity of examples could improve CL in LLMs.

We propose combining three factors: the number of tokens or sentence length, the repetitiveness or rarity of words in the corpus, and finally, the comprehensibility or, more commonly, the Flesch-Kincaid readability metric. Aggregating these three heuristics forms LRC, one of the foundational elements of our framework. Hence, we denote our training corpus $D$ as a collection of sentences, $\{s_i\}_{i=0}^{M}$, where each sentence is a sequence of words denoted with $s_i = \{w_0^i, w_1^i, \ldots, w_{N_i}^i\}$.

[Table: Examples of the complexity values produced by the metrics defined in Section 3.1.]

Number of tokens

The number of occurrences or sentence length is critical since longer sequences are more difficult to encode, as the possibility of them being cut is high. Therefore, longer sentences would be more prone to losing context during the pre-training tasks. We compute sentence length for each period of our corpus $D$:

$$d_{length}(s_i) = length(s_i) \tag{1}$$

After obtaining the maximum $d^{\max}_{length}$ and the minimum $d^{\min}_{length}$, we normalize the values:

$$\hat{d}_{length}(s_i) = \frac{d_{length}(s_i) - d^{\min}_{length}}{d^{\max}_{length} - d^{\min}_{length}}, \quad \forall i \in [0, |D|]. \tag{2}$$

Rarity

The repetitiveness of words is a significant factor. We use the metric introduced in [27] where rarity is defined as the probability product of unigrams. This metric represents sentence information since the scores of longer sentences are the sum of more words and thus are likely to be more meaningful. Given a corpus of sentences, $\{s_i\}_{i=0}^{M}$, the complexity metric for word rarity is defined as:

$$d_{rarity}(s_i) \triangleq -\sum_{k=1}^{N_i} \log \hat{p}(w_k^i) \tag{3}$$

where we use logarithms of word probabilities. The component $\hat{p}(w_j)$ is defined as:

$$\hat{p}(w_j) \triangleq \frac{1}{N_{total}} \sum_{i=1}^{M} \sum_{k=1}^{N_i} \mathbb{1}_{w_k^i = w_j} \tag{4}$$

for each unique word $w_j$ in the corpus, where $N_{total}$ is the total number of word tokens in the corpus and $\mathbb{1}_{condition}$ is the indicator function, equal to 1 if its condition is satisfied and 0 otherwise. We compute this value for each sentence of our corpus $D$, obtaining the maximum $d^{\max}_{rarity}$ and the minimum $d^{\min}_{rarity}$, and we normalize the values:

$$\hat{d}_{rarity}(s_i) = \frac{d_{rarity}(s_i) - d^{\min}_{rarity}}{d^{\max}_{rarity} - d^{\min}_{rarity}}, \quad \forall i \in [0, |D|]. \tag{5}$$
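The length and rarity scores, together with their min-max normalization, can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the toy corpus are assumptions made for illustration; they are not the implementation used in the experiments.

```python
import math
from collections import Counter
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def length_scores(sentences: Sequence[Sequence[str]]) -> List[float]:
    """Sentence length: the number of tokens in each sentence."""
    return [float(len(s)) for s in sentences]


def rarity_scores(sentences: Sequence[Sequence[str]]) -> List[float]:
    """Rarity: negative sum of log unigram probabilities of the tokens."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())  # total number of word tokens in the corpus
    return [-sum(math.log(counts[w] / total) for w in s) for s in sentences]


# Toy corpus, tokenized by whitespace (an assumption made for brevity).
corpus = [s.split() for s in [
    "the cat sat",
    "the cat sat on the mat",
    "quantum chromodynamics explains confinement",
]]
norm_length = min_max(length_scores(corpus))
norm_rarity = min_max(rarity_scores(corpus))
```

In this toy corpus the third sentence is shorter than the second but consists only of words seen once, so its rarity score is close to that of the longest sentence; this is the kind of case where length alone would underestimate complexity.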

Readability Metric

Comprehensibility or, more commonly, readability may be related to the speed of perception, reflex blink technique, reading speed, reading fatigue, cognitively motivated characteristics, and word difficulty for a specific reader. Unfortunately, it is not always possible to collect these characteristics.

We used the Flesch-Kincaid metric [28] as an assessment tool for text comprehension. This metric is based on the length of sentences and words within a text and quantifies difficulty with a score. The lower the score, the easier it is to read and understand the text. We use the following formula:

$$d_{FK}(s_i) = 0.39\,\frac{ASL(s_i)}{100} + 11.8\,\frac{ASW(s_i)}{100} - 15.59 \tag{6}$$

where $ASL(s_i)$, the average sentence length, is the number of words divided by the number of sentences, and $ASW(s_i)$, the average word length, is the number of syllables divided by the number of words. The value 0.39 is used to scale the effect of the average sentence length to compare it to the effect of the average word length, weighted by 11.8. The final score is then adjusted by subtracting the value of 15.59, which adjusts the score scale to match the grading levels used in education more closely. We calculate this value for each sentence and obtain the maximum $d^{\max}_{FK}$ and the minimum $d^{\min}_{FK}$ scores. Finally, we normalize these values:

$$\hat{d}_{FK}(s_i) = \frac{d_{FK}(s_i) - d^{\min}_{FK}}{d^{\max}_{FK} - d^{\min}_{FK}}, \quad \forall i \in [0, |D|]. \tag{7}$$

3.2. Applying Complexity Heuristics

In the first phase, we compute the complexity of each sentence $d_{LRC}(s_i)$ by adding the normalized values of length $\hat{d}_{length}(s_i)$, rarity $\hat{d}_{rarity}(s_i)$, and readability score $\hat{d}_{FK}(s_i)$, that is:

$$d_{LRC}(s_i) = \hat{d}_{length}(s_i) + \hat{d}_{rarity}(s_i) + \hat{d}_{FK}(s_i) \tag{8}$$

Then, we sort the sentences of the original corpus by order of increasing complexity before the pre-training phase. Finally, we recompose the re-ordered corpus ready for pre-training.

Secondly, following the work of Nagatsuka et al. [9], we split the original corpora into training samples of the specified size. Each input text (block) for BERT and GPT2 pre-training need not be linguistically consistent as a sentence; it is a fixed-length interval of contiguous text. Thus, it is not guaranteed that the input is a complete sentence or begins with the first word of a sentence. Moreover, after extensive experiments, Liu et al. [29] argue that the input sequence should be at most 512 tokens. However, we follow an incremental approach that differs from the static sizing of 512 tokens per batch. The difference is the order, which is why it could be easier for a Transformer to learn by order of complexity. We train a byte-level Byte-Pair Encoding (BPE) tokenizer [30] to split the raw text into a sequence of tokens. Byte-level BPE allows the decomposition of words, including out-of-vocabulary words likely to appear during testing, especially when using a small training dataset. In the experiment, we set the vocabulary size to 20,000.

[Table: columns English, Italian, and French, each reporting Loss and Perplexity.]

3.4. Gradual Training

Using the corpus sorted by complexity order, we train the model stepwise with four block sizes, namely 64, 128, 256, and 512. At first, we train the model with the shortest block-size, 64, for an arbitrary number of steps. Then, we continue to train the model with block-sizes of 128 and 256, respectively, for the same number of steps. Finally, we finish with the largest block-size of 512.
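The three phases described in Sections 3.2 to 3.4 (sorting by aggregated complexity, cutting the token stream into fixed-size blocks, and training with increasing block sizes) can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, the normalized scores are taken as precomputed inputs, dropping the trailing partial block is a simplification, and `steps_per_stage=1000` is a placeholder because the text only specifies an arbitrary, equal number of steps per stage.

```python
from typing import Iterator, List, Sequence, Tuple


def sort_by_complexity(
    sentences: Sequence[str],
    norm_length: Sequence[float],
    norm_rarity: Sequence[float],
    norm_readability: Sequence[float],
) -> List[str]:
    """Order sentences from least to most complex (sum of the three scores)."""
    total = [l + r + f for l, r, f in zip(norm_length, norm_rarity, norm_readability)]
    order = sorted(range(len(sentences)), key=lambda i: total[i])
    return [sentences[i] for i in order]


def to_blocks(token_ids: Sequence[int], block_size: int) -> List[List[int]]:
    """Cut a token stream into contiguous blocks of exactly block_size tokens.

    Blocks ignore sentence boundaries; a trailing remainder is dropped
    (a simplification assumed here).
    """
    n_full = len(token_ids) // block_size
    return [list(token_ids[i * block_size:(i + 1) * block_size]) for i in range(n_full)]


def curriculum_stages(
    token_ids: Sequence[int],
    block_sizes: Sequence[int] = (64, 128, 256, 512),
    steps_per_stage: int = 1000,  # hypothetical value, not taken from the paper
) -> Iterator[Tuple[int, int, List[List[int]]]]:
    """Yield (block_size, steps, blocks) for each stage, shortest blocks first."""
    for size in sorted(block_sizes):
        yield size, steps_per_stage, to_blocks(token_ids, size)


# Toy usage: three "sentences" with made-up normalized scores.
ordered = sort_by_complexity(
    ["c", "a", "b"],
    norm_length=[0.9, 0.0, 0.5],
    norm_rarity=[1.0, 0.1, 0.4],
    norm_readability=[0.8, 0.0, 0.5],
)
# Stand-in for the tokenized, complexity-sorted corpus.
stages = list(curriculum_stages(list(range(1024))))
```

A training loop would iterate over `stages` and run the same optimizer and model through each stage in turn, so that only the input block size changes between stages.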

4. Experimental Results and Discussion

We evaluated the performance of our proposed CL-LRC approach in the experiments and show that it improves on the state of the art proposed in [9]. We use Wikitext-2 [31] to reproduce the results proposed there. We perform the pre-training from scratch for BERT [11] and GPT2 [30], and investigate perplexity, loss, and learning curves during and at the end of the pre-training. All experiments were performed on two NVIDIA RTX A6000 GPUs with 48 GB of memory. The code and model will be released for further research.
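The text reports both loss and perplexity without stating how one is derived from the other. A common convention, assumed here and not confirmed by the paper, is to take perplexity as the exponential of the mean per-token cross-entropy loss measured in nats. The helper name below is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_losses: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# A model assigning probability 1/4 to every target token has perplexity 4.
uniform_losses = [math.log(4.0)] * 8
ppl = perplexity(uniform_losses)
```

Under this convention, a gap in loss translates into a multiplicative gap in perplexity, which is worth keeping in mind when comparing point differences across languages.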

4.1. Data

BERT and GPT2 are pre-trained with huge corpora, i.e., BookCorpus and a Wikipedia dump with about 3 billion words [32]. In this work, we used Wikitext-2 [31], a small corpus for simulations, allowing pre-training with limited computational resources. Wikitext-2 is a standard language model corpus with 720 good-quality articles from English Wikipedia. In addition, we introduced two further corpora from the Italian and French translations of Wikitext-2.

4.2. Experimental setup

We use the same corpus in three different languages to analyze learning divergences between different languages. Hence, we perform pre-training from scratch with the baseline methods, then with the baseline combined with the complexity metrics, the Total-Curriculum (the CL proposed in [9]), and our CL-LRC, using the settings proposed in [9]. In particular, in our CL-LRC, we sort the corpora according to complexity, split the corpora according to the difficulty level of the training samples, and perform the pre-training phase by increasing the block size. We performed these steps for all corpora and pre-trained BERT and GPT2 from scratch. Finally, we report the losses during learning, the final losses on the evaluation set, and the average perplexity of different cuts of the evaluation set.

4.3. Results

Difficulties in learning a language depend on the complexity of the language itself. However, they can be alleviated using curricular techniques and greatly improved using linguistically motivated methods, maintaining reduced training times as shown in Table 6. These conclusions derive from the results of pre-training from scratch in three languages using the Baseline, Total-Curriculum, and our CL-LRC techniques, visible in Table 3. In Figure 5, it can be observed from the baselines of the different corpora that models trained on English are, on average, less perplexed.

Moreover, CL-LRC outperforms the others in all corpora. However, the batch-size increase supports the performance achieved by Curriculum Learning. Finally, in Figure 4, learning curves explain the trade-off between pre-training steps and loss values.

The linguistically motivated pre-training by our metrics has improved the technique proposed in [9] and outperformed the baseline models. In particular, CL-LRC (BERT) outperforms the version without LRC by 5 points for English and by more than 30 points for Italian and French in perplexity. The same is true for GPT2 with less striking results (ranging from 16 to 4 points). Hence, this measure seems to have less impact on Italian and French, as we can observe from the Baseline models for English pre-training and the others. Finally, in Fig. 5, we can observe a clear gap in perplexity in the presence of portions of text with a small number of tokens, which is reduced to zero or almost zero when the number of tokens is larger.

4.3.2. Languages over Complexity

With the aim of studying intrinsic learning difficulties, we propose our line of experiments on the same corpus translated into three different languages: English (original), French, and Italian. We can observe that the models trained from scratch have more difficulty learning the French and Italian corpora than the English one. We believe this result stems from the structure and complexity of the languages concerned. It is widely known that, being both Romance languages, French and Italian have a very complex grammatical structure, very different from English. Regarding verb conjugation, while English verbs have relatively simple and regular conjugation patterns, French and Italian ones are very intricate, with various tenses, moods, aspects, and verb endings. As for agreement rules, unlike French and Italian, English has no grammatical gender distinction, so there is no agreement based on gender. Moreover, in contrast to their sparing use in English, French and Italian have complex systems of clauses and subordination. Therefore, just as it is more difficult for a non-native speaker to learn Italian or French from scratch, it is also more difficult for the models we tested.

4.4. Convergence Speed & Training time

Our CL-LRC outperforms the Total-Curriculum regarding loss during pre-training. In Figure 4, it can be seen that the loss of the basic model converges to around 50; in contrast, the losses of both models with curriculum steadily decrease and reach a higher convergence rate. Moreover, it can be observed that the loss of the curriculum-based model decreased steadily whenever the difficulty of the training samples was changed. Finally, in Table 6, it is possible to observe how curricular approaches can significantly reduce training time and, consequently, consumption and costs.

5. Conclusion

In this paper, we explored the effectiveness of Curriculum Learning (CL) in reducing the cost of pre-training and improving the results. We trained LLMs by organizing examples from the simplest to the most complex, thereby leveraging the concept of complexity measures.

Hence, we pre-trained from scratch BERT and GPT2 using standard baselines and CL approaches. After deep analysis, we show that divergence in learning can be mitigated using CL approaches reinforced by measures to determine the complexity of examples. These measures, applied during pre-training to sort the corpus according to complexity, show outstanding results. While the original approach was tested and validated for the English language, this research aimed to investigate whether CL and its associated complexity measure could be applied to other languages without significant adaptation. Experiments conducted in a low-resource environment show that the proposed method leads to better performance in terms of loss during learning and perplexity on test data.

In future work, we will continue to propose pedagogically motivated mechanisms to analyze weaknesses [33] and empower cross-lingual abilities to deliver multi-step reasoning answers [34].

References

[1] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355. URL: https://aclanthology.org/W18-5446. doi:10.18653/v1/W18-5446.

[2] L. Ranaldi, A. Nourbakhsh, E. S. Ruzzetti, A. Patrizi, D. Onorati, M. Mastromattei, F. Fallucchi, F. M. Zanzotto, The dark side of the language: Pre-trained transformers in the DarkNet, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 949–960. URL: https://aclanthology.org/2023.ranlp-1.102.

[3] L. Ranaldi, G. Pucci, Knowing knowledge: Epistemological study of knowledge in transformers, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.

[4] L. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, PreCog: Exploring the relation between memorization and performance in pre-trained language models, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 961–967. URL: https://aclanthology.org/2023.ranlp-1.103.

[5] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning in NLP, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3645–3650. URL: https://aclanthology.org/P19-1355. doi:10.18653/v1/P19-1355.

[6] L. Ranaldi, A. Nourbakhsh, E. S. Ruzzetti, A. Patrizi, D. Onorati, M. Mastromattei, F. Fallucchi, F. M. Zanzotto, The dark side of the language: Pre-trained transformers in the darknet, in: Proceedings of RANLP, 2023.

[7] L. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, Precog: Exploring the relation between memorization and performance in pre-trained language models, in: Proceedings of RANLP, 2023.

[8] Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 41–48.

[9] K. Nagatsuka, C. Broni-Bediako, M. Atsumi, Pre-training a BERT with curriculum learning by increasing block-size of input text, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), INCOMA Ltd., Held Online, 2021, pp. 989–996. URL: https://aclanthology.org/2021.ranlp-1.112.

[10] P. Soviany, R. T. Ionescu, P. Rota, N. Sebe, Curriculum learning: A survey, 2022. arXiv:2101.10382.

[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[12] A. Radford, K. Narasimhan, Improving language understanding by generative pre-training, 2018.

[13] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101.

[14] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.

[15] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).

[16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.

[17] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4582–4597. URL: https://aclanthology.org/2021.acl-long.353. doi:10.18653/v1/2021.acl-long.353.

[18] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, J. Tang, P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks, 2022. arXiv:2110.07602.

[19] O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky, Revealing the dark secrets of BERT, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4365–4374. URL: https://aclanthology.org/D19-1445. doi:10.18653/v1/D19-1445.

[20] S. Sukhbaatar, E. Grave, P. Bojanowski, A. Joulin, Adaptive attention span in transformers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 331–335. URL: https://aclanthology.org/P19-1032. doi:10.18653/v1/P19-1032.

[21] M. Podkorytov, D. Biś, X. Liu, How can the [mask] know? the sources and limitations of knowledge in bert, in: 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8. doi:10.1109/IJCNN52387.2021.9534299.

[22] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, From baby steps to leapfrog: How “less is more” in unsupervised dependency parsing, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Los Angeles, California, 2010, pp. 751–759. URL: https://aclanthology.org/N10-1116.

[23] B. Xu, L. Zhang, Z. Mao, Q. Wang, H. Xie, Y. Zhang, Curriculum learning for natural language understanding, in: Annual Meeting of the Association for Computational Linguistics, 2020.

[24] M. Lee, J.-H. Park, J. Kim, K.-M. Kim, S. Lee, Efficient pre-training of masked language model via concept-based curriculum masking, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 7417–7427. URL: https://aclanthology.org/2022.emnlp-main.502.

[25] F. M. Zanzotto, A. Santilli, L. Ranaldi, D. Onorati, P. Tommasino, F. Fallucchi, KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 256–267. URL: https://aclanthology.org/2020.emnlp-main.18. doi:10.18653/v1/2020.emnlp-main.18.

[26] L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Dis-cover ai minds to preserve human knowledge, Future Internet 14 (2022). URL: https://www.mdpi.com/1999-5903/14/1/10. doi:10.3390/fi14010010.

[27] E. A. Platanios, O. Stretcu, G. Neubig, B. Poczos, T. Mitchell, Competence-based curriculum learning for neural machine translation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1162–1172. URL: https://aclanthology.org/N19-1119. doi:10.18653/v1/N19-1119.

[28] J. Talburt, The flesch index: An easily programmable readability analysis algorithm, in: Proceedings of the 4th Annual International Conference on Systems Documentation, SIGDOC ’85, Association for Computing Machinery, New York, NY, USA, 1986, pp. 114–122. URL: https://doi.org/10.1145/10563.10583. doi:10.1145/10563.10583.

[29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, ArXiv abs/1907.11692 (2019).

[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.

[31] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, ArXiv abs/1609.07843 (2017).

[32] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 19–27. doi:10.1109/ICCV.2015.11.

[33] L. Ranaldi, F. M. Zanzotto, Hans, are you clever? clever hans effect analysis of neural systems, 2023. arXiv:2309.12481.

[34] L. Ranaldi, F. M. Zanzotto, Empowering multi-step reasoning across languages via tree-of-thoughts, 2023. arXiv:2311.08097.

Appendix A

Appendix B

[Table: columns Model, Training Time (English), Training Time (Italian), Training Time (French). Rows: Baseline and Total-Curriculum models, among others, for BERT and GPT2.]