<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Dec</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Are All Languages Equal? Curriculum Learning over Different Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Rome Tor Vergata</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>02</volume>
      <issue>2023</issue>
      <abstract>
        <p>Curriculum Learning (CL) is emerging as a relevant technique to reduce the cost of pre-training Large Language Models (LLMs). The idea, tested for the English language, is to train LLMs by organizing training examples from the simplest to the most complex. Complexity measures may depend on the specific language. Hence, this paper investigates whether CL and the complexity measure can be easily exported to other languages. To this end, we present a set of linguistically motivated measures for determining the complexity of examples, which have been used for English: these measures are based on text length, rarity, and comprehensibility. We then test the approach on two Romance languages: Italian and French. Our results show that the technique can be easily exported to languages other than English without adaptation.</p>
      </abstract>
      <kwd-group>
        <kwd>Efficient Pre-training</kwd>
        <kwd>Multilingual LLMs</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Transformer-based models have disrupted natural language understanding, outperforming previous methods and sometimes even humans in many tasks [1, 2, 3, 4]. Unsupervised learning on huge corpora, no matter the domain, seems to be the way to increase performance; however, besides the onerous costs, there are difficulties with the data.</p>
        <p>Therefore, this results in a significant carbon footprint [5], contrary to global sustainability goals. There are many approaches to address the AI carbon footprint problem, ranging from using more carbon-efficient energy sources to applying efficient AI models and training algorithms. Indeed, Transformers seem to be only huge memories [6, 7] and, thus, better ways to train these models are necessary. Bengio et al. [8] propose Curriculum Learning (CL), a specific class of efficient training strategies for deep learning models.</p>
        <p>
          The naïve approach for training Large Language Models, which involves feeding textual batches randomly sampled from the training corpora, is revisited in CL, where the model is refined with a sequence of progressively more challenging examples [9]. This is motivated by, and emulates, how humans learn, starting with more straightforward concepts and gradually building up more complex ones. Soviany et al. [
          <xref ref-type="bibr" rid="ref20">10</xref>
          ] show that CL helps the model to perform better and converge faster.
        </p>
        <p>In this paper, we deeply analyze the learning divergences that arise when training BERT [11] and GPT2 [12] from scratch on the same corpus in multiple languages. Furthermore, following our CL-LRC metrics [13] based on length, rarity, and comprehensibility, computational costs are reduced, and the divergences are mitigated.</p>
        <p>Hence, using the same small corpus in three different languages, English (original), Italian, and French (translated), experimental results show that loss values during training vary across languages. Moreover, this difference seems to be softened in terms of perplexity scores when the pre-training block sizes increase incrementally.</p>
        <sec>
          <title>2. Background</title>
          <p>Optimizing the use of computational resources to increase the learning capabilities of Large Language Models (LLMs) is a widely studied problem. The main approaches are based on architecture, learning, and, finally, data. Although current optimization methods at the architectural level have proven effective for fine-tuning, gaps remain at the pre-training level.</p>
          <p>
            Clark et al. [
            <xref ref-type="bibr" rid="ref1">14</xref>
            ] propose a method for reducing computational costs by modifying the Masked Language Models with a discriminator, but it may have limitations in tasks that require a deep understanding of long-term dependencies or complex relationships between words.
          </p>
          <p>Sanh et al. [15] proposed parameter reduction techniques and obtained a lightweight version of BERT that is less effective than the original when its parameters are adapted to specific tasks.</p>
          <p>Finally, the most recent approach concerns the efficient adjustment of parameters. Parameter-Efficient Tuning (PEFT) is an efficient technique that tunes a small portion of the model parameters and freezes the others. Standard PEFT techniques, such as LoRA [16], Prefix Tuning [17], and P-Tuning [18], reduce computational and storage costs while maintaining performance. However, these PEFT methods are applied to fine-tuning a model for a specific task and not to pre-training from scratch. While these topics have been extensively studied, the data-level approach has yet to be explored.</p>
          <p>
            Many studies have found that the multi-headed self-attention mechanism requires tremendous computational effort. Since each head of this mechanism appears to be more attentive to local dependencies than global ones [
            <xref ref-type="bibr" rid="ref7">19, 20, 21</xref>
            ], training local self-attention in shorter blocks seems to be less complex than training global self-attention in more extended blocks. Nagatsuka et al. [9] proposed a Curriculum Learning (CL) strategy that concentrates on hands-on training of the self-attention mechanism to enhance this aspect. They applied the strategy directly to BERT pre-training, manipulating the size of the input text block in the self-attention mechanism as a measure of difficulty.
          </p>
          <p>Beyond the world of transformer-based models, many CL studies have used sentence length, external resources, or input sequences to measure difficulty in various NLP tasks, such as parsing [22], reading comprehension [23], and concept masking for pre-training of knowledge graph-related models [24].</p>
          <p>In this paper, to close the gap of LLMs in learning English, Italian, and French, we studied the difficulties faced in learning multiple languages. We propose text complexity techniques combined with input text block size in the context of the self-attention mechanism. The two approaches measure the difficulty of pre-training two language models: BERT [11] and GPT2 [12]. Our proposal adds to the incremental CL introduced in [9] an additional light step for calculating the pre-training text complexity. Our model performs better than the baselines and the methods proposed in [9] regarding loss and perplexity.</p>
        </sec>
        <sec>
          <title>3. Our Methods</title>
          <p>Starting from the fact that language has a structure that varies between different languages, we searched for a strategy to alleviate these divergences [25, 26]. Organizing the examples during pre-training could improve the model’s performance. Therefore, starting from the concept of Curriculum Learning (CL) shown by Bengio et al. [8], according to which learning algorithms perform better when the data are presented following the current competencies of the model, we used the methodology proposed in [9], applying an incremental learning technique on increasing block sizes. We propose to use these techniques in different languages and extend the work done with a generative model. Finally, we study the impact of language complexity by introducing LRC, a measure used to determine the complexity of examples during pre-training before standard CL.</p>
        </sec>
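        <p>The application of the CL-LRC method consists of three steps (Figure 1): (i) sorting the corpus according to our complexity measure, from the least complex sentences to the most complex ones; (ii) partitioning the corpus according to input blocks of predefined sizes; (iii) stepwise pre-training by increasing the block size.</p>
      </sec>
      <sec id="sec-1-6">
        <title>3.1. Complexity</title>
        <p>The increasing block-size techniques and complexity measures are our method’s core. While the dynamic resizing technique is fixed and does not change in different scenarios, the complexity of a text example is challenging to define.</p>
        <p>Since the tasks used in pre-training should aim to learn language from context, precisely as humans do, organizing the complexity of examples could improve CL in LLMs.</p>
        <p>We propose combining three factors: the number of tokens or sentence length, the repetitiveness or rarity of words in the corpus, and finally, the comprehensibility or, more commonly, the Flesch-Kincaid readability metric.</p>
        <p>Aggregating these three heuristics forms LRC, one of the foundational elements of our framework. Hence, we denote our training corpus <inline-formula><tex-math>D</tex-math></inline-formula> as a collection of <inline-formula><tex-math>M</tex-math></inline-formula> sentences, <inline-formula><tex-math>\{s_i\}_{i=0}^{M}</tex-math></inline-formula>, where each sentence is a sequence of words denoted with <inline-formula><tex-math>s_i = \{w_0^i, w_1^i, \ldots, w_{N_i}^i\}</tex-math></inline-formula>.</p>
        <p>Table: Examples of the complexity values produced by the metrics defined in Section 3.1.</p>
        <sec id="sec-1-6-1">
          <title>Number of tokens</title>
          <p>The number of occurrences, or sentence length, is critical since longer sequences are more difficult to encode, as the possibility of them being cut is high. Therefore, longer sentences would be more prone to losing context during the pre-training tasks. We compute the sentence length for each period <inline-formula><tex-math>s_i</tex-math></inline-formula> of our corpus <inline-formula><tex-math>D</tex-math></inline-formula>:
            <disp-formula><label>(1)</label><tex-math>L(s_i) = \mathrm{length}(s_i)</tex-math></disp-formula>
            After obtaining <inline-formula><tex-math>L_{\max}</tex-math></inline-formula> and <inline-formula><tex-math>L_{\min}</tex-math></inline-formula>, we normalize the values:
            <disp-formula><label>(2)</label><tex-math>\hat{L}(s_i) = \frac{L(s_i) - L_{\min}}{L_{\max} - L_{\min}}, \quad \forall i \in [0, |D|].</tex-math></disp-formula>
          </p>
        </sec>
        <sec id="sec-1-1-1">
          <title>Rarity</title>
          <p>The repetitiveness of words is a significant factor. We use the metric introduced in [27], where rarity is defined as the probability product of unigrams. This metric represents sentence information since the scores of longer sentences are the sum of more words and thus are likely to be more meaningful. Given a corpus of sentences, <inline-formula><tex-math>\{s_i\}_{i=0}^{M}</tex-math></inline-formula>, the complexity metric for word rarity is defined as:
            <disp-formula><label>(3)</label><tex-math>R(s_i) \triangleq -\sum_{k=1}^{N_i} \log p(w_k^i)</tex-math></disp-formula>
            where we use logarithms of word probabilities. The component <inline-formula><tex-math>p(w_j)</tex-math></inline-formula> is defined as:
            <disp-formula><label>(4)</label><tex-math>p(w_j) \triangleq \frac{1}{N_{\mathrm{total}}} \sum_{i=1}^{M} \sum_{k=1}^{N_i} \mathbb{1}_{w_k^i = w_j}</tex-math></disp-formula>
            for each unique word <inline-formula><tex-math>w_j</tex-math></inline-formula> in a corpus, where <inline-formula><tex-math>N_{\mathrm{total}}</tex-math></inline-formula> is the total number of words in the corpus and <inline-formula><tex-math>\mathbb{1}</tex-math></inline-formula> is the indicator function, equal to 1 if its condition is satisfied and 0 otherwise. We compute this value for each sentence <inline-formula><tex-math>s_i</tex-math></inline-formula> of our corpus <inline-formula><tex-math>D</tex-math></inline-formula>, obtaining <inline-formula><tex-math>R_{\max}</tex-math></inline-formula> and <inline-formula><tex-math>R_{\min}</tex-math></inline-formula>, and we normalize the values:
            <disp-formula><label>(5)</label><tex-math>\hat{R}(s_i) = \frac{R(s_i) - R_{\min}}{R_{\max} - R_{\min}}, \quad \forall i \in [0, |D|].</tex-math></disp-formula>
          </p>
        </sec>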
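        <p>As a minimal sketch of the length and rarity scores and their min-max normalisation, the following Python snippet assumes whitespace-tokenised sentences and unigram probabilities estimated by relative frequency over the same corpus. The function names and the toy corpus are illustrative assumptions and are not taken from the original implementation.</p>
        <preformat>
```python
import math
from collections import Counter


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def length_scores(sentences):
    """Number of tokens in each sentence."""
    return [len(s) for s in sentences]


def rarity_scores(sentences):
    """Negative sum of log unigram probabilities for each sentence."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return [-sum(math.log(counts[w] / total) for w in s) for s in sentences]


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "the cat sat".split(),
    "the cat sat on the mat".split(),
    "a zebra".split(),
]
norm_length = min_max(length_scores(corpus))
norm_rarity = min_max(rarity_scores(corpus))
print(norm_length)
print(norm_rarity)
```
        </preformat>
        <p>On this toy corpus the longest sentence receives a normalised length of 1 and the shortest a normalised length of 0.</p>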
        <sec id="sec-1-6-2">
          <title>Readability Metric</title>
        </sec>
      </sec>
      <sec id="sec-1-7">
        <p>Comprehensibility or, more commonly, readability may be related to the speed of perception, reflex blink technique, reading speed, reading fatigue, cognitively motivated characteristics, and word difficulty for a specific reader. Unfortunately, it is not always possible to collect these characteristics.</p>
        <p>We used the Flesch-Kincaid metric [28] as an assessment tool for text comprehension. This metric is based on the length of sentences and words within a text and quantifies difficulty with a score. The lower the score, the easier it is to read and understand the text. We use the following formula:
          <disp-formula><label>(6)</label><tex-math>\mathrm{FK}(s_i) = 0.39 \, \frac{\mathrm{ASL}(s_i)}{100} + 11.8 \, \frac{\mathrm{AWL}(s_i)}{100} - 15.59</tex-math></disp-formula>
          where <inline-formula><tex-math>\mathrm{ASL}(s_i)</tex-math></inline-formula>, the average sentence length, is the number of words divided by the number of sentences, and <inline-formula><tex-math>\mathrm{AWL}(s_i)</tex-math></inline-formula> is the average word length, i.e., the number of syllables divided by the number of words. The value 0.39 is used to scale the effect of the average sentence length to compare it to the effect of the average word length, weighted by 11.8. The final score is then adjusted by subtracting the value of 15.59, which adjusts the score scale to match the grading levels used in education more closely. We calculate this value for each sentence <inline-formula><tex-math>s_i</tex-math></inline-formula> and obtain the maximum <inline-formula><tex-math>\mathrm{FK}_{\max}</tex-math></inline-formula> and the minimum <inline-formula><tex-math>\mathrm{FK}_{\min}</tex-math></inline-formula> scores. Finally, we normalize these values:
          <disp-formula><label>(7)</label><tex-math>\widehat{\mathrm{FK}}(s_i) = \frac{\mathrm{FK}(s_i) - \mathrm{FK}_{\min}}{\mathrm{FK}_{\max} - \mathrm{FK}_{\min}}, \quad \forall i \in [0, |D|].</tex-math></disp-formula>
        </p>
      </sec>
      <sec>
        <title>3.2. Applying Complexity Heuristics</title>
        <p>In the first phase, we compute the complexity of each sentence, <inline-formula><tex-math>\mathrm{LRC}(s_i)</tex-math></inline-formula>, by adding the normalized values of length <inline-formula><tex-math>\hat{L}(s_i)</tex-math></inline-formula>, rarity <inline-formula><tex-math>\hat{R}(s_i)</tex-math></inline-formula>, and readability score <inline-formula><tex-math>\widehat{\mathrm{FK}}(s_i)</tex-math></inline-formula>, that is:
          <disp-formula><label>(8)</label><tex-math>\mathrm{LRC}(s_i) = \hat{L}(s_i) + \hat{R}(s_i) + \widehat{\mathrm{FK}}(s_i)</tex-math></disp-formula>
        </p>
        <p>Then, we sort the sentences of the original corpus by order of increasing complexity before the pre-training phase. Finally, we recompose the re-ordered corpus ready for pre-training.</p>
        <p>Secondly, following the work of Nagatsuka et al. [9], we split the original corpora into training samples of the specified size. Each input text (block) for BERT and GPT2 pre-training need not be a linguistically coherent sentence; it is a fixed interval of contiguous text. Thus, it is not guaranteed that the input is a period or begins with the first word of a sentence. Moreover, after extensive experiments, Liu et al. [29] argue that the input sequence should be at most 512 tokens. However, we follow an incremental approach that differs from the static sizing of 512 tokens per batch. The difference is the order, which is the reason why it could be easier for a Transformer to learn by order of complexity. We train a byte-level Byte-Pair Encoding (BPE) tokenizer [30] to split the raw text into a sequence of tokens. Byte-level BPE allows the decomposition of words, including out-of-vocabulary words likely to appear during testing, especially when using a small training dataset. In the experiments, we set the vocabulary size to 20,000.</p>
        <p>Table: Loss and Perplexity for English, Italian, and French.</p>
      </sec>
      <sec id="sec-1-8">
        <title>3.4. Gradual Training</title>
        <p>Using the corpus sorted by complexity order, we train the model stepwise with four block sizes, namely 64, 128, 256, and 512. At first, we train the model with the shortest block size, 64, for an arbitrary number of steps. Then, we continue to train the model with block sizes of 128 and 256, respectively, for the same number of steps. Finally, we finish with the largest block size of 512.</p>
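        <p>The block partitioning and the staged schedule described above could be sketched as follows. This is an illustration under stated assumptions and not the training code used in the experiments: <monospace>split_into_blocks</monospace>, <monospace>gradual_schedule</monospace>, and <monospace>steps_per_stage</monospace> are hypothetical names, trailing tokens that do not fill a block are simply dropped, and the model update itself is left as a comment.</p>
        <preformat>
```python
def split_into_blocks(token_ids, block_size):
    """Cut a token stream into contiguous fixed-size blocks.

    Assumption: tokens left over after the last full block are dropped.
    """
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def gradual_schedule(token_ids, block_sizes=(64, 128, 256, 512), steps_per_stage=2):
    """Yield (block_size, step, block) triples, shortest blocks first.

    Every stage runs for the same number of steps, as in the description above.
    """
    for block_size in block_sizes:
        blocks = split_into_blocks(token_ids, block_size)
        if not blocks:
            continue
        for step in range(steps_per_stage):
            yield block_size, step, blocks[step % len(blocks)]


# Stand-in for the token ids of the complexity-sorted corpus (hypothetical data).
tokens = list(range(2048))

stages = []
for block_size, step, block in gradual_schedule(tokens):
    # A real training loop would run one optimisation step on `block` here.
    stages.append((block_size, step))
print(stages)
```
        </preformat>
        <p>Because the corpus is sorted before it is split, the earliest blocks of every stage contain the least complex sentences.</p>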
      </sec>
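      <p>The sorting step that precedes gradual training (Sections 3.1 and 3.2) combines three normalised scores per sentence. The sketch below is a hedged illustration only: it uses the standard Flesch-Kincaid grade with a rough vowel-group syllable estimate (an assumption, since the syllable counter is not specified in the text), treats every item as a single sentence, and estimates unigram probabilities by relative frequency. Any constant rescaling of the two ratio terms of the readability score leaves its min-max normalised value unchanged. All function names and the toy corpus are hypothetical.</p>
      <preformat>
```python
import math
import re
from collections import Counter


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def count_syllables(word):
    """Rough estimate (assumption): groups of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid(sentence):
    """Standard Flesch-Kincaid grade of one tokenised sentence (sentence count = 1)."""
    n_words = len(sentence)
    n_syllables = sum(count_syllables(w) for w in sentence)
    return 0.39 * n_words + 11.8 * (n_syllables / n_words) - 15.59


def rarity(sentence, counts, total):
    """Negative sum of log unigram probabilities."""
    return -sum(math.log(counts[w] / total) for w in sentence)


def sort_by_complexity(sentences):
    """Sum the three normalised scores and sort from least to most complex."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    length = min_max([len(s) for s in sentences])
    rare = min_max([rarity(s, counts, total) for s in sentences])
    read = min_max([flesch_kincaid(s) for s in sentences])
    scores = [l + r + f for l, r, f in zip(length, rare, read)]
    order = sorted(range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in order], [scores[i] for i in order]


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "a dog ran".split(),
    "international organizations coordinate humanitarian operations".split(),
]
ordered, scores = sort_by_complexity(corpus)
for sentence, score in zip(ordered, scores):
    print(round(score, 3), " ".join(sentence))
```
      </preformat>
      <p>The re-ordered list is what would then be concatenated and split into blocks for pre-training.</p>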
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Results and Discussion</title>
      <p>We evaluated the performance of our proposed CL-LRC approach experimentally, and we show that it improves on the state of the art proposed in [9]. We use Wikitext-2 [31] to reproduce the results proposed there. Hence, we perform the pre-training from scratch for BERT [11] and GPT2 [30]. We then investigated perplexity, loss, and learning curves during and at the end of the pre-training. All experiments were performed on two NVIDIA RTX A6000 GPUs with 48 GB of memory. The code and model will be released for further research.</p>
      <sec>
        <title>4.1. Data</title>
        <p>BERT and GPT2 are pre-trained with huge corpora, i.e., bookcorpus and Wikipedia-dump with about 3 billion words [32]. In this work, we used Wikitext-2 [31], a small corpus for simulations, allowing pre-training with limited computational resources. Wikitext-2 is a standard language model corpus with 720 good-quality articles from English Wikipedia. In addition, we introduced two further corpora from the Italian and French translations of Wikitext-2.</p>
      </sec>
      <sec>
        <title>4.2. Experimental setup</title>
        <p>We use the same corpus in three different languages to analyze learning divergences between different languages. Hence, we perform pre-training from scratch with the baseline methods, then with the baseline combined with complexity metrics, the Total-Curriculum (the CL proposed in [9]), and our CL-LRC, using the settings proposed in [9]. In particular, in our CL-LRC, we sort the corpora according to complexity, split the corpora according to the difficulty level of the training samples, and perform the pre-training phase by increasing the block size. We performed these steps for all corpora and pre-trained BERT and GPT2 from scratch. Finally, we report the losses during learning, the final losses on the evaluation set, and the average perplexity of different cuts of the evaluation set.</p>
      </sec>
      <sec>
        <title>4.3. Results</title>
        <p>Difficulties in learning a language depend on the complexity of the language itself. However, they can be alleviated using curricular techniques and greatly reduced using linguistically motivated methods, while maintaining reduced training times, as shown in Table 6. These conclusions derive from the results of pre-training from scratch in three languages using the Baseline, Total-Curriculum, and our CL-LRC techniques, visible in Table 3. In Figure 5, it can be observed from the baselines of the different corpora that English language learners, on average, are less perplexed.</p>
        <p>Moreover, our CL-LRC outperforms the others in all corpora. However, the batch-size increase supports the performance achieved by Curriculum Learning. Finally, in Figure 4, learning curves explain the trade-off between pre-training steps and loss values.</p>
        <p>The linguistically motivated pre-training by our metrics has improved the technique proposed in [9] and outperformed the baseline models. In particular, our CL-LRC (BERT) outperforms the version without LRC by 5 perplexity points for English and by more than 30 points for Italian and French. The same is true for GPT2 with less striking results (ranging from 16 to 4 points). Hence, this measure seems to have less impact on Italian and French, as we can observe from the baseline models for English pre-training and the others. Finally, in Fig. 5, we can observe a clear gap in perplexity in the presence of portions of text with a small number of tokens, which is reduced to zero or almost zero when the number of tokens is larger.</p>
        <sec id="sec-3-1">
          <title>4.3.2. Languages over Complexity</title>
          <p>With the aim of studying intrinsic learning difficulties, we propose our line of experiments on the same corpus translated into three different languages: English (original), French, and Italian. We can observe that the models started from scratch have more difficulty learning the French and Italian corpora than the English ones. We believe this result stems from the structure and complexity of the languages concerned. It is widely known that, being both Romance languages, French and Italian have a very complex grammatical structure, very different from English. Regarding verb conjugation, while English verbs have relatively simple and regular conjugation patterns, French and Italian ones are very intricate, with various tenses, moods, aspects, and verb endings. As for the agreement rules, unlike French and Italian, English has no grammatical gender distinction, so there is no agreement based on gender. Moreover, in contrast to their sparing use in English, French and Italian have complex systems of clauses and subordination. Therefore, it is more difficult for a non-native speaker to learn Italian or French from scratch, and for the same reasons it is also more difficult for the models we tested.</p>
        </sec>
      </sec>
      <sec>
        <title>4.4. Convergence Speed &amp; Training time</title>
        <p>Our CL-LRC outperforms the Total-Curriculum regarding loss during pre-training. However, in Figure 4, it can be seen that the loss of the basic model converges to around 50; in contrast, both models with curriculum steadily decrease and reach a higher convergence rate. Moreover, it can be observed that the loss of the curriculum-based model decreased steadily whenever the difficulty of the training samples was changed. Finally, in Table 6, it is possible to observe how curricular approaches can significantly reduce training time and, consequently, consumption and costs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, we explored the effectiveness of Curriculum Learning (CL) in reducing the cost of pre-training and improving the results. We trained LLMs by organizing examples from the simplest to the most complex, thereby leveraging the concept of complexity measures.</p>
        <p>Hence, we pre-trained from scratch BERT and GPT2
using standard baselines and CL approaches. After deep
analysis, we show that divergence in learning can be
mitigated using CL approaches reinforced by measures to
determine the complexity of examples. These measures,
applied during pre-training to sort the corpus according
to complexity, show outstanding results. While the
original approach was tested and validated for the English
language, this research aimed to investigate whether CL
and its associated complexity measure could be applied
to other languages without significant adaptation.
Experiments conducted in a low-resource environment show
that the proposed method leads to better performance in
terms of loss during learning and perplexity on test data.</p>
        <p>In future works, we will continue to propose pedagogically motivated mechanisms to analyze weaknesses [33] and empower cross-lingual abilities to deliver multi-step reasoning answers [34].</p>
        <p>[14] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.</p>
        <p>[15] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).</p>
        <p>[16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.</p>
        <p>[17] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4582–4597. URL: https://aclanthology.org/2021.acl-long.353. doi:10.18653/v1/2021.acl-long.353.</p>
        <p>[18] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, J. Tang, P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks, 2022. arXiv:2110.07602.</p>
        <p>[19] O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky, Revealing the dark secrets of BERT, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4365–4374. URL: https://aclanthology.org/D19-1445. doi:10.18653/v1/D19-1445.</p>
        <p>[20] S. Sukhbaatar, E. Grave, P. Bojanowski, A. Joulin, Adaptive attention span in transformers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 331–335. URL: https://aclanthology.org/P19-1032. doi:10.18653/v1/P19-1032.</p>
        <p>[21] M. Podkorytov, D. Biś, X. Liu, How can the [mask] know? the sources and limitations of knowledge in bert, in: 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8. doi:10.1109/IJCNN52387.2021.9534299.</p>
        <p>[22] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, From baby steps to leapfrog: How “less is more” in unsupervised dependency parsing, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Los Angeles, California, 2010, pp. 751–759. URL: https://aclanthology.org/N10-1116.</p>
        <p>[23] B. Xu, L. Zhang, Z. Mao, Q. Wang, H. Xie, Y. Zhang, Curriculum learning for natural language understanding, in: Annual Meeting of the Association for Computational Linguistics, 2020.</p>
        <p>[24] M. Lee, J.-H. Park, J. Kim, K.-M. Kim, S. Lee, Efficient pre-training of masked language model via concept-based curriculum masking, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 7417–7427. URL: https://aclanthology.org/2022.emnlp-main.502.</p>
        <p>[25] F. M. Zanzotto, A. Santilli, L. Ranaldi, D. Onorati, P. Tommasino, F. Fallucchi, KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 256–267. URL: https://aclanthology.org/2020.emnlp-main.18. doi:10.18653/v1/2020.emnlp-main.18.</p>
        <p>[26] L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Dis-cover ai minds to preserve human knowledge, Future Internet 14 (2022). URL: https://www.mdpi.com/1999-5903/14/1/10. doi:10.3390/fi14010010.</p>
        <p>[27] E. A. Platanios, O. Stretcu, G. Neubig, B. Poczos, T. Mitchell, Competence-based curriculum learning for neural machine translation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1162–1172. URL: https://aclanthology.org/N19-1119. doi:10.18653/v1/N19-1119.</p>
        <p>[28] J. Talburt, The flesch index: An easily programmable readability analysis algorithm, in: Proceedings of the 4th Annual International Conference on Systems Documentation, SIGDOC ’85, Association for Computing Machinery, New York, NY, USA, 1986, p. 114–122. URL: https://doi.org/10.1145/10563.10583. doi:10.1145/10563.10583.</p>
        <p>[29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, ArXiv abs/1907.11692 (2019).</p>
        <p>[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.</p>
        <p>[31] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, ArXiv abs/1609.07843 (2017).</p>
        <p>[32] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 19–27. doi:10.1109/ICCV.2015.11.</p>
        <p>[33] L. Ranaldi, F. M. Zanzotto, Hans, are you clever? clever hans effect analysis of neural systems, 2023. arXiv:2309.12481.</p>
        <p>[34] L. Ranaldi, F. M. Zanzotto, Empowering multi-step reasoning across languages via tree-of-thoughts, 2023. arXiv:2311.08097.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Appendix A</title>
    </sec>
    <sec id="sec-6">
      <title>Appendix B</title>
      <sec id="sec-6-1">
        <title>Model</title>
        <p>(BERT)
Baseline (BERT)
Total-Curriculum (BERT)
 (BERT)
 (GPT2)
Baseline (GPT2)
Total-Curriculum (GPT2)
 (GPT2)</p>
      </sec>
      <sec id="sec-6-2">
        <title>Training Time (English)</title>
      </sec>
      <sec id="sec-6-3">
        <title>Training Time (Italian) Training Time (French)</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>14th International Conference on Recent Advances</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Shoumen</surname>
          </string-name>
          , Bulgaria, Varna, Bulgaria,
          <year>2023</year>
          , pp.
          <fpage>961</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          967. URL: https://aclanthology.org/
          <year>2023</year>
          .ranlp-
          <volume>1</volume>
          .
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          03. [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , Energy and
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>Proceedings of the 57th Annual Meeting of the As-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <year>2019</year>
          , pp.
          <fpage>3645</fpage>
          -
          <lpage>3650</lpage>
          . URL: https://aclanthology.org
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          /P19-1355. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          -1355. [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nourbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Patrizi,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>the darknet</article-title>
          ,
          <source>in: Proceedings of RANLP</source>
          ,
          <year>2023</year>
          . [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>PreCog: Exploring the relation between memorization and performance in pre-trained language models</article-title>
          ,
          <source>in: Proceedings of RANLP</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          <source>in: Proceedings of the 26th Annual International Conference on Machine Learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagatsuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Broni-Bediako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atsumi</surname>
          </string-name>
          ,
          <article-title>Pre-training a BERT with curriculum learning by increasing block-size of input text</article-title>
          ,
          <source>in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)</source>
          , INCOMA Ltd., Held Online,
          <year>2021</year>
          , pp.
          <fpage>989</fpage>
          -
          <lpage>996</lpage>
          . URL: https://aclanthology.org/2021.ranlp-1.112.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Soviany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <article-title>Curriculum learning: A survey</article-title>
          ,
          <year>2022</year>
          . arXiv:2101.10382.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          ,
          <source>in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . URL: https://aclanthology.org/W18-5446. doi:10.18653/v1/W18-5446.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nourbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Onorati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mastromattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fallucchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>The dark side of the language: Pre-trained transformers in the DarkNet</article-title>
          , in:
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angelova</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</source>
          , INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria,
          <year>2023</year>
          , pp.
          <fpage>949</fpage>
          -
          <lpage>960</lpage>
          . URL: https://aclanthology.org/2023.ranlp-1.102.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <article-title>Knowing knowledge: Epistemological study of knowledge in transformers</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>PreCog: Exploring the relation between memorization and performance in pre-trained language models</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>Modeling easiness for training transformers with curriculum learning</article-title>
          , in:
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angelova</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>937</fpage>
          -
          <lpage>948</lpage>
          . URL: https://aclanthology.org/2023.ranlp-1.101.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>