How far does the sequence of compositions impact Multilingual Pre-Training?

Leonardo Ranaldi 1, Giulia Pucci 2 and Fabio Massimo Zanzotto 3
1 School of Informatics, University of Edinburgh, UK
2 Department of Computing Science, University of Aberdeen, UK
3 Università degli Studi Roma "Tor Vergata", Roma, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
lranaldi@ed.ac.uk (L. Ranaldi); g.pucci.24@abdn.uk (G. Pucci); fabio.massimo.zanzotto@uniroma2.it (F. M. Zanzotto)

Abstract
An efficient strategy for pre-training language models is the concatenation of contiguous sequences of text of fixed length with causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To fill this gap, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous tokens in the same document or in ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy that aligns with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and requires fewer instances.

Keywords
Large Language Models, Pre-training Methods, Cross-lingual Generalisation

1. Introduction

Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. [1] in multilingual settings and later Zhao et al. [2] highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of the heterogeneity of pre-training sequences on cross-lingual generalisation, which potentially leads to significant advancements in understanding LLMs' learning properties.

In the pre-training of decoder-only architectures, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous text) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking [3], where the likelihood of each token is conditioned only on the previous tokens from the same document, and (ii) retrieval-based masking [2], where similar documents retrieved by retrieval systems condition the likelihood.

To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking combined with different packing approaches, varying the sequence composition of the documents in the pre-training chunks. To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian-English). Complementing the foundation approaches proposed in [1, 2], we operate via bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents (§2.1).
The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training (§3), negatively impacting the performance of the models in downstream tasks (§4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models' performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the operability of sequence composition when mixing documents from different corpora (in our case, in different languages as well). As also revealed by Zhao et al. [2], this is partly solved by UniChunk's avoidance of packing documents from different distributions, which improves the performance of causal masking models in downstream tasks but still does not allow individual sequences to be selected. Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving cross-lingual models' language modelling, in-context learning and generative capabilities while still using causal masking, thus paying a small fee for document sorting but achieving tangible results.

Our main findings can be summarised as follows:

• By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns represented in documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences by guaranteeing heterogeneity and reducing data (§3).
• We show important benefits for the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) (§4).
• In conclusion, we show that the retrieval-based packing method, by allowing a flexible sequence composition process, brings tangible benefits to unbalanced cross-lingual learning while using less pre-training data.

Figure 1: Packing strategies for pre-training chunk construction: Baseline randomly samples documents from all corpora to construct pre-training sequences, which can pack documents from different sources; Sequence-based randomly samples documents from a single source to construct a sequence; Retrieve-based operates via a ranking-based construction process. The bottom block represents a document Collector that caches a set of documents randomly sampled from the corpora.

2. Pre-Training Strategies

2.1. Packing Approaches

Let 𝒟_s denote a corpus, and let 𝒟 = ⋃_s 𝒟_s be the union of such corpora. Specifically, each corpus 𝒟_s is a set of documents 𝒟_s = {d_1, ..., d_|𝒟_s|}, where each d_i is defined as a sequence of tokens d_i = (x_1, ..., x_|d_i|).

The packing strategy involves first selecting a set of documents {d_i}_{i=1}^{n} from 𝒟 and then packing them into a chunk C with a fixed length |C| = L. The documents {d_i}_{i=1}^{n} are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, C is denoted as:

    C = {d_i ⊕ [eos] | i = 1, ..., n−1} ⊕ s(d_n),    (1)

where [eos] is the end-of-sentence token, s(·) truncates the last document such that |C| = L, and the content of the chunk C is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.
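To make the packing step concrete, the following minimal Python sketch builds one chunk as in Eq. (1): tokenised documents are sampled, joined with [eos] separators, and the last document is truncated so the chunk exactly fills the context window L. The function name, the EOS_ID value and the alpha argument are illustrative assumptions, not the authors' implementation.

import random
from typing import List

EOS_ID = 0  # hypothetical id of the [eos] separator token

def pack_chunk(corpus: List[List[int]], L: int, alpha: float = 1.0) -> List[int]:
    """Pack randomly sampled tokenised documents into one chunk of exactly L tokens."""
    # Work on a fraction of the corpus, mirroring |S| = floor(alpha * |D|).
    subset = random.sample(corpus, k=max(1, int(alpha * len(corpus))))
    chunk: List[int] = []
    while subset and len(chunk) < L:
        # Pop a document so it cannot be packed twice (it is "removed" from the pool).
        doc = subset.pop(random.randrange(len(subset)))
        chunk.extend(doc + [EOS_ID])  # d_i followed by [eos]
    return chunk[:L]  # s(.): truncate the last document so that |C| = L

For example, pack_chunk(tokenised_docs, L=512, alpha=0.5) would correspond to one RandomChunk-style chunk built from half of the corpus; a UniChunk-style chunk would simply restrict the corpus argument to the documents of a single source.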
Following the strategies proposed in [2], we use three strategies to sample the documents {d_i}_{i=1}^{n} from the dataset 𝒟 for composing pre-training chunks. In contrast to previous works, we use α ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒮 ⊆ 𝒟 with |𝒮| = ⌊α × |𝒟|⌋. We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:

Baseline. The common baseline approach, called RandomChunk, where documents d_i ∈ 𝒟 are sampled uniformly at random from the entire pre-training corpus 𝒟:

    ⨁(𝒟, α) = { ⨁_{i=1}^{n} d_i ⊕ [eos] | d_i ∼ Uniform(𝒮) },    (2)

where 𝒮 ⊆ 𝒟 and |𝒮| = ⌊α × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure 1.

Sequence-based. The UniChunk approach is sequence-based and respects the sequences of the corpora. Hence, each chunk is composed of documents from a single source corpus 𝒟_s:

    ⨁(𝒟_s, α) = { ⨁_{i=1}^{n} d_i ⊕ [eos] | d_i ∼ Uniform(𝒮_s) },    (3)

where 𝒮_s ⊆ 𝒟_s, |𝒮_s| = ⌊α × |𝒟_s|⌋ and 𝒟_s ⊆ 𝒟. This strategy avoids packing documents from different corpora and allows control over the amount of data utilised from each specific corpus, enhancing efficient usage of computational resources while preserving thematic coherence.

Ranking-based. To empower the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based [4]) to construct pre-training chunks, which we define as Bm25Chunk. Hence, given a document d_i ∈ 𝒟_s, a sequence of documents {d_i}_{i=1}^{n} is retrieved via d_{i+1} = Retrieve(d_i, 𝒟_s); here, Retrieve(d_i, 𝒟_s) collects the most similar documents to d_i from 𝒟_s using BM25 ranking.

However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟_s. To improve the efficiency of the retrieval step, a subset ℬ_s ⊆ 𝒟_s of the corpus 𝒟_s is used, reducing the computational complexity of retrieval as proposed in [2]. In particular, ℬ_s contains k documents uniformly sampled from 𝒟_s. To control the number of utilised documents, we operate via α, which regulates the fraction of k. Hence, we use ℬ_α ⊆ ℬ_s where |ℬ_α| = ⌊α × |ℬ_s|⌋. This buffer strategically serves as the retrieval source for constructing pre-training chunks:

    d_1 ∼ Uniform(ℬ_s),    d_{i+1} = Retrieve(d_i, ℬ_α).

After retrieving a sequence of documents {d_i}_{i=1}^{n} from ℬ_α to construct a chunk, the buffer is refilled by sampling novel documents from 𝒟_s.
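The retrieval step of Bm25Chunk could be sketched as follows, assuming the rank_bm25 Python package as one possible BM25 implementation; the buffer handling, helper names and parameters (alpha, n_docs) are illustrative and not taken from the paper.

import random
from rank_bm25 import BM25Okapi  # assumed BM25 implementation (pip install rank-bm25)

def bm25_chunk_order(buffer_docs, alpha=1.0, n_docs=4):
    """Return an ordered list of related documents drawn from a buffer.

    buffer_docs: list of documents, each a list of token strings.
    """
    # Restrict the buffer to the fraction alpha, mirroring |B_alpha| = floor(alpha * |B_s|).
    pool = random.sample(buffer_docs, k=max(1, int(alpha * len(buffer_docs))))
    bm25 = BM25Okapi(pool)                    # index the reduced buffer B_alpha
    current = random.choice(pool)             # d_1 sampled uniformly from the buffer
    ordered, used = [current], {id(current)}
    for _ in range(n_docs - 1):
        scores = bm25.get_scores(current)     # BM25 similarity of every buffered doc to d_i
        ranked = sorted(range(len(pool)), key=lambda j: scores[j], reverse=True)
        nxt = next((pool[j] for j in ranked if id(pool[j]) not in used), None)
        if nxt is None:
            break
        ordered.append(nxt)                   # d_{i+1} = Retrieve(d_i, B_alpha)
        used.add(id(nxt))
        current = nxt
    return ordered

The ordered documents would then be packed into a chunk as in Eq. (1), and the buffer refilled with novel documents from 𝒟_s.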
2.2. Masking Approaches

The masking strategy is the other critical stage of language model pre-training, defining how next-token prediction distributions are conditioned on the other tokens in a provided sequence.

Causal Masking. In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk C = (x_1, ..., x_|C|), the likelihood of C is given by:

    P(C) = ∏_{i=1}^{|C|} P(x_i | x_1, ..., x_{i−1}),

where P(x_i | x_1, ..., x_{i−1}) is the probability of the token x_i given the previous tokens x_1, ..., x_{i−1} in the chunk. During pre-training, causal masking therefore implies that, given a chunk C, the likelihood of each token in C is conditioned on all previous tokens, including those that stem from different documents.

Intra-Document Causal Masking. In intra-document causal masking, the probability of each token is influenced only by the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒮 ⊆ 𝒟 where |𝒮| = ⌊α × |𝒟|⌋, we construct the chunks C as defined in §2.1. The probability of each token d_{i,j} belonging to document d_i is conditioned only on the previous tokens within d_i:

    P(C) = ∏_{i=1}^{n} ∏_{j=1}^{|d_i|} P(d_{i,j} | d_{i,1}, ..., d_{i,j−1}),    (4)

where each d_i is one of the documents packed into C as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.
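A possible way to realise intra-document causal masking is to combine the usual lower-triangular causal mask with a block-diagonal document mask, so that each position attends only to earlier positions of its own document. The PyTorch sketch below is an illustrative assumption of such a construction (the EOS_ID constant and the function name are ours, not the authors'); standard causal masking corresponds to returning the triangular mask alone. Since only the attention pattern changes, the objective and runtime are essentially unaffected.

import torch

EOS_ID = 0  # hypothetical [eos] token id

def intra_document_mask(chunk_ids: torch.Tensor) -> torch.Tensor:
    """chunk_ids: (L,) token ids of one packed chunk.
    Returns an (L, L) boolean mask, True where attention is allowed."""
    L = chunk_ids.size(0)
    # Document index of each position: it increases by one right after every [eos].
    doc_id = torch.cumsum((chunk_ids == EOS_ID).long(), dim=0)
    doc_id = torch.roll(doc_id, shifts=1)  # shift so each [eos] stays with the document it closes
    doc_id[0] = 0                          # the first position belongs to the first document
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))   # standard causal mask
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)     # block cross-document attention
    return causal & same_doc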
3. Language Modeling Settings

Models. The implementation is based on GPT-2 [5]. We pre-train 124 million parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and the model parameters described in Appendix A.

Corpora & Settings. We combine three high-quality open-source corpora, namely C4, CulturaX, and Wikipedia (statistics are reported in Table 4). We construct the corpus 𝒟 by operating through the methods proposed in §2 on both 𝒟_En and 𝒟_It, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor α that operates during the construction of 𝒟_En and 𝒟_It.

4. Experiments

To analyse the operation of the proposed approaches, we evaluate the models' perplexity (§4.1), in-context learning (§4.2), understanding (§4.3) and question-answering capabilities (§4.4) under different configurations.

4.1. Perplexity

We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents where each document is treated independently; (ii) models pre-trained with an equal quantity of data scaled by a factor α ∈ {0.1, 0.25, 0.5, 0.75} and then evaluated on a held-out set of documents where each document is treated independently. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination [6]), the second experiment analyses the impact of the amount of data used.

The impact of Sequence Composition. Table 1 shows that Bm25Chunk achieves the lowest PPL among the three causal masking models, yielding a lower average PPL compared to RandomChunk (an improvement of more than about 5 points in both settings) and UniChunk (around 3.2 points in both settings). Increasing the correlation of documents in a sequence empowers the language modelling ability of the pre-trained models. When considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL overall compared to the models trained via causal masking.

L     Model         C4     CulturaX   Wiki    Avg.
256   RandomChunk   20.12  19.61      9.89    16.5
      UniChunk      18.83  15.65      8.56    14.3
      Bm25Chunk     14.96  15.07      5.23    11.4
      IntraDoc      14.04  13.57      5.08    10.7
512   RandomChunk   19.32  18.76      9.55    15.9
      UniChunk      18.22  15.11      7.89    13.4
      Bm25Chunk     13.85  13.27      5.02    10.7
      IntraDoc      12.98  13.07      4.39    10.0

Table 1: Evaluation of perplexity on a test set created by sampling the original pre-training corpora (Appendix D). L is the context window used for pre-training (next-token accuracy in Appendix B).

Generally, all methods obtain significantly lower PPLs on Wikipedia (particularly Bm25Chunk and IntraDoc). This phenomenon could imply that, when the pre-training sources are very common (a lower PPL indicates better-known text), these texts are more influenced by documents with different contexts (misleading contexts), and the proposed strategies can mitigate this problem.

The role of Quantity. Figure 2 shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when decreasing the amount of pre-training data. In fact, in both settings (Figure 2), it can be observed that the average PPL of RandomChunk and UniChunk decreases steadily as the amount of pre-training data used grows. While intra-document causal masking performs similarly to Bm25Chunk in low-resource settings (red and green lines in Figure 2), increasing α for intra-document causal masking reduces the PPL less consistently. Finally, it can be observed that Bm25Chunk reaches stable performance even with α = 0.75.

Figure 2: Average perplexities when decreasing the training set.
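The held-out evaluation described above can be sketched as follows, assuming a Hugging Face GPT-2-style causal language model: each document is scored independently and the per-token negative log-likelihoods are aggregated into a single perplexity. Model and tokenizer names are placeholders, not the exact evaluation code used in the paper.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def document_perplexity(model, tokenizer, documents, device="cpu"):
    """Average per-token perplexity over independently scored documents."""
    model.eval().to(device)
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in documents:
            ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
            if ids.size(1) < 2:
                continue  # nothing to predict for single-token documents
            # labels=ids makes the model return the mean next-token cross-entropy
            loss = model(ids, labels=ids).loss
            n = ids.size(1) - 1          # number of predicted tokens
            total_nll += loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

# Hypothetical usage:
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# ppl = document_perplexity(model, tokenizer, held_out_documents)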
4.2. In-Context Learning

Following Zhao et al. [2], we evaluate the in-context learning abilities of the models using GLUE-X [7] (SST2, CoLA and RTE), both in English and in Italian.

Table 2 reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 demonstrations for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models' in-context learning abilities.

L     Model         SST2   CoLA   RTE    Avg.
256   RandomChunk   50.53  60.62  24.76  45.33
      UniChunk      56.13  62.68  18.73  45.72
      Bm25Chunk     62.12  64.06  25.16  50.45
      IntraDoc      53.22  61.16  24.23  46.20
512   RandomChunk   55.13  62.85  36.38  51.38
      UniChunk      58.53  63.04  22.12  47.85
      Bm25Chunk     60.30  63.21  35.26  52.93
      IntraDoc      59.32  65.62  36.65  53.81

Table 2: Average in-context learning performance evaluated by text classification accuracy across three tasks. Accuracies for English and Italian are reported in Appendix E.

In Figure 3, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk has on-par accuracy with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results using different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the tighter levels of distraction in both settings, which use arbitrary packing strategies.

Figure 3: Average in-context learning accuracy using different numbers of input demonstrations.

4.3. Understanding & Commonsense

We evaluate the pre-trained models on natural language understanding and commonsense reasoning tasks (i.e., XSQuAD [8], XCOPA [9]) and question answering (i.e., MLQA [10]). It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing that it can operate on the in-context information provided in the input question.

However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately empowers understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, thereby adding distraction in context during pre-training, benefits in-context and understanding abilities.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   21.48  30.21  28.04  26.5
      UniChunk      23.97  32.19  27.16  27.7
      Bm25Chunk     28.18  33.97  27.26  29.8
      IntraDoc      33.63  38.05  30.51  34.0
512   RandomChunk   26.05  31.93  31.39  29.7
      UniChunk      27.14  33.34  31.22  30.5
      Bm25Chunk     30.71  35.82  34.85  33.7
      IntraDoc      32.42  37.71  36.04  35.2

Table 3: Evaluation results of natural language understanding, commonsense reasoning and QA tasks.

4.4. Multilinguality

To assess code-switching abilities, we experimented with cross-lingual input by operating with MLQA. We crossed the languages, delivering contexts in English and questions in Italian and vice versa (Appendix C). Figure 4 shows that Bm25Chunk outperforms both RandomChunk and UniChunk. At the same time, IntraDoc, as discussed in §4.3 for MLQA, outperforms Bm25Chunk. This result confirms that IntraDoc's performance is not only related to monolingual learning sequences but also to more complex dynamics.

Figure 4: Evaluation results of multilingual question answering when providing cross-lingual input (en-it means context in English and question in Italian, and vice versa, as described in Appendix C).

5. Conclusion

The sampling of pre-training data is a strategic component. We analyse the impact of sequencing by pre-training several language models on multilingual corpora. We showed that causal masking involves misleading documents that confound the pre-training of language models and impact the performance in downstream tasks. Hence, we find that improving sequence correlation in pre-training chunks reduces potential distractions while improving the performance of language models without reducing pre-training efficiency. In the future, we will study whether these findings also achieve benefits in fine-tuning pipelines [11, 12, 13, 14, 15, 16].
References

[1] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101.

[2] Y. Zhao, Y. Qu, K. Staniszewski, S. Tworkowski, W. Liu, P. Miłoś, Y. Wu, P. Minervini, Analysing the impact of sequence composition on language model pre-training, 2024. URL: https://arxiv.org/abs/2402.13991. arXiv:2402.13991.

[3] W. Shi, S. Min, M. Lomeli, C. Zhou, M. Li, V. Lin, N. A. Smith, L. Zettlemoyer, S. Yih, M. Lewis, In-context pretraining: Language modeling beyond document boundaries, ArXiv abs/2310.10638 (2023). URL: https://api.semanticscholar.org/CorpusID:264172290.

[4] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.

[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[6] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827. doi:10.18653/v1/2024.findings-acl.827.

[7] L. Yang, S. Zhang, L. Qin, Y. Li, Y. Wang, H. Liu, J. Wang, X. Xie, Y. Zhang, GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 12731–12750. URL: https://aclanthology.org/2023.findings-acl.806. doi:10.18653/v1/2023.findings-acl.806.

[8] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, X. Carreras, K. Duh (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics, 2016, pp. 2383–2392. URL: https://doi.org/10.18653/v1/d16-1264. doi:10.18653/v1/D16-1264.

[9] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, XCOPA: A multilingual dataset for causal commonsense reasoning, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2362–2376. URL: https://aclanthology.org/2020.emnlp-main.185. doi:10.18653/v1/2020.emnlp-main.185.

[10] P. Lewis, B. Oguz, R. Rinott, S. Riedel, H. Schwenk, MLQA: Evaluating cross-lingual extractive question answering, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7315–7330. URL: https://aclanthology.org/2020.acl-main.653. doi:10.18653/v1/2020.acl-main.653.

[11] L. Ranaldi, G. Pucci, Knowing knowledge: Epistemological study of knowledge in transformers, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.

[12] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.

[13] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A tree-of-thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.

[14] L. Ranaldi, G. Pucci, A. Freitas, Does the language matter? Curriculum learning over neo-Latin languages, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5212–5220. URL: https://aclanthology.org/2024.lrec-main.464.

[15] L. Ranaldi, A. Freitas, Aligning large and small language models via chain-of-thought reasoning, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 1812–1827. URL: https://aclanthology.org/2024.eacl-long.109.

[16] L. Ranaldi, A. Freitas, Self-refine instruction-tuning for aligning reasoning in language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 2325–2347. URL: https://aclanthology.org/2024.emnlp-main.139.
A. Pre-training Corpora

In our experiments, we use GPT-2 small, a 124 million parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the models with 256 and 512 context window sizes and pre-train the models on 20B tokens over 100,000 steps. We use the Adam optimiser with β1 = 0.90, β2 = 0.95, a weight decay of 0.1, and a cosine learning rate scheduler. The peak learning rate is 3 × 10^-4, decreasing to 3 × 10^-5 at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM.

Subset          # documents   # words
C4 (it)         ~8M           ~4B
CulturaX (it)   ~2.5M         ~2.6M
Wikipedia (it)  ~1.5M         ~780M
C4 (it)         ~8M           ~3.4B
CulturaX (it)   ~2.5M         ~2.1M
Wikipedia (it)  ~1.5M         ~760M

Table 4: Size of pre-training corpora. For computational reasons, we produced equivalent samples for both English and Italian.

B. Next Token Accuracy of Pre-Trained Language Models

In addition to PPL, we report the next-token accuracy of the pre-trained language models in Table 5. Specifically, we define Acc as:

    Acc = (1/N) ∑_{i=1}^{N} I(ŷ_i = y_i),    (5)

where:
• N is the total number of tokens in the test set;
• ŷ_i is the token predicted by the model at position i;
• y_i is the correct (ground truth) token at position i;
• I is the indicator function, which is 1 if ŷ_i = y_i and 0 otherwise.
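The next-token accuracy of Eq. (5) admits a direct implementation: greedy (argmax) predictions are compared with the ground-truth continuation, token by token. The sketch below makes the same assumptions as the perplexity sketch in §4.1 (a Hugging Face-style causal LM); function and variable names are illustrative.

import torch

@torch.no_grad()
def next_token_accuracy(model, input_ids: torch.Tensor) -> float:
    """input_ids: (1, T) token ids of one held-out document."""
    logits = model(input_ids).logits                 # (1, T, vocab_size)
    preds = logits[:, :-1, :].argmax(dim=-1)         # predicted tokens: greedy next-token guesses
    targets = input_ids[:, 1:]                       # ground-truth next tokens
    return (preds == targets).float().mean().item()  # (1/N) * sum over the indicator I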
L     Model         C4     CulturaX  Wikipedia  Avg.
256   RandomChunk   0.242  0.431     0.336      0.336
      UniChunk      0.248  0.463     0.415      0.375
      Bm25Chunk     0.332  0.451     0.424      0.402
      IntraDoc      0.357  0.472     0.442      0.423
512   RandomChunk   0.346  0.456     0.368      0.393
      UniChunk      0.389  0.462     0.405      0.419
      Bm25Chunk     0.419  0.493     0.423      0.445
      IntraDoc      0.440  0.498     0.463      0.467

Table 5: Evaluation of next-token accuracy on the proposed test set.

C. Multilingual Question Answering Examples

Lang  | Context | Question | Target Answer
en    | Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017. | Who was the 44th President of the United States? | Barack Obama
it    | Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017. | Chi è stato il 44º Presidente degli Stati Uniti? | Barack Obama
en-it | Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017. | Chi è stato il 44º Presidente degli Stati Uniti? | Barack Obama
it-en | Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017. | Who was the 44th President of the United States? | Barack Obama

Table 6: Examples from the MLQA dataset in English, Italian, and cross-lingual settings.
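The cross-lingual variants of Appendix C can be obtained by recombining the context and question of aligned English/Italian MLQA items, as in the sketch below; the field names ("context", "question", "answer") are illustrative assumptions about the data format.

def build_crossed_inputs(item_en: dict, item_it: dict) -> dict:
    """Return the four evaluation variants for one aligned MLQA example."""
    return {
        "en":    (item_en["context"], item_en["question"], item_en["answer"]),
        "it":    (item_it["context"], item_it["question"], item_it["answer"]),
        "en-it": (item_en["context"], item_it["question"], item_en["answer"]),  # English context, Italian question
        "it-en": (item_it["context"], item_en["question"], item_en["answer"]),  # Italian context, English question
    }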
D. In-context Learning Performances in English and Italian

This section reports the results obtained on the tasks introduced in Section 4.2. To conduct a more detailed analysis, we have used the original (English) and Italian versions of three tasks belonging to the GLUE family. We selected SST2, CoLA, and RTE. The bilingual versions were taken from the contribution previously proposed by Yang et al. [7].

L     Model         SST2-En  CoLA-En  RTE-En  Avg.
256   RandomChunk   51.34    61.73    25.71   46.26
      UniChunk      57.16    63.21    19.17   43.15
      Bm25Chunk     61.9     65.02    26.31   50.42
      IntraDoc      53.39    61.67    25.27   46.76
512   RandomChunk   55.49    63.42    38.19   52.46
      UniChunk      59.16    63.12    21.87   48.02
      Bm25Chunk     60.81    64.69    36.23   53.93
      IntraDoc      59.21    66.25    36.19   53.73

Table 7: In-context learning performance evaluated by text classification accuracy across three English tasks.

L     Model         SST2-It  CoLA-It  RTE-It  Avg.
256   RandomChunk   49.41    59.62    23.51   44.17
      UniChunk      55.13    62.92    18.32   46.76
      Bm25Chunk     61.24    63.07    23.92   49.40
      IntraDoc      52.93    60.81    23.92   46.08
512   RandomChunk   54.71    62.63    34.36   50.64
      UniChunk      57.92    62.94    22.46   47.82
      Bm25Chunk     59.83    63.38    34.25   52.36
      IntraDoc      59.06    65.23    35.16   52.55

Table 8: In-context learning performance evaluated by text classification accuracy across three Italian tasks.

E. Understanding and Commonsense Performances in English and Italian

This section reports the results obtained on the tasks introduced in Section 4.3. We have used the original (English) and Italian versions of MLQA, XCOPA, and SQuAD to conduct a more detailed analysis.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   22.63  30.71  30.52  30.22
      UniChunk      24.09  23.15  27.34  24.83
      Bm25Chunk     29.16  34.19  27.16  30.11
      IntraDoc      34.06  38.21  30.85  34.3
512   RandomChunk   26.63  32.16  31.82  30.32
      UniChunk      27.05  33.26  31.54  30.65
      Bm25Chunk     30.66  36.51  34.73  34.08
      IntraDoc      32.88  38.15  38.23  36.23

Table 9: Evaluation results of natural language understanding, commonsense reasoning and QA tasks in English.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   20.33  29.62  30.18  29.31
      UniChunk      23.85  23.42  26.73  25.06
      Bm25Chunk     27.21  33.16  27.32  29.05
      IntraDoc      33.26  37.88  30.18  33.65
512   RandomChunk   25.88  31.78  30.97  x.x
      UniChunk      27.23  33.42  30.94  30.32
      Bm25Chunk     30.77  35.92  34.66  33.42
      IntraDoc      31.97  37.28  38.46  35.64

Table 10: Evaluation results of natural language understanding, commonsense reasoning and QA tasks in Italian.