<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How far does the sequence of compositions impact Multilingual Pre-Training?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, University of Aberdeen</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi Roma "Tor Vergata"</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An efficient strategy for conducting the pre-training of language models is the concatenation of contiguous sequences of text of fixed length through causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To address this issue, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous ones in the same document or ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy that aligns with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and allows for fewer instances.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Pre-training Methods</kwd>
        <kwd>Cross-lingual Generalisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. [1] in multilingual settings and later Zhao et al. [2] highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of the heterogeneity of pre-training sequences on cross-lingual generalisation. This potentially leads to significant advancements in understanding LLMs’ learning properties.</p>
      <p>In the pre-training of decoder-only architectures, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous text) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking [3], where the likelihood of each token is conditioned only on the previous tokens from the same document, and (ii) retrieval-based masking [2], where similar documents retrieved by a retrieval system condition the likelihood.</p>
      <p>To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking with different packing approaches, varying the sequence composition of the documents in the pre-training chunks. To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian and English). Complementing the foundation approaches proposed in [1, 2], we operate via bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents (§2.1).</p>
      <p>The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training (§3), negatively impacting the performance of the models in downstream tasks (§4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models’ performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the operability of sequence composition (this is the case in different languages as well). As revealed by Zhao et al. [2] as well, this is partly solved by UniChunk’s avoidance of packing documents from different distributions, which improves the performance of causal-masking models in downstream tasks but still does not allow individual sequences to be selected.</p>
      <p>Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving the models’ cross-lingual language modelling, in-context learning and generative capabilities while still using causal masking, thus paying a small fee for document sorting but achieving tangible results.</p>
      <p>Our main findings can be summarised as follows:</p>
      <p>• By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns represented in documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences by guaranteeing heterogeneity and reducing data (§3).</p>
      <p>• We show important benefits on the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) (§4).</p>
      <p>• In conclusion, we show that the retrieval-based packing method, allowing for a flexible sequence composition process, brings tangible benefits to unbalanced cross-lingual learning while using less pre-training data.</p>
      <p>[Figure 1: Overview of the packing strategies over the C4, Wikipedia and CulturaX corpora: baseline chunking packs randomly sampled documents, sequence-based chunking packs documents from a single source, and retrieve-based chunking starts from a document (e.g., doc-1) and retrieves related documents (e.g., doc-2) through an index collector built over the corpora.]</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Pre-Training Strategies</title>
      <sec id="sec-1a-1">
        <title>2.1. Packing Approaches</title>
        <p>Let D_i represent a corpus, and let 𝒟 = ⋃_i D_i denote the dataset resulting from the union of such corpora. Specifically, each corpus D_i is a set of documents D_i = {d_1, …, d_|D_i|}, where each d_j is defined as a sequence of tokens (w_1, …, w_|d_j|).</p>
        <p>The packing strategy involves first selecting a set of documents {d_i}, i = 1 … k, from 𝒟, and then packing them into a chunk C with a fixed length |C| = L. The documents are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, C is denoted as:</p>
        <p>C = {d_i ⊕ [eos] | i = 1 … k − 1} ⊕ s(d_k),   (1)</p>
        <p>where [eos] is the end-of-sentence token, s(d_k) truncates the last document such that |C| = L, and the content of the chunk C is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.</p>
        <p>Following the strategies proposed in [2], we use three strategies to sample the documents {d_i}, i = 1 … k, from the dataset 𝒟 for composing the pre-training chunks. In contrast to previous works, we use α ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒟_α ⊆ 𝒟 with |𝒟_α| = ⌊α × |𝒟|⌋.</p>
        <p>We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:</p>
        <p>Baseline The common baseline approach, called RandomChunk, samples documents d_i ∈ 𝒟 uniformly at random from the entire pre-training corpus:</p>
        <p>C(𝒟, k) = { ⨁_{i=1…k} d_i ⊕ [eos] | d_i ∼ Uniform(𝒟_α) },   (2)</p>
        <p>where 𝒟_α ⊆ 𝒟 and |𝒟_α| = ⌊α × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure 1.</p>
      <p>Causal Masking In causal masking, each token in a
sequence is predicted based on all previous tokens.
Specifically, given a chunk  = (1, . . . , ||), the likelihood
(2) of  is given by:</p>
      <p>||
 () = ∏︁  ( | 1, . . . , − 1),
=1
where  ⊆  and || = ⌊ × ||⌋ . As a result, in
RandomChunk, a chunk can contain documents from a
diferent source, as shown in Figure 1.
where  ( | 1, . . . , − 1) is the probability of the
token  given previous tokens 1, . . . , − 1 in the chunk.</p>
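        <p>For illustration, the sketch below (not from the paper; subsample, random_chunk_docs and uni_chunk_docs are hypothetical names) contrasts the two sampling policies under the fraction α:</p>
        <preformat>
# Illustrative sketch (not the authors' code) of the two sampling policies:
# RandomChunk samples from the union of corpora, UniChunk from a single
# source corpus; `alpha` keeps only the data fraction defined in the text.
import random

def subsample(corpus, alpha):
    k = int(alpha * len(corpus))           # |D_alpha| = floor(alpha * |D|)
    return random.sample(corpus, k)

def random_chunk_docs(corpora, alpha, k):
    pool = subsample([d for c in corpora for d in c], alpha)  # union of all corpora
    return random.sample(pool, k)          # documents may come from different sources

def uni_chunk_docs(corpora, alpha, k):
    source = random.choice(corpora)        # a single source corpus D_i
    pool = subsample(source, alpha)
    return random.sample(pool, k)          # documents share one origin
</preformat>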
        <p>Ranking-based To empower the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based [4]) to construct pre-training chunks, which we define as Bm25Chunk. Hence, given a document d_1 ∈ 𝒟, a sequence of documents {d_i}, i = 1 … k, is retrieved by d_{i+1} = Retrieve(d_i, 𝒟); here, Retrieve(d_i, 𝒟) collects the most similar documents to d_i from 𝒟 using BM25 ranking.</p>
        <p>However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟. To improve the efficiency of the retrieval step, a subset ℬ ⊆ 𝒟 of the corpus is used, reducing the computational complexity of retrieval as proposed in [2]. In particular, ℬ contains a fixed number of documents uniformly sampled from 𝒟. To control the number of utilised documents, we operate via α, which regulates the fraction of ℬ. Hence, we use ℬ_α ⊆ ℬ where |ℬ_α| = ⌊α × |ℬ|⌋. This approach strategically serves as the retrieval source for constructing pre-training chunks:</p>
        <p>d_1 ∼ Uniform(ℬ_α),   d_{i+1} = Retrieve(d_i, ℬ_α).</p>
        <p>After retrieving a sequence of documents {d_i}, i = 1 … k, from ℬ_α for constructing a chunk, the buffer is refilled by sampling novel documents from 𝒟.</p>
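        <p>The paper does not provide an implementation; the minimal Python sketch below, assuming the third-party rank_bm25 package as a stand-in BM25 index (bm25_chunk_order and the whitespace tokenisation are illustrative), shows how such a retrieval chain over a buffer could be realised:</p>
        <preformat>
# Illustrative sketch of Bm25Chunk-style retrieval packing (not the authors'
# implementation): starting from a seed document, repeatedly retrieve the most
# similar unused document with BM25 and return the resulting document order.
# Assumes k does not exceed the buffer size.
from rank_bm25 import BM25Okapi

def bm25_chunk_order(corpus_texts, k):
    tokenised = [doc.split() for doc in corpus_texts]      # naive whitespace tokenisation
    index = BM25Okapi(tokenised)                           # BM25 index over the buffer
    used = {0}
    order = [0]                                            # seed document d_1
    while len(order) != k:
        scores = index.get_scores(tokenised[order[-1]])    # d_{i+1} = Retrieve(d_i, .)
        best = max((j for j in range(len(scores)) if j not in used),
                   key=lambda j: scores[j])
        used.add(best)
        order.append(best)
    return order                                           # document order for one chunk
</preformat>
        <p>The returned order can then be fed to a chunk constructor such as the pack_chunk sketch above.</p>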
      </sec>
      <sec id="sec-1a-2">
        <title>2.2. Masking Approaches</title>
        <p>The masking strategy is the other critical stage of language model pre-training, which defines how next-token prediction distributions are conditioned on the previous tokens in a provided sequence.</p>
        <p>Causal Masking In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk C = (w_1, …, w_|C|), the likelihood of C is given by:</p>
        <p>P(C) = ∏_{t=1…|C|} P(w_t | w_1, …, w_{t−1}),</p>
        <p>where P(w_t | w_1, …, w_{t−1}) is the probability of the token w_t given the previous tokens w_1, …, w_{t−1} in the chunk. During pre-training, causal masking indicates that, given a chunk C, the likelihood of each token in C is conditioned on all previous tokens, including those that stem from different documents.</p>
        <p>Intra-Document Causal Masking In intra-document causal masking, the probability of each token is influenced only by the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒟_α ⊆ 𝒟 where |𝒟_α| = ⌊α × |𝒟|⌋, we construct the chunks C as defined in §2.1. The probability of each token w_t belonging to document d_i is only conditioned on the previous tokens within d_i:</p>
        <p>P(C) = ∏_{i=1…k} ∏_{t=1…|d_i|} P(w_t^(i) | w_1^(i), …, w_{t−1}^(i)),   (4)</p>
        <p>where each d_i is sampled from 𝒟_α as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.</p>
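        <p>As an illustration of Equation (4) (a sketch, not the authors' code; intra_document_causal_mask and doc_ids are hypothetical names), the Boolean mask below allows a token to attend only to earlier positions of the same document; dropping the document-identity condition recovers the standard causal mask described above:</p>
        <preformat>
# Illustrative sketch of the attention mask behind Eq. (4): position t may
# attend to position s only if s comes no later than t AND both positions
# belong to the same document within the packed chunk.
def intra_document_causal_mask(doc_ids):
    # doc_ids[t] is the index of the document that token t belongs to,
    # e.g. [0, 0, 0, 1, 1, 2] for a chunk packing three documents.
    n = len(doc_ids)
    allowed = [[False] * n for _ in range(n)]
    for t in range(n):
        for s in range(t + 1):                        # causal: s comes no later than t
            allowed[t][s] = doc_ids[s] == doc_ids[t]  # intra-document constraint
    return allowed
</preformat>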
      </sec>
    </sec>
    <sec id="sec-1b">
      <title>3. Language Modeling Settings</title>
      <p>Models The implementation is based on GPT-2 [5]. We pre-train 124-million-parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and the model parameters described in Appendix A.</p>
      <p>Corpora &amp; Settings We combine three high-quality open-source corpora (statistics are reported in Table 4): C4, CulturaX, and Wikipedia. We construct the corpus 𝒟 by operating through the methods proposed in §2 on both the English and the Italian corpora, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor α that operates during the construction of both corpora.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <sec id="sec-2-1">
        <title>2.2. Masking Approaches</title>
        <p>To analyse the operation of proposed approaches, we
The masking strategy is the other critical stage of lan- evaluate the model perplexities (§4.1), in-context
learnguage model pre-training, which defines how next-token ing (§4.2), understanding (§4.3) and question-answering
prediction distributions are conditioned on further to- capabilities (§4.4) under diferent configurations.
kens in a provided sequence. 1The statistics are reported in Table 4</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Perplexity</title>
        <p>We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents, where each document is treated independently; (ii) models pre-trained with an equal quantity of data scaled by a factor α, with α in {0.1, 0.25, 0.5, 0.75}, and then evaluated on the same kind of held-out set. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination [6]), the second experiment analyses the impact of the amount of data used.</p>
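        <p>For clarity, a minimal sketch of how the per-document perplexity can be computed (illustrative; document_perplexity, average_ppl and score_fn are hypothetical names, not from the paper):</p>
        <preformat>
# Illustrative sketch of per-document perplexity: PPL is the exponential of
# the mean negative log-likelihood over the tokens of a document.
import math

def document_perplexity(logprobs):
    # logprobs: log-probability assigned by the model to each token of one document
    nll = -sum(logprobs) / len(logprobs)     # mean negative log-likelihood
    return math.exp(nll)

def average_ppl(heldout_docs, score_fn):
    # score_fn(doc) is a hypothetical helper returning per-token log-probabilities;
    # each held-out document is treated independently, as described above.
    ppls = [document_perplexity(score_fn(doc)) for doc in heldout_docs]
    return sum(ppls) / len(ppls)
</preformat>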
        <p>The impact of Sequence Composition Table 1 shows that Bm25Chunk achieves the lowest PPL among the three causal-masking models, yielding a lower average PPL than RandomChunk (by more than about 5 points in both settings) and UniChunk (by around 3.2 points in both settings). Increasing the correlation of documents in a sequence empowers the language modelling ability of the pre-trained models. Moreover, when considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL compared to the models trained via causal masking.</p>
        <p>[Table 1: average perplexity (PPL) of RandomChunk, UniChunk, Bm25Chunk and IntraDoc for the 256 and 512 context windows.]</p>
        <p>The role of Quantity Figure 2 shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when the amount of pre-training data is decreased. In fact, in both settings (Figure 2), the average PPL of RandomChunk and UniChunk decreases as the amount of pre-training data used grows. While intra-document causal masking performs similarly to Bm25Chunk in low-resource settings (red and green lines in Figure 2), increasing α for intra-document causal masking reduces the PPL less consistently. Finally, Bm25Chunk reaches stable performance even with α = 0.75.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. In-Context Learning</title>
        <p>Following Zhao et al. [2], we evaluate the in-context
learning abilities of the models using GLUE-X [7] (SST2,
CoLA and RTE) both in English and Italian.</p>
        <p>Table 2 reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 demonstrations for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models’ in-context learning abilities.</p>
        <p>In Figure 3, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk has on-par accuracy with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results across the different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the higher levels of distraction in both settings, which use arbitrary packing strategies.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Understanding &amp; Commonsense</title>
        <p>We evaluate the pre-trained models on natural language understanding and commonsense reasoning tasks (i.e., XSQuAD [8], XCOPA [9]), and on question-answering (i.e., MLQA [10]). It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing that it can operate on the in-context information provided in the input question.</p>
        <p>However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately empowers understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, and thereby adding distraction during pre-training, benefits in-context and understanding abilities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>The role of pre-training sampling is a strategic
component. We analyse the impact of sequencing by
pre4.3. Understanding &amp; Commonsense training several language models on multilingual corpora.
We evaluate the pre-trained models on natural lan- We showed that causal masking involves misleading
docguage understanding, commonsense reasoning tasks (i.e., uments that confound the pre-training of language
modXSQuAD [8], XCOPA [9]), and question-answering (i.e., els and impact the performance in downstream tasks.
MLQA [10]). It emerges that Bm25Chunk outperforms Hence, we find that improving sequence correlation in
RandomChunk and UniChunk in all tasks, confirming that pre-training chunks reduces potential distractions while
increasing the similarity of documents in pre-training improving the performance of language models without
chunks improve understanding abilities. Specifically, reducing pre-training eficiency. In the future, we will
Bm25Chunk obtains a significantly better accuracy on study whether these findings archive benefits in
fineMLQA, showing it can operate in-context information tuning pipelines [11, 12, 13, 14, 15, 16] as well.
provided in the input question.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Pre-training Corpora</title>
      <p>In our experiments, we use GPT-2 small, the 124-million-parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the models with 256 and 512 context-window sizes and pre-train the models on 20B tokens for 100,000 steps. We use the Adam optimiser with β1 = 0.90, β2 = 0.95, a weight decay of 0.1, and a cosine learning-rate scheduler. The peak learning rate is 3 × 10⁻⁴, decreasing to 3 × 10⁻⁵ at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM each.</p>
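      <p>As an illustration, a possible instantiation of this setup with the Hugging Face transformers GPT-2 classes is sketched below; the framework, the warmup length and the vocabulary size are assumptions, since the paper does not specify them:</p>
      <preformat>
# Sketch of the pre-training configuration described above, using Hugging Face
# `transformers` GPT-2 classes as one possible realisation (assumption: the
# paper does not state the framework; warmup steps and vocab size are guesses).
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_cosine_schedule_with_warmup

config = GPT2Config(n_layer=12, n_embd=768, n_head=12, n_positions=256)  # or 512
model = GPT2LMHeadModel(config)                       # roughly 124M parameters

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
# Plain cosine decay to 0 is used here for simplicity; the paper decays to 3e-5.
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=1000,       # assumption
                                            num_training_steps=100_000)  # 100,000 steps
</preformat>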
      <p>[Table 4: approximate number of documents and words for each pre-training subset of C4, CulturaX and Wikipedia.]</p>
    </sec>
    <sec id="sec-5">
      <title>B. Next Token Accuracy of</title>
    </sec>
    <sec id="sec-6">
      <title>Pre-Trained Language Models</title>
      <p>In addition to PPL, we report the next-token accuracy of the pre-trained language models in Table 5. Specifically, we define Acc as:</p>
      <p>Acc = (1 / N) ∑_{t=1…N} I(ŷ_t = y_t),</p>
      <p>where ŷ_t is the token predicted by the model at position t, y_t is the correct (ground-truth) token at position t, and I is the indicator function, which is 1 if ŷ_t = y_t and 0 otherwise.</p>
      <p>[Table: example bilingual prompts used for the evaluation, e.g., "Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017." with target answer "Barack Obama".]</p>
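      <p>A minimal sketch of this metric (illustrative names, not the authors' code):</p>
      <preformat>
# Illustrative sketch of the next-token accuracy in Appendix B: the fraction
# of positions where the greedy prediction matches the ground-truth token.
def next_token_accuracy(predicted_ids, target_ids):
    assert len(predicted_ids) == len(target_ids)
    correct = sum(1 for p, y in zip(predicted_ids, target_ids) if p == y)
    return correct / len(target_ids)
</preformat>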
    </sec>
    <sec id="sec-7">
      <title>D. In-context Learning</title>
      <p>performances English and</p>
    </sec>
    <sec id="sec-8">
      <title>Italian</title>
      <p>This section reports the results obtained on the tasks
introduced in Section 4.2. To conduct a more detailed analysis,
we have used the original (English) and Italian versions of
three tasks belonging to the GLUE family. We selected SST2,
CoLA, and RTE. The bilingual versions were taken from the
contribution previously proposed by Yang et al. [7].</p>
      <p>[Tables: in-context learning accuracy on SST2, CoLA and RTE in English and Italian.]</p>
    </sec>
    <sec id="sec-9">
      <title>E. Understanding and</title>
    </sec>
    <sec id="sec-10">
      <title>Commonsense performances</title>
    </sec>
    <sec id="sec-11">
      <title>English and Italian</title>
      <p>This section reports the results obtained on the tasks
introduced in Section 4.3. We have used the original (English)
and Italian versions of MLQA, XCOPA, and SQuAD to
conduct a more detailed analysis.</p>
      <p>[Tables: accuracy on MLQA and XCOPA in English and Italian.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          (Eds.),
          <source>Findings of the Association for Com</source>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          , Modeling eas- putational
          <source>Linguistics: ACL</source>
          <year>2023</year>
          , Association
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          learning, in: R. Mitkov, G. Angelova (Eds.),
          <source>Pro- 2023</source>
          , pp.
          <fpage>12731</fpage>
          -
          <lpage>12750</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>ceedings of the 14th International Conference on org/2023.findings-acl.806. doi: 10</source>
          .18653/v1/
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>Recent Advances in Natural Language Processing, findings-acl.806.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>INCOMA</given-names>
            <surname>Ltd</surname>
          </string-name>
          .,
          <string-name>
            <surname>Shoumen</surname>
            , Bulgaria, Varna, Bulgaria, [8]
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
          </string-name>
          , Squad:
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <year>2023</year>
          , pp.
          <fpage>937</fpage>
          -
          <lpage>948</lpage>
          . URL: https://aclanthology.org/ 100, 000+ questions for machine comprehension
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2023.ranlp-
          <volume>1</volume>
          .101. of text, in: J.
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Carreras</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Duh (Eds.), [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Staniszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tworkowski</surname>
          </string-name>
          ,
          <source>Proceedings of the 2016 Conference on Empirical</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>impact of sequence composition on language model</source>
          <year>2016</year>
          , Austin, Texas, USA, November 1-
          <issue>4</issue>
          ,
          <year>2016</year>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>pre-training</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402. Association for Computational Linguistics,
          <year>2016</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          13991. arXiv:
          <volume>2402</volume>
          .13991. pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          . URL: https://doi.org/10.18653/v1/ [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <fpage>d16</fpage>
          -
          <lpage>1264</lpage>
          . doi:
          <volume>10</volume>
          .18653/V1/D16-1264.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Majewska</surname>
          </string-name>
          , Q. Liu,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>document boundaries</article-title>
          ,
          <source>ArXiv abs/2310</source>
          .10638 (
          <year>2023</year>
          ).
          <article-title>dataset for causal commonsense reasoning</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          264172290.
          <source>ings of the 2020 Conference on Empirical Meth</source>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , The probabilis- ods
          <source>in Natural Language Processing (EMNLP)</source>
          , As-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>tic relevance framework: Bm25 and beyond, sociation for Computational Linguistics</article-title>
          , Online,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Found. Trends</given-names>
            <surname>Inf</surname>
          </string-name>
          .
          <source>Retr</source>
          .
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          . URL:
          <year>2020</year>
          , pp.
          <fpage>2362</fpage>
          -
          <lpage>2376</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          https://doi.org/10.1561/1500000019. doi:
          <volume>10</volume>
          .1561/ org/2020.emnlp-main.
          <volume>185</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          1500000019. emnlp-main.
          <volume>185</volume>
          . [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          , J. D. [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rinott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , H. Schwenk,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , I. Sutskever, tics, Online,
          <year>2020</year>
          , pp.
          <fpage>7315</fpage>
          -
          <lpage>7330</lpage>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners, aclanthology</article-title>
          .org/
          <year>2020</year>
          .acl-main.
          <volume>653</volume>
          . doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , v1/
          <year>2020</year>
          .acl-main.
          <volume>653</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information</source>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          , G. Pucci, Knowing knowledge: Epis-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
          <article-title>temological study of knowledge in transform-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Inc.</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings. ers,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ). URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          neurips.cc/paper_files/paper/2020/file/ www.mdpi.com/2076-3417/13/2/677. doi:
          <volume>10</volume>
          .3390/
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. app13020677.</source>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Onorati</surname>
          </string-name>
          , L. Ranaldi, [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Pucci, Does the English matter? elicit</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>zotto, Investigating the impact of data contamina- D</article-title>
          . Ataman (Ed.),
          <source>Proceedings of the 3rd Workshop</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <article-title>Findings of the Association for Computational Lin-</article-title>
          pore,
          <year>2023</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>183</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>guistics ACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational org/</article-title>
          <year>2023</year>
          .mrl-
          <volume>1</volume>
          .14. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .mrl-1.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Bangkok, Thailand and virtual meeting,
          <volume>14</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <year>2024</year>
          , pp.
          <fpage>13909</fpage>
          -
          <lpage>13920</lpage>
          . URL: https://aclanthology. [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          org/
          <year>2024</year>
          .findings-acl.
          <volume>827</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          . F.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>A tree-of-thoughts to broaden</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>findings-acl.827. multi-step reasoning across languages</article-title>
          , in: K. Duh, [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Qin,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.), Findings of the Associ-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE-X:</surname>
          </string-name>
          Eval- ation
          <source>for Computational Linguistics: NAACL</source>
          <year>2024</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <article-title>from an out-of-distribution generalization perspec-</article-title>
          ico
          <string-name>
            <surname>City</surname>
          </string-name>
          , Mexico,
          <year>2024</year>
          , pp.
          <fpage>1229</fpage>
          -
          <lpage>1241</lpage>
          . URL:
          <article-title>https: Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017</article-title>
          .
          <article-title>Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017</article-title>
          .
          <article-title>Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017</article-title>
          .
          <article-title>Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017</article-title>
          .
          <article-title>Who was the 44th President of the United States? Chi è stato il 44º Presidente degli Stati Uniti? Chi è stato il 44º Presidente degli Stati Uniti? Who was the 44th President of the United States? Target Answer Barack Obama Barack Obama</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>