How far does the sequence of compositions impact Multilingual Pre-Training?

Leonardo Ranaldi 1, Giulia Pucci 2 and Fabio Massimo Zanzotto 3
1 School of Informatics, University of Edinburgh, UK
2 Department of Computing Science, University of Aberdeen, UK
3 Università degli Studi Roma "Tor Vergata", Roma, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
lranaldi@ed.ac.uk (L. Ranaldi); g.pucci.24@abdn.uk (G. Pucci); fabio.massimo.zanzotto@uniroma2.it (F. M. Zanzotto)

Abstract
An efficient strategy for pre-training language models is the concatenation of contiguous sequences of text of fixed length with causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To fill this gap, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous tokens in the same document or in ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy that aligns with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and requires fewer instances.

Keywords
Large Language Models, Pre-training Methods, Cross-lingual Generalisation

1. Introduction

Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. [1] in multilingual settings and later Zhao et al. [2] highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of the heterogeneity of pre-training sequences on cross-lingual generalisation, which potentially leads to significant advancements in understanding LLMs' learning properties.

In the pre-training of decoder-only architectures, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous text) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking [3], where the likelihood of each token is conditioned only on the previous tokens from the same document, and (ii) retrieval-based masking [2], where similar documents retrieved by retrieval systems condition the likelihood.

To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking combined with different packing approaches, varying the sequence composition of the documents in the pre-training chunks. To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian-English). Complementing the foundation approaches proposed in [1, 2], we operate via bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents (§2.1).
The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training (§3), negatively impacting the performance of the models in downstream tasks (§4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models' performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the operability of sequence composition when mixing documents from different corpora (in our case, in different languages as well). As also revealed by Zhao et al. [2], this is partly solved by UniChunk's avoidance of packing documents from different distributions, which improves the performance of causal masking models in downstream tasks but still does not allow individual sequences to be selected. Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving cross-lingual models' language modelling, in-context learning and generative capabilities while still using causal masking, thus paying a small fee for document sorting but achieving tangible results.

Our main findings can be summarised as follows:

• By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns represented in documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences by guaranteeing heterogeneity and reducing data (§3).
• We show important benefits for the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) (§4).
• In conclusion, we show that the retrieval-based packing method, by allowing a flexible sequence composition process, brings tangible benefits to unbalanced cross-lingual learning while using less pre-training data.

Figure 1: Packing strategies for pre-training chunk construction: Baseline randomly samples documents from all corpora to construct pre-training sequences, which can pack documents from different sources; Sequence-based randomly samples documents from a single source to construct a sequence; Retrieve-based operates via a ranking-based construction process. The bottom block represents a document Collector that caches a set of documents randomly sampled from the corpora.

2. Pre-Training Strategies

2.1. Packing Approaches

Let 𝒟_s denote a corpus, and let 𝒟 = ⋃_s 𝒟_s be the union of such corpora. Specifically, each corpus 𝒟_s is a set of documents 𝒟_s = {d_1, ..., d_|𝒟_s|}, where each d_i is defined as a sequence of tokens d_i = (x_1, ..., x_|d_i|).

The packing strategy involves first selecting a set of documents {d_i}_{i=1}^{n} from 𝒟 and then packing them into a chunk C with a fixed length |C| = L. The documents {d_i}_{i=1}^{n} are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, C is denoted as:

    C = {d_i ⊕ [eos] | i = 1, ..., n−1} ⊕ s(d_n),    (1)

where [eos] is the end-of-sentence token, s(·) truncates the last document such that |C| = L, and the content of the chunk C is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.
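To make the packing step concrete, the following minimal Python sketch builds one chunk as in Eq. (1): tokenised documents are sampled, joined with [eos] separators, and the last document is truncated so the chunk exactly fills the context window L. The function name, the EOS_ID value and the alpha argument are illustrative assumptions, not the authors' implementation.

import random
from typing import List

EOS_ID = 0  # hypothetical id of the [eos] separator token

def pack_chunk(corpus: List[List[int]], L: int, alpha: float = 1.0) -> List[int]:
    """Pack randomly sampled tokenised documents into one chunk of exactly L tokens."""
    # Work on a fraction of the corpus, mirroring |S| = floor(alpha * |D|).
    subset = random.sample(corpus, k=max(1, int(alpha * len(corpus))))
    chunk: List[int] = []
    while subset and len(chunk) < L:
        # Pop a document so it cannot be packed twice (it is "removed" from the pool).
        doc = subset.pop(random.randrange(len(subset)))
        chunk.extend(doc + [EOS_ID])  # d_i followed by [eos]
    return chunk[:L]  # s(.): truncate the last document so that |C| = L

For example, pack_chunk(tokenised_docs, L=512, alpha=0.5) would correspond to one RandomChunk-style chunk built from half of the corpus; a UniChunk-style chunk would simply restrict the corpus argument to the documents of a single source.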
Following the strategies proposed in [2], we use three strategies to sample the documents {d_i}_{i=1}^{n} from the dataset 𝒟 for composing pre-training chunks. In contrast to previous works, we use α ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒮 ⊆ 𝒟 with |𝒮| = ⌊α × |𝒟|⌋. We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:

Baseline. The common baseline approach, called RandomChunk, where documents d_i ∈ 𝒟 are sampled uniformly at random from the entire pre-training corpus 𝒟:

    ⨁(𝒟, α) = { ⨁_{i=1}^{n} d_i ⊕ [eos] | d_i ∼ Uniform(𝒮) },    (2)

where 𝒮 ⊆ 𝒟 and |𝒮| = ⌊α × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure 1.

Sequence-based. The UniChunk approach is sequence-based and respects the sequences of the corpora. Hence, each chunk is composed of documents from a single source corpus 𝒟_s:

    ⨁(𝒟_s, α) = { ⨁_{i=1}^{n} d_i ⊕ [eos] | d_i ∼ Uniform(𝒮_s) },    (3)

where 𝒮_s ⊆ 𝒟_s, |𝒮_s| = ⌊α × |𝒟_s|⌋ and 𝒟_s ⊆ 𝒟. This strategy avoids packing documents from different corpora and allows control over the amount of data utilised from each specific corpus, enhancing efficient usage of computational resources while preserving thematic coherence.

Ranking-based. To empower the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based [4]) to construct pre-training chunks, which we define as Bm25Chunk. Hence, given a document d_i ∈ 𝒟_s, a sequence of documents {d_i}_{i=1}^{n} is retrieved via d_{i+1} = Retrieve(d_i, 𝒟_s); here, Retrieve(d_i, 𝒟_s) collects the most similar documents to d_i from 𝒟_s using BM25 ranking.

However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟_s. To improve the efficiency of the retrieval step, a subset ℬ_s ⊆ 𝒟_s of the corpus 𝒟_s is used, reducing the computational complexity of retrieval as proposed in [2]. In particular, ℬ_s contains k documents uniformly sampled from 𝒟_s. To control the number of utilised documents, we operate via α, which regulates the fraction of k. Hence, we use ℬ_α ⊆ ℬ_s where |ℬ_α| = ⌊α × |ℬ_s|⌋. This buffer strategically serves as the retrieval source for constructing pre-training chunks:

    d_1 ∼ Uniform(ℬ_s),    d_{i+1} = Retrieve(d_i, ℬ_α).

After retrieving a sequence of documents {d_i}_{i=1}^{n} from ℬ_α to construct a chunk, the buffer is refilled by sampling novel documents from 𝒟_s.
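The retrieval step of Bm25Chunk could be sketched as follows, assuming the rank_bm25 Python package as one possible BM25 implementation; the buffer handling, helper names and parameters (alpha, n_docs) are illustrative and not taken from the paper.

import random
from rank_bm25 import BM25Okapi  # assumed BM25 implementation (pip install rank-bm25)

def bm25_chunk_order(buffer_docs, alpha=1.0, n_docs=4):
    """Return an ordered list of related documents drawn from a buffer.

    buffer_docs: list of documents, each a list of token strings.
    """
    # Restrict the buffer to the fraction alpha, mirroring |B_alpha| = floor(alpha * |B_s|).
    pool = random.sample(buffer_docs, k=max(1, int(alpha * len(buffer_docs))))
    bm25 = BM25Okapi(pool)                    # index the reduced buffer B_alpha
    current = random.choice(pool)             # d_1 sampled uniformly from the buffer
    ordered, used = [current], {id(current)}
    for _ in range(n_docs - 1):
        scores = bm25.get_scores(current)     # BM25 similarity of every buffered doc to d_i
        ranked = sorted(range(len(pool)), key=lambda j: scores[j], reverse=True)
        nxt = next((pool[j] for j in ranked if id(pool[j]) not in used), None)
        if nxt is None:
            break
        ordered.append(nxt)                   # d_{i+1} = Retrieve(d_i, B_alpha)
        used.add(id(nxt))
        current = nxt
    return ordered

The ordered documents would then be packed into a chunk as in Eq. (1), and the buffer refilled with novel documents from 𝒟_s.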
2.2. Masking Approaches

The masking strategy is the other critical stage of language model pre-training, defining how next-token prediction distributions are conditioned on the other tokens in a provided sequence.

Causal Masking. In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk C = (x_1, ..., x_|C|), the likelihood of C is given by:

    P(C) = ∏_{i=1}^{|C|} P(x_i | x_1, ..., x_{i−1}),

where P(x_i | x_1, ..., x_{i−1}) is the probability of the token x_i given the previous tokens x_1, ..., x_{i−1} in the chunk. During pre-training, causal masking therefore implies that, given a chunk C, the likelihood of each token in C is conditioned on all previous tokens, including those that stem from different documents.

Intra-Document Causal Masking. In intra-document causal masking, the probability of each token is influenced only by the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒮 ⊆ 𝒟 where |𝒮| = ⌊α × |𝒟|⌋, we construct the chunks C as defined in §2.1. The probability of each token d_{i,j} belonging to document d_i is conditioned only on the previous tokens within d_i:

    P(C) = ∏_{i=1}^{n} ∏_{j=1}^{|d_i|} P(d_{i,j} | d_{i,1}, ..., d_{i,j−1}),    (4)

where each d_i is one of the documents packed into C as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.
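A possible way to realise intra-document causal masking is to combine the usual lower-triangular causal mask with a block-diagonal document mask, so that each position attends only to earlier positions of its own document. The PyTorch sketch below is an illustrative assumption of such a construction (the EOS_ID constant and the function name are ours, not the authors'); standard causal masking corresponds to returning the triangular mask alone. Since only the attention pattern changes, the objective and runtime are essentially unaffected.

import torch

EOS_ID = 0  # hypothetical [eos] token id

def intra_document_mask(chunk_ids: torch.Tensor) -> torch.Tensor:
    """chunk_ids: (L,) token ids of one packed chunk.
    Returns an (L, L) boolean mask, True where attention is allowed."""
    L = chunk_ids.size(0)
    # Document index of each position: it increases by one right after every [eos].
    doc_id = torch.cumsum((chunk_ids == EOS_ID).long(), dim=0)
    doc_id = torch.roll(doc_id, shifts=1)  # shift so each [eos] stays with the document it closes
    doc_id[0] = 0                          # the first position belongs to the first document
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))   # standard causal mask
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)     # block cross-document attention
    return causal & same_doc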
3. Language Modeling Settings

Models. The implementation is based on GPT-2 [5]. We pre-train 124 million parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and the model parameters described in Appendix A.

Corpora & Settings. We combine three high-quality open-source corpora, namely C4, CulturaX, and Wikipedia (statistics are reported in Table 4). We construct the corpus 𝒟 by operating through the methods proposed in §2 on both 𝒟_En and 𝒟_It, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor α that operates during the construction of 𝒟_En and 𝒟_It.

4. Experiments

To analyse the operation of the proposed approaches, we evaluate the models' perplexity (§4.1), in-context learning (§4.2), understanding (§4.3) and question-answering capabilities (§4.4) under different configurations.

4.1. Perplexity

We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents where each document is treated independently; (ii) models pre-trained with an equal quantity of data scaled by a factor α ∈ {0.1, 0.25, 0.5, 0.75} and then evaluated on a held-out set of documents where each document is treated independently. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination [6]), the second experiment analyses the impact of the amount of data used.

The impact of Sequence Composition. Table 1 shows that Bm25Chunk achieves the lowest PPL among the three causal masking models, yielding a lower average PPL compared to RandomChunk (an improvement of more than about 5 points in both settings) and UniChunk (around 3.2 points in both settings). Increasing the correlation of documents in a sequence empowers the language modelling ability of the pre-trained models. When considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL overall compared to the models trained via causal masking.

L     Model         C4     CulturaX   Wiki    Avg.
256   RandomChunk   20.12  19.61      9.89    16.5
      UniChunk      18.83  15.65      8.56    14.3
      Bm25Chunk     14.96  15.07      5.23    11.4
      IntraDoc      14.04  13.57      5.08    10.7
512   RandomChunk   19.32  18.76      9.55    15.9
      UniChunk      18.22  15.11      7.89    13.4
      Bm25Chunk     13.85  13.27      5.02    10.7
      IntraDoc      12.98  13.07      4.39    10.0

Table 1: Evaluation of perplexity on a test set created by sampling the original pre-training corpora (Appendix D). L is the context window used for pre-training (next-token accuracy in Appendix B).

Generally, all methods obtain significantly lower PPLs on Wikipedia (particularly Bm25Chunk and IntraDoc). This phenomenon could imply that, when the pre-training sources are very common (a lower PPL indicates better-known text), these texts are more influenced by documents with different contexts (misleading contexts), and the proposed strategies can mitigate this problem.

The role of Quantity. Figure 2 shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when decreasing the amount of pre-training data. In fact, in both settings (Figure 2), it can be observed that the average PPL of RandomChunk and UniChunk decreases steadily as the amount of pre-training data used grows. While intra-document causal masking performs similarly to Bm25Chunk in low-resource settings (red and green lines in Figure 2), increasing α for intra-document causal masking reduces the PPL less consistently. Finally, it can be observed that Bm25Chunk reaches stable performance even with α = 0.75.

Figure 2: Average perplexities when decreasing the training set.
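The held-out evaluation described above can be sketched as follows, assuming a Hugging Face GPT-2-style causal language model: each document is scored independently and the per-token negative log-likelihoods are aggregated into a single perplexity. Model and tokenizer names are placeholders, not the exact evaluation code used in the paper.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def document_perplexity(model, tokenizer, documents, device="cpu"):
    """Average per-token perplexity over independently scored documents."""
    model.eval().to(device)
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in documents:
            ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
            if ids.size(1) < 2:
                continue  # nothing to predict for single-token documents
            # labels=ids makes the model return the mean next-token cross-entropy
            loss = model(ids, labels=ids).loss
            n = ids.size(1) - 1          # number of predicted tokens
            total_nll += loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

# Hypothetical usage:
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# ppl = document_perplexity(model, tokenizer, held_out_documents)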
4.2. In-Context Learning

Following Zhao et al. [2], we evaluate the in-context learning abilities of the models using GLUE-X [7] (SST2, CoLA and RTE), both in English and in Italian.

Table 2 reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 demonstrations for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models' in-context learning abilities.

L     Model         SST2   CoLA   RTE    Avg.
256   RandomChunk   50.53  60.62  24.76  45.33
      UniChunk      56.13  62.68  18.73  45.72
      Bm25Chunk     62.12  64.06  25.16  50.45
      IntraDoc      53.22  61.16  24.23  46.20
512   RandomChunk   55.13  62.85  36.38  51.38
      UniChunk      58.53  63.04  22.12  47.85
      Bm25Chunk     60.30  63.21  35.26  52.93
      IntraDoc      59.32  65.62  36.65  53.81

Table 2: Average in-context learning performance evaluated by text classification accuracy across three tasks. Accuracies for English and Italian are reported in Appendix E.

In Figure 3, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk has on-par accuracy with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results using different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the tighter levels of distraction in both settings, which use arbitrary packing strategies.

Figure 3: Average in-context learning accuracy using different numbers of input demonstrations.

4.3. Understanding & Commonsense

We evaluate the pre-trained models on natural language understanding and commonsense reasoning tasks (i.e., XSQuAD [8], XCOPA [9]) and question answering (i.e., MLQA [10]). It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing that it can operate on the in-context information provided in the input question.

However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately empowers understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, thereby adding distraction in context during pre-training, benefits in-context and understanding abilities.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   21.48  30.21  28.04  26.5
      UniChunk      23.97  32.19  27.16  27.7
      Bm25Chunk     28.18  33.97  27.26  29.8
      IntraDoc      33.63  38.05  30.51  34.0
512   RandomChunk   26.05  31.93  31.39  29.7
      UniChunk      27.14  33.34  31.22  30.5
      Bm25Chunk     30.71  35.82  34.85  33.7
      IntraDoc      32.42  37.71  36.04  35.2

Table 3: Evaluation results of natural language understanding, commonsense reasoning and QA tasks.

4.4. Multilinguality

To assess code-switching abilities, we experimented with cross-lingual input by operating with MLQA. We crossed the languages, delivering contexts in English and questions in Italian and vice versa (Appendix C). Figure 4 shows that Bm25Chunk outperforms both RandomChunk and UniChunk. At the same time, IntraDoc, as discussed in §4.3 for MLQA, outperforms Bm25Chunk. This result confirms that IntraDoc's performance is not only related to monolingual learning sequences but also to more complex dynamics.

Figure 4: Evaluation results of multilingual question answering when providing cross-lingual input (en-it means context in English and question in Italian, and vice versa, as described in Appendix C).

5. Conclusion

The sampling of pre-training data is a strategic component. We analyse the impact of sequencing by pre-training several language models on multilingual corpora. We showed that causal masking involves misleading documents that confound the pre-training of language models and impact the performance in downstream tasks. Hence, we find that improving sequence correlation in pre-training chunks reduces potential distractions while improving the performance of language models without reducing pre-training efficiency. In the future, we will study whether these findings also achieve benefits in fine-tuning pipelines [11, 12, 13, 14, 15, 16].
References

[1] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101.

[2] Y. Zhao, Y. Qu, K. Staniszewski, S. Tworkowski, W. Liu, P. Miłoś, Y. Wu, P. Minervini, Analysing the impact of sequence composition on language model pre-training, 2024. URL: https://arxiv.org/abs/2402.13991. arXiv:2402.13991.

[3] W. Shi, S. Min, M. Lomeli, C. Zhou, M. Li, V. Lin, N. A. Smith, L. Zettlemoyer, S. Yih, M. Lewis, In-context pretraining: Language modeling beyond document boundaries, ArXiv abs/2310.10638 (2023). URL: https://api.semanticscholar.org/CorpusID:264172290.

[4] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.

[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[6] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827. doi:10.18653/v1/2024.findings-acl.827.

[7] L. Yang, S. Zhang, L. Qin, Y. Li, Y. Wang, H. Liu, J. Wang, X. Xie, Y. Zhang, GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 12731–12750. URL: https://aclanthology.org/2023.findings-acl.806. doi:10.18653/v1/2023.findings-acl.806.

[8] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, X. Carreras, K. Duh (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics, 2016, pp. 2383–2392. URL: https://doi.org/10.18653/v1/d16-1264. doi:10.18653/v1/D16-1264.

[9] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, XCOPA: A multilingual dataset for causal commonsense reasoning, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2362–2376. URL: https://aclanthology.org/2020.emnlp-main.185. doi:10.18653/v1/2020.emnlp-main.185.

[10] P. Lewis, B. Oguz, R. Rinott, S. Riedel, H. Schwenk, MLQA: Evaluating cross-lingual extractive question answering, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7315–7330. URL: https://aclanthology.org/2020.acl-main.653. doi:10.18653/v1/2020.acl-main.653.

[11] L. Ranaldi, G. Pucci, Knowing knowledge: Epistemological study of knowledge in transformers, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.

[12] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.

[13] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A tree-of-thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.

[14] L. Ranaldi, G. Pucci, A. Freitas, Does the language matter? Curriculum learning over neo-Latin languages, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5212–5220. URL: https://aclanthology.org/2024.lrec-main.464.

[15] L. Ranaldi, A. Freitas, Aligning large and small language models via chain-of-thought reasoning, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 1812–1827. URL: https://aclanthology.org/2024.eacl-long.109.

[16] L. Ranaldi, A. Freitas, Self-refine instruction-tuning for aligning reasoning in language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 2325–2347. URL: https://aclanthology.org/2024.emnlp-main.139.
A. Pre-training Corpora

In our experiments, we use GPT-2 small, a 124 million parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the models with 256 and 512 context window sizes and pre-train the models on 20B tokens over 100,000 steps. We use the Adam optimiser with β1 = 0.90, β2 = 0.95, a weight decay of 0.1, and a cosine learning rate scheduler. The peak learning rate is 3 × 10^-4, decreasing to 3 × 10^-5 at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM.

Subset          # documents   # words
C4 (it)         ~8M           ~4B
CulturaX (it)   ~2.5M         ~2.6M
Wikipedia (it)  ~1.5M         ~780M
C4 (it)         ~8M           ~3.4B
CulturaX (it)   ~2.5M         ~2.1M
Wikipedia (it)  ~1.5M         ~760M

Table 4: Size of pre-training corpora. For computational reasons, we produced equivalent samples for both English and Italian.

B. Next Token Accuracy of Pre-Trained Language Models

In addition to PPL, we report the next-token accuracy of the pre-trained language models in Table 5. Specifically, we define Acc as:

    Acc = (1/N) ∑_{i=1}^{N} I(ŷ_i = y_i),    (5)

where:
• N is the total number of tokens in the test set;
• ŷ_i is the token predicted by the model at position i;
• y_i is the correct (ground truth) token at position i;
• I is the indicator function, which is 1 if ŷ_i = y_i and 0 otherwise.
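The next-token accuracy of Eq. (5) admits a direct implementation: greedy (argmax) predictions are compared with the ground-truth continuation, token by token. The sketch below makes the same assumptions as the perplexity sketch in §4.1 (a Hugging Face-style causal LM); function and variable names are illustrative.

import torch

@torch.no_grad()
def next_token_accuracy(model, input_ids: torch.Tensor) -> float:
    """input_ids: (1, T) token ids of one held-out document."""
    logits = model(input_ids).logits                 # (1, T, vocab_size)
    preds = logits[:, :-1, :].argmax(dim=-1)         # predicted tokens: greedy next-token guesses
    targets = input_ids[:, 1:]                       # ground-truth next tokens
    return (preds == targets).float().mean().item()  # (1/N) * sum over the indicator I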
L     Model         C4     CulturaX  Wikipedia  Avg.
256   RandomChunk   0.242  0.431     0.336      0.336
      UniChunk      0.248  0.463     0.415      0.375
      Bm25Chunk     0.332  0.451     0.424      0.402
      IntraDoc      0.357  0.472     0.442      0.423
512   RandomChunk   0.346  0.456     0.368      0.393
      UniChunk      0.389  0.462     0.405      0.419
      Bm25Chunk     0.419  0.493     0.423      0.445
      IntraDoc      0.440  0.498     0.463      0.467

Table 5: Evaluation of next-token accuracy on the proposed test set.

C. Multilingual Question Answering Examples

Lang  | Context | Question | Target Answer
en    | Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017. | Who was the 44th President of the United States? | Barack Obama
it    | Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017. | Chi è stato il 44º Presidente degli Stati Uniti? | Barack Obama
en-it | Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017. | Chi è stato il 44º Presidente degli Stati Uniti? | Barack Obama
it-en | Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017. | Who was the 44th President of the United States? | Barack Obama

Table 6: Examples from the MLQA dataset in English, Italian, and cross-lingual settings.
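The cross-lingual variants of Appendix C can be obtained by recombining the context and question of aligned English/Italian MLQA items, as in the sketch below; the field names ("context", "question", "answer") are illustrative assumptions about the data format.

def build_crossed_inputs(item_en: dict, item_it: dict) -> dict:
    """Return the four evaluation variants for one aligned MLQA example."""
    return {
        "en":    (item_en["context"], item_en["question"], item_en["answer"]),
        "it":    (item_it["context"], item_it["question"], item_it["answer"]),
        "en-it": (item_en["context"], item_it["question"], item_en["answer"]),  # English context, Italian question
        "it-en": (item_it["context"], item_en["question"], item_en["answer"]),  # Italian context, English question
    }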
D. In-context Learning Performances in English and Italian

This section reports the results obtained on the tasks introduced in Section 4.2. To conduct a more detailed analysis, we have used the original (English) and Italian versions of three tasks belonging to the GLUE family. We selected SST2, CoLA, and RTE. The bilingual versions were taken from the contribution previously proposed by Yang et al. [7].

L     Model         SST2-En  CoLA-En  RTE-En  Avg.
256   RandomChunk   51.34    61.73    25.71   46.26
      UniChunk      57.16    63.21    19.17   43.15
      Bm25Chunk     61.9     65.02    26.31   50.42
      IntraDoc      53.39    61.67    25.27   46.76
512   RandomChunk   55.49    63.42    38.19   52.46
      UniChunk      59.16    63.12    21.87   48.02
      Bm25Chunk     60.81    64.69    36.23   53.93
      IntraDoc      59.21    66.25    36.19   53.73

Table 7: In-context learning performance evaluated by text classification accuracy across three English tasks.

L     Model         SST2-It  CoLA-It  RTE-It  Avg.
256   RandomChunk   49.41    59.62    23.51   44.17
      UniChunk      55.13    62.92    18.32   46.76
      Bm25Chunk     61.24    63.07    23.92   49.40
      IntraDoc      52.93    60.81    23.92   46.08
512   RandomChunk   54.71    62.63    34.36   50.64
      UniChunk      57.92    62.94    22.46   47.82
      Bm25Chunk     59.83    63.38    34.25   52.36
      IntraDoc      59.06    65.23    35.16   52.55

Table 8: In-context learning performance evaluated by text classification accuracy across three Italian tasks.

E. Understanding and Commonsense Performances in English and Italian

This section reports the results obtained on the tasks introduced in Section 4.3. We have used the original (English) and Italian versions of MLQA, XCOPA, and SQuAD to conduct a more detailed analysis.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   22.63  30.71  30.52  30.22
      UniChunk      24.09  23.15  27.34  24.83
      Bm25Chunk     29.16  34.19  27.16  30.11
      IntraDoc      34.06  38.21  30.85  34.3
512   RandomChunk   26.63  32.16  31.82  30.32
      UniChunk      27.05  33.26  31.54  30.65
      Bm25Chunk     30.66  36.51  34.73  34.08
      IntraDoc      32.88  38.15  38.23  36.23

Table 9: Evaluation results of natural language understanding, commonsense reasoning and QA tasks in English.

L     Model         MLQA   XCOPA  SQuAD  Avg.
256   RandomChunk   20.33  29.62  30.18  29.31
      UniChunk      23.85  23.42  26.73  25.06
      Bm25Chunk     27.21  33.16  27.32  29.05
      IntraDoc      33.26  37.88  30.18  33.65
512   RandomChunk   25.88  31.78  30.97  x.x
      UniChunk      27.23  33.42  30.94  30.32
      Bm25Chunk     30.77  35.92  34.66  33.42
      IntraDoc      31.97  37.28  38.46  35.64

Table 10: Evaluation results of natural language understanding, commonsense reasoning and QA tasks in Italian.