<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">How far does the sequence of compositions impact Multilingual Pre-Training?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Leonardo</forename><surname>Ranaldi</surname></persName>
							<email>lranaldi@ed.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">University of Edinburgh</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giulia</forename><surname>Pucci</surname></persName>
							<email>g.pucci.24@abdn.uk</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computing Science</orgName>
								<orgName type="institution">University of Aberdeen</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><forename type="middle">Massimo</forename><surname>Zanzotto</surname></persName>
							<email>fabio.massimo.zanzotto@uniroma2.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Università degli Studi Roma &quot;Tor Vergata&quot;</orgName>
								<address>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">How far does the sequence of compositions impact Multilingual Pre-Training?</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D9B800C342C8F23200719B6BAEF933AD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Pre-training Methods</term>
					<term>Cross-lingual Generalisation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>An efficient strategy for pre-training language models is the concatenation of contiguous sequences of text of fixed length combined with causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To fill this gap, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous tokens in the same document or in ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy aligned with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and requires fewer instances.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. <ref type="bibr" target="#b0">[1]</ref>, in multilingual settings, and later Zhao et al. <ref type="bibr" target="#b1">[2]</ref> highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of pre-training sequence heterogeneity on cross-lingual generalisation, potentially leading to significant advancements in understanding LLMs' learning properties.</p><p>In decoder-only pre-training, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous texts) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking <ref type="bibr" target="#b2">[3]</ref>, where the likelihood of each token is conditioned on the previous tokens from the same document, and (ii) retrieval-based masking <ref type="bibr" target="#b1">[2]</ref>, where similar documents retrieved by a retrieval system condition the likelihood.</p><p>To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking with different packing approaches, varying the amount and composition of the documents in the pre-training chunks. 
To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian-English). Complementing the foundational approaches proposed in <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, we operate on bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents ( §2.1).</p><p>The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training ( §3), negatively impacting the performance of the models in downstream tasks ( §4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models' performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the ability of sequence composition to mix documents from different corpora (in our case, in different languages as well). As also revealed by Zhao et al. <ref type="bibr" target="#b1">[2]</ref>, this is partly solved by UniChunk's avoidance of packing documents from different distributions, which improves the performance of causal masking models in downstream tasks but still does not allow individual sequences to be selected. 
Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving cross-lingual models' language modelling, in-context learning and generative capabilities while using causal masking, thus paying a small cost for document sorting but achieving tangible results.</p><p>Our main findings can be summarised as follows: • By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns of documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences, guaranteeing heterogeneity and reducing data requirements ( §3). • We show important benefits for the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) ( §4). • In conclusion, we show that the retrieval-based packing method, by allowing a flexible sequence composition process, yields tangible benefits for unbalanced cross-lingual learning while using less pre-training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Pre-Training Strategies</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Packing Approaches</head><p>Let 𝒟𝑠 represent a corpus, and let 𝒟 = ⋃︀ 𝑠 𝒟𝑠 denote the union of such corpora. Specifically, each corpus 𝒟𝑠 is a set of documents 𝒟𝑠 = {𝑑1, . . . , 𝑑 |𝒟𝑠| }, where each 𝑑𝑖 is defined as a sequence of tokens 𝑑𝑖 = (𝑥1, . . . , 𝑥 |𝑑 𝑖 | ). The packing strategy involves first selecting a set of documents {𝑑𝑖} 𝑛 𝑖=1 from 𝒟, and then packing them into a chunk 𝐶 with a fixed length |𝐶| = 𝐿. The documents {𝑑𝑖} 𝑛 𝑖=1 are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, 𝐶 is denoted as:</p><formula xml:id="formula_0">𝐶 = {𝑑𝑖 ⊕ [eos] | 𝑖 = 1 . . . 𝑛 − 1} ⊕ s(𝑑𝑛), (1)</formula><p>where [eos] is the end-of-sentence token, s() truncates the last document such that |𝐶| = 𝐿, and the content of the chunk 𝐶 is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.</p><p>Following the strategies proposed in <ref type="bibr" target="#b1">[2]</ref>, we use three strategies to sample the documents {𝑑𝑖} 𝑛 𝑖=1 from the dataset 𝒟 for composing pre-training chunks.</p><p>In contrast to previous works, we use 𝛼 ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒮 ⊆ 𝒟 with |𝒮| = ⌊𝛼 × |𝒟|⌋.</p><p>We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:</p><p>Baseline The common baseline approach, called RandomChunk, in which documents 𝑑𝑖 ∈ 𝒟 are sampled uniformly at random from the entire pre-training corpus 𝒟:</p><formula xml:id="formula_1">(𝒟, 𝛼) = {︃ 𝑛 ⨁︁ 𝑖=1 𝑑𝑖 ⊕ [eos] | 𝑑𝑖 ∼ Uniform(𝒮) }︃ (2)</formula><p>where 𝒮 ⊆ 𝒟 and |𝒮| = ⌊𝛼 × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure <ref type="figure">1</ref>.</p></div>
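The packing procedure in Eqs. (1)–(2) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: documents are assumed to be lists of tokens, and the function name and the literal "[eos]" string are hypothetical.

```python
import random

EOS = "[eos]"

def pack_random_chunk(corpus, chunk_len, alpha, rng=None):
    """RandomChunk: pack uniformly sampled documents into one fixed-length chunk.

    corpus: list of documents, each a list of tokens.
    alpha:  fraction of the corpus available for sampling (|S| = floor(alpha * |D|)).
    """
    rng = rng or random.Random(0)
    pool = rng.sample(corpus, int(alpha * len(corpus)))  # S, a subset of D
    chunk = []
    while pool and len(chunk) < chunk_len:
        doc = pool.pop(rng.randrange(len(pool)))  # d_i ~ Uniform(S)
        chunk.extend(doc + [EOS])                 # d_i followed by [eos]
    return chunk[:chunk_len]                      # s(): truncate so that |C| = L
```

Sampling without replacement from the pool mirrors the paper's removal of packed content from 𝒟.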
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sequence-based</head><p>The UniChunk approach is sequence-based and respects the sequences of the corpora. Hence, each chunk is composed of documents from a single source corpus 𝒟𝑠:</p><formula xml:id="formula_2">(𝒟𝑠, 𝛼) = {︃ 𝑛 ⨁︁ 𝑖=1 𝑑𝑖 ⊕ [eos] | 𝑑𝑖 ∼ Uniform(𝒮𝑠) }︃<label>(3)</label></formula><p>where 𝒮𝑠 ⊆ 𝒟𝑠, |𝒮𝑠| = ⌊𝛼 × |𝒟𝑠|⌋ and 𝒟𝑠 ⊆ 𝒟. This strategy avoids packing documents from different corpora and allows control over the amount of data utilised from each specific corpus, enhancing efficient usage of computational resources while preserving thematic coherence.</p></div>
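UniChunk differs from the baseline only in that sampling in Eq. (3) is restricted to a single source corpus 𝒟𝑠. A minimal sketch under the same assumptions as before (token-list documents; hypothetical names):

```python
import random

EOS = "[eos]"

def pack_uni_chunk(corpora, chunk_len, alpha, rng=None):
    """UniChunk: choose one source corpus D_s, then pack only its documents."""
    rng = rng or random.Random(0)
    source = rng.choice(sorted(corpora))             # one source per chunk
    docs = corpora[source]
    pool = rng.sample(docs, int(alpha * len(docs)))  # S_s, a subset of D_s
    chunk = []
    while pool and len(chunk) < chunk_len:
        chunk.extend(pool.pop(rng.randrange(len(pool))) + [EOS])
    return source, chunk[:chunk_len]                 # truncate so that |C| = L
```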
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ranking-based</head><p>To increase the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based <ref type="bibr" target="#b3">[4]</ref>) to construct pre-training chunks, which we call Bm25Chunk. Hence, given a document 𝑑𝑖 ∈ 𝒟𝑠, a sequence of documents {𝑑𝑖} 𝑛 𝑖=1 is retrieved via 𝑑𝑖+1 = Retrieve(𝑑𝑖, 𝒟𝑠); here, Retrieve(𝑑𝑖, 𝒟𝑠) collects the most similar documents to 𝑑𝑖 from 𝒟𝑠 using BM25 ranking.</p><p>However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟𝑠. To improve the efficiency of the retrieval step, a subset ℬ𝑠 ⊆ 𝒟𝑠 of the corpus 𝒟𝑠 is used, reducing the computational complexity of retrieval as proposed in <ref type="bibr" target="#b1">[2]</ref>.</p><p>In particular, ℬ𝑠 ⊆ 𝒟𝑠 contains 𝑘 documents uniformly sampled from 𝒟𝑠. To control the number of utilised documents, we operate via 𝛼, which regulates the fraction of 𝑘. Hence, we use ℬ𝛼 ⊆ ℬ𝑠 where |ℬ𝛼| = ⌊𝛼 × |ℬ𝑠|⌋.</p><p>This buffer serves as the retrieval source for constructing pre-training chunks:</p><p>𝑑1 ∼ Uniform(ℬ𝑠), 𝑑𝑖+1 = Retrieve(𝑑𝑖, ℬ𝛼).</p><p>After retrieving a sequence of documents {𝑑𝑖} 𝑛 𝑖=1 from ℬ𝛼 to construct a chunk, the buffer is refilled by sampling novel documents from 𝒟𝑠.</p></div>
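The retrieval chain 𝑑𝑖+1 = Retrieve(𝑑𝑖, ℬ𝛼) can be sketched as follows. For illustration, candidates are scored with a plain BM25 implementation over tokenised documents rather than an optimised retrieval system, the previous document itself acts as the query, and all names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenised doc against the query with the standard BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_chain(buffer, n):
    """d_1 from the buffer, then d_{i+1} = argmax BM25(d_i, buffer) among unused docs."""
    used = [0]  # index of d_1 (uniformly sampled upstream; here simply the first doc)
    for _ in range(n - 1):
        scores = bm25_scores(buffer[used[-1]], buffer)
        best = max((i for i in range(len(buffer)) if i not in used),
                   key=lambda i: scores[i])
        used.append(best)
    return [buffer[i] for i in used]
```

The retrieved sequence would then be packed exactly as in Eq. (1).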
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Masking Approaches</head><p>The masking strategy is the other critical stage of language model pre-training, defining how next-token prediction distributions are conditioned on preceding tokens in a given sequence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Causal Masking</head><p>In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk 𝐶 = (𝑥1, . . . , 𝑥 |𝐶| ), the likelihood of 𝐶 is given by:</p><formula xml:id="formula_3">𝑃 (𝐶) = |𝐶| ∏︁ 𝑖=1 𝑃 (𝑥𝑖 | 𝑥1, . . . , 𝑥𝑖−1),</formula><p>where 𝑃 (𝑥𝑖 | 𝑥1, . . . , 𝑥𝑖−1) is the probability of the token 𝑥𝑖 given the previous tokens 𝑥1, . . . , 𝑥𝑖−1 in the chunk. During pre-training, causal masking means that, given a chunk 𝐶, the likelihood of each token in 𝐶 is conditioned on all previous tokens, including those that stem from different documents.</p></div>
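In implementations, this conditioning is usually realised as a lower-triangular attention mask. A minimal sketch with plain Python lists rather than tensors (the function name is hypothetical):

```python
def causal_mask(seq_len):
    """Standard causal mask: position i may attend to every position j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

Row 𝑖 marks the tokens that condition the prediction at position 𝑖, regardless of document boundaries inside the chunk.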
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intra-Document Causal Masking</head><p>In intra-document causal masking, the probability of each token is conditioned on the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒮 ⊆ 𝒟 where |𝒮| = ⌊𝛼 × |𝒟|⌋, we construct the chunks 𝐶 as defined in §2.1. The probability of each token 𝑑𝑖𝑗 belonging to document 𝑑𝑖 is only conditioned on the previous tokens within 𝑑𝑖:</p><formula xml:id="formula_4">𝑃 (𝐶) = 𝑛 ∏︁ 𝑖=1 |𝑑 𝑖 | ∏︁ 𝑗 𝑃 (︀ 𝑑𝑖𝑗 | 𝑑𝑖1, . . . , 𝑑 𝑖(𝑗−1) )︀ ,<label>(4)</label></formula><p>where each 𝑑𝑖 is sampled from 𝐶 as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.</p></div>
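Intra-document masking additionally zeroes attention across document boundaries, making the causal mask block-diagonal. A sketch under the same assumptions as above (per-position document indices; hypothetical names):

```python
def intra_doc_mask(doc_ids):
    """Block-diagonal causal mask: token i attends to j <= i only when
    positions i and j belong to the same document in the packed chunk.

    doc_ids: document index of each position, e.g. [0, 0, 1, 1].
    """
    n = len(doc_ids)
    return [[1 if j <= i and doc_ids[i] == doc_ids[j] else 0
             for j in range(n)]
            for i in range(n)]
```

Compared with the plain causal mask, each document's tokens see only their own prefix, matching Eq. (4).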
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Language Modeling Settings</head><p>Models The implementation is based on GPT-2 <ref type="bibr" target="#b4">[5]</ref>. We pre-train 124-million-parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and model parameters described in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corpora &amp; Settings</head><p>We combine three high-quality open-source corpora: C4, CulturaX, and Wikipedia. We construct the corpus 𝒟 by operating through the methods proposed in §2 on both 𝒟𝐸𝑛 and 𝒟𝐼𝑡, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor 𝛼 that operates during the construction of 𝒟𝐸𝑛 and 𝒟𝐼𝑡.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>To analyse the behaviour of the proposed approaches, we evaluate the models' perplexities ( §4.1), in-context learning ( §4.2), understanding ( §4.3) and question-answering capabilities ( §4.4) under different configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Perplexity</head><p>We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents where each document is treated independently; (ii) models pre-trained with a quantity of data scaled by a factor 𝛼 ∈ {0.1, 0.25, 0.5, 0.75} and then evaluated on the same kind of held-out set. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination <ref type="bibr" target="#b5">[6]</ref>), the second experiment analyses the impact of the amount of data used.</p><p>The impact of Sequence Composition Table <ref type="table" target="#tab_1">1</ref> shows that Bm25Chunk achieves the lowest PPL among the three causal masking models, yielding a lower average PPL compared to RandomChunk (by more than about 5 points in both settings) and UniChunk (by around 3.2 points in both settings). Increasing the correlation of documents in a sequence improves the language modelling ability of the pre-trained models. When considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL compared to the models trained via causal masking. Generally, all methods obtain significantly lower PPLs on Wikipedia (particularly Bm25Chunk and IntraDoc). This phenomenon could imply that when the pre-training sources are very common (a lower PPL indicates better-known text), these texts are more influenced by documents with different contexts (misleading contexts), and the proposed strategies mitigate this problem.</p><p>The role of Quantity Figure <ref type="figure" target="#fig_0">2</ref> shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when decreasing the amount of pre-training data. 
In fact, in both settings (Figure <ref type="figure" target="#fig_0">2</ref>), it can be observed that the average PPL of RandomChunk and UniChunk decreases as the amount of pre-training data used increases. While intra-document causal masking performs similarly to Bm25Chunk in resource-based settings (red line and green line in Figure <ref type="figure" target="#fig_0">2</ref>), increasing 𝛼 for intra-document causal masking reduces the PPL less consistently. Finally, it can be observed that Bm25Chunk reaches stable performance even with 𝛼 = 0.75. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">In-Context Learning</head><p>Following Zhao et al. <ref type="bibr" target="#b1">[2]</ref>, we evaluate the in-context learning abilities of the models using GLUE-X <ref type="bibr" target="#b6">[7]</ref> (SST2, CoLA and RTE) both in English and Italian.</p><p>Table <ref type="table" target="#tab_3">2</ref> reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models' in-context learning abilities.</p><p>In Figure <ref type="figure" target="#fig_1">3</ref>, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk achieves accuracy on par with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results using different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the greater levels of distraction in both settings, which use arbitrary packing strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Evaluation results of natural language understanding, commonsense reasoning and QA tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Understanding &amp; Commonsense</head><p>We evaluate the pre-trained models on natural language understanding, commonsense reasoning tasks (i.e., XSQuAD <ref type="bibr" target="#b7">[8]</ref>, XCOPA <ref type="bibr" target="#b8">[9]</ref>), and question-answering (i.e., MLQA <ref type="bibr" target="#b9">[10]</ref>).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Multilinguality</head><p>To assess code-switching abilities, we experimented with cross-lingual input by operating with MLQA. We crossed the languages, delivering contexts in English and questions in Italian, and vice versa (Appendix C). Figure <ref type="figure" target="#fig_2">4</ref> shows that Bm25Chunk outperforms both RandomChunk and intra-document causal masking. At the same time, IntraDoc, as discussed in §4.3 for MLQA, outperforms Bm25Chunk. This result confirms that IntraDoc's performance is not only related to monolingual learning sequences but also to more complex dynamics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The role of pre-training sampling is strategic. We analyse the impact of sequencing by pre-training several language models on multilingual corpora. We showed that causal masking involves misleading documents that confound the pre-training of language models and impact the performance in downstream tasks. Hence, we find that improving sequence correlation in pre-training chunks reduces potential distractions while improving the performance of language models without reducing pre-training efficiency. In the future, we will study whether these findings achieve benefits in fine-tuning pipelines <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> as well.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Average perplexities when decreasing the training set.</figDesc><graphic coords="4,333.12,268.73,142.35,85.41" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Average in-context learning accuracy using different numbers of input demonstrations.</figDesc><graphic coords="5,89.29,242.52,203.36,101.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Evaluation results of MultiLingual Question Answering by providing cross-lingual input (en-it means context in English and question in Italian and vice versa as described in Appendix C).</figDesc><graphic coords="5,317.87,194.10,172.85,86.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Evaluation of perplexity on test set created by sampling the original pre-training corpora (Appendix D).</figDesc><table><row><cell>𝐿</cell><cell>Model</cell><cell>C4</cell><cell cols="2">CulturaX Wiki</cell><cell>Avg.</cell></row><row><cell></cell><cell>RandomChunk</cell><cell>20.12</cell><cell>19.61</cell><cell>9.89</cell><cell>16.5</cell></row><row><cell>256</cell><cell>UniChunk Bm25Chunk</cell><cell>18.83 14.96</cell><cell>15.65 15.07</cell><cell>8.56 5.23</cell><cell>14.3 11.4</cell></row><row><cell></cell><cell>IntraDoc</cell><cell>14.04</cell><cell>13.57</cell><cell>5.08</cell><cell>10.7</cell></row><row><cell></cell><cell>RandomChunk</cell><cell>19.32</cell><cell>18.76</cell><cell>9.55</cell><cell>15.9</cell></row><row><cell>512</cell><cell>UniChunk Bm25Chunk</cell><cell>18.22 13.85</cell><cell>15.11 13.27</cell><cell>7.89 5.02</cell><cell>13.4 10.7</cell></row><row><cell></cell><cell>IntraDoc</cell><cell>12.98</cell><cell>13.07</cell><cell>4.39</cell><cell>10.0</cell></row></table><note>𝐿 is the context window for pre-training (next-token accuracy in Appendix B).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Average In-context learning performance evaluated by text classification accuracy across three tasks. Accuracies for English and Italian are reported in Appendix E.</figDesc><table /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing it can operate on in-context information provided in the input question. However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately improves understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, adding distraction in context during pre-training, benefits in-context learning and understanding ability.</p></div>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Pre-training Setup</head><p>In our experiments, we use GPT-2 small, a 124-million-parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the 256 and 512 context-window models and pre-train the models on 20B tokens for 100,000 steps. We use the Adam optimiser with 𝛽 1 = 0.90, 𝛽 2 = 0.95, a weight decay of 0.1, and a cosine learning rate scheduler. The peak learning rate is 3 × 10 −4 , decreasing to 3 × 10 −5 at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Subset</head><p># documents # words</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Size of pre-training corpora. For computational reasons, we produced equivalent samples for both English and Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Next Token Accuracy of Pre-Trained Language Models</head><p>In addition to PPL, we report the next-token accuracy of pre-trained language models in Table <ref type="table">5</ref>.</p><p>Specifically, we define Acc as:</p><formula>Acc = (1/𝑁) ∑︁ 𝑁 𝑖=1 I(𝑦 ^𝑖 = 𝑦 𝑖),</formula><p>where:</p><p>• 𝑁 is the total number of tokens in the test set.</p><p>• 𝑦 ^𝑖 is the token predicted by the model at position 𝑖.</p><p>• 𝑦 𝑖 is the correct (ground truth) token at position 𝑖.</p><p>• I is the indicator function, which is 1 if 𝑦 ^𝑖 = 𝑦 𝑖 and 0 otherwise. </p></div>
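This definition amounts to the mean of per-position matches between predicted and ground-truth tokens; a minimal sketch (the function name is hypothetical):

```python
def next_token_accuracy(pred, gold):
    """Acc = (1/N) * sum_i I(pred_i == gold_i) over the N test-set positions."""
    assert len(pred) == len(gold) and gold, "sequences must be non-empty and aligned"
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)
```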
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝐿</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. In-context Learning Performance in English and Italian</head><p>This section reports the results obtained on the tasks introduced in Section 4.2. To conduct a more detailed analysis, we have used the original (English) and Italian versions of three tasks belonging to the GLUE family. We selected SST2, CoLA, and RTE. The bilingual versions were taken from the contribution previously proposed by Yang et al. <ref type="bibr" target="#b6">[7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 8</head><p>In-context learning performance evaluated by text classification accuracy across three Italian tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Understanding and Commonsense Performance in English and Italian</head><p>This section reports the results obtained on the tasks introduced in Section 4.3. We have used the original (English) and Italian versions of MLQA, XCOPA, and SQuAD to conduct a more detailed analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 10</head><p>Evaluation results of natural language understanding, commonsense reasoning and QA tasks in Italian.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Modeling easiness for training transformers with curriculum learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.ranlp-1.101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Mitkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Angelova</surname></persName>
		</editor>
		<meeting>the 14th International Conference on Recent Advances in Natural Language Processing<address><addrLine>Shoumen, Bulgaria, Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>INCOMA Ltd</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="937" to="948" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Analysing the impact of sequence composition on language model pre-training</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Staniszewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tworkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Miłoś</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Minervini</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.13991" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.10638</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:264172290" />
		<title level="m">In-context pretraining: Language modeling beyond document boundaries</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The probabilistic relevance framework: BM25 and beyond</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
		<idno type="DOI">10.1561/1500000019</idno>
		<ptr target="https://doi.org/10.1561/1500000019" />
	</analytic>
	<monogr>
		<title level="j">Found. Trends Inf. Retr.</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="333" to="389" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Investigating the impact of data contamination of large language models in text-to-SQL translation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Onorati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giannone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Favalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Romagnoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-acl.827</idno>
		<ptr target="https://aclanthology.org/2024.findings-acl.827" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2024</title>
		<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="13909" to="13920" />
		</imprint>
	</monogr>
	<note>and virtual meeting</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.806</idno>
		<ptr target="https://aclanthology.org/2023.findings-acl.806" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="12731" to="12750" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">SQuAD: 100,000+ questions for machine comprehension of text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/D16-1264</idno>
		<ptr target="https://doi.org/10.18653/v1/D16-1264" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Carreras</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<meeting>the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016<address><addrLine>Austin, Texas, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">November 1-4, 2016</date>
			<biblScope unit="page" from="2383" to="2392" />
		</imprint>
	</monogr>
	<note>The Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">XCOPA: A multilingual dataset for causal commonsense reasoning</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Ponti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Glavaš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Majewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.185</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.185" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
		<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2362" to="2376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">MLQA: Evaluating cross-lingual extractive question answering</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oguz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rinott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.653</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.653" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7315" to="7330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Knowing knowledge: Epistemological study of knowledge in transformers</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="DOI">10.3390/app13020677</idno>
		<ptr target="https://www.mdpi.com/2076-3417/13/2/677" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Does the English matter? Elicit cross-lingual abilities of large language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.mrl-1.14</idno>
		<ptr target="https://aclanthology.org/2023.mrl-1.14" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Ataman</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Multi-lingual Representation Learning (MRL)<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="173" to="183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A tree-of-thoughts to broaden multi-step reasoning across languages</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-naacl.78</idno>
		<ptr target="https://aclanthology.org/2024.findings-naacl.78" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: NAACL 2024</title>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gomez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<meeting><address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1229" to="1241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Does the language matter? curriculum learning over neo-Latin languages</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.464" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="5212" to="5220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Aligning large and small language models via chain-of-thought reasoning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.eacl-long.109" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Purver</surname></persName>
		</editor>
		<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>St. Julian&apos;s, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1812" to="1827" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Self-refine instruction-tuning for aligning reasoning in language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.emnlp-main.139" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Al-Onaizan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y.-N</forename><surname>Chen</surname></persName>
		</editor>
		<meeting>the 2024 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Miami, Florida, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="2325" to="2347" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
