<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How far does the sequence of compositions impact Multilingual Pre-Training?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, University of Aberdeen</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi Roma "Tor Vergata"</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An efficient strategy for conducting the pre-training of language models is the concatenation of contiguous sequences of text of fixed length through causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To address this issue, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous ones in the same document or ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy that aligns with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and allows for fewer instances.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Pre-training Methods</kwd>
        <kwd>Cross-lingual Generalisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. [1] in multilingual settings and later Zhao et al. [2] highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of the heterogeneity of pre-training sequences on cross-lingual generalisation. This potentially leads to significant advancements in understanding LLMs’ learning properties.</p>
      <p>In the pre-training of decoder-only architectures, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous text) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking [3], where the likelihood of each token is conditioned only on the previous tokens from the same document, and (ii) retrieval-based masking [2], where similar documents retrieved by a retrieval system condition the likelihood.</p>
      <p>To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking with different packing approaches, varying the sequence composition of the documents in the pre-training chunks. To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian and English). Complementing the foundation approaches proposed in [1, 2], we operate via bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents (§2.1).</p>
      <p>The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training (§3), negatively impacting the performance of the models in downstream tasks (§4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models’ performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the operability of sequence composition (this is the case in different languages as well). As revealed by Zhao et al. [2] as well, this is partly solved by UniChunk’s avoidance of packing documents from different distributions, which improves the performance of causal-masking models in downstream tasks but still does not allow individual sequences to be selected.</p>
      <p>Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving the models’ cross-lingual language modelling, in-context learning and generative capabilities while still using causal masking, thus paying a small fee for document sorting but achieving tangible results.</p>
      <p>Our main findings can be summarised as follows:</p>
      <p>• By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns represented in documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences by guaranteeing heterogeneity and reducing data (§3).</p>
      <p>• We show important benefits on the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) (§4).</p>
      <p>• In conclusion, we show that the retrieval-based packing method, allowing for a flexible sequence composition process, brings tangible benefits to unbalanced cross-lingual learning while using less pre-training data.</p>
      <p>[Figure 1: Overview of the packing strategies over the C4, Wikipedia and CulturaX corpora: baseline chunking packs randomly sampled documents, sequence-based chunking packs documents from a single source, and retrieve-based chunking starts from a document (e.g., doc-1) and retrieves related documents (e.g., doc-2) through an index collector built over the corpora.]</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Pre-Training Strategies</title>
      <sec id="sec-1a-1">
        <title>2.1. Packing Approaches</title>
        <p>Let D_i represent a corpus, and let 𝒟 = ⋃_i D_i denote the dataset resulting from the union of such corpora. Specifically, each corpus D_i is a set of documents D_i = {d_1, …, d_|D_i|}, where each d_j is defined as a sequence of tokens (w_1, …, w_|d_j|).</p>
        <p>The packing strategy involves first selecting a set of documents {d_i}, i = 1 … k, from 𝒟, and then packing them into a chunk C with a fixed length |C| = L. The documents are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, C is denoted as:</p>
        <p>C = {d_i ⊕ [eos] | i = 1 … k − 1} ⊕ s(d_k),   (1)</p>
        <p>where [eos] is the end-of-sentence token, s(d_k) truncates the last document such that |C| = L, and the content of the chunk C is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.</p>
        <p>Following the strategies proposed in [2], we use three strategies to sample the documents {d_i}, i = 1 … k, from the dataset 𝒟 for composing the pre-training chunks. In contrast to previous works, we use α ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒟_α ⊆ 𝒟 with |𝒟_α| = ⌊α × |𝒟|⌋.</p>
        <p>We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:</p>
        <p>Baseline The common baseline approach, called RandomChunk, samples documents d_i ∈ 𝒟 uniformly at random from the entire pre-training corpus:</p>
        <p>C(𝒟, k) = { ⨁_{i=1…k} d_i ⊕ [eos] | d_i ∼ Uniform(𝒟_α) },   (2)</p>
        <p>where 𝒟_α ⊆ 𝒟 and |𝒟_α| = ⌊α × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure 1.</p>
      <p>Causal Masking In causal masking, each token in a
sequence is predicted based on all previous tokens.
Specifically, given a chunk  = (1, . . . , ||), the likelihood
(2) of  is given by:</p>
      <p>||
 () = ∏︁  ( | 1, . . . , − 1),
=1
where  ⊆  and || = ⌊ × ||⌋ . As a result, in
RandomChunk, a chunk can contain documents from a
diferent source, as shown in Figure 1.
where  ( | 1, . . . , − 1) is the probability of the
token  given previous tokens 1, . . . , − 1 in the chunk.</p>
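        <p>For illustration, the sketch below (not from the paper; subsample, random_chunk_docs and uni_chunk_docs are hypothetical names) contrasts the two sampling policies under the fraction α:</p>
        <preformat>
# Illustrative sketch (not the authors' code) of the two sampling policies:
# RandomChunk samples from the union of corpora, UniChunk from a single
# source corpus; `alpha` keeps only the data fraction defined in the text.
import random

def subsample(corpus, alpha):
    k = int(alpha * len(corpus))           # |D_alpha| = floor(alpha * |D|)
    return random.sample(corpus, k)

def random_chunk_docs(corpora, alpha, k):
    pool = subsample([d for c in corpora for d in c], alpha)  # union of all corpora
    return random.sample(pool, k)          # documents may come from different sources

def uni_chunk_docs(corpora, alpha, k):
    source = random.choice(corpora)        # a single source corpus D_i
    pool = subsample(source, alpha)
    return random.sample(pool, k)          # documents share one origin
</preformat>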
        <p>Ranking-based To empower the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based [4]) to construct pre-training chunks, which we define as Bm25Chunk. Hence, given a document d_1 ∈ 𝒟, a sequence of documents {d_i}, i = 1 … k, is retrieved by d_{i+1} = Retrieve(d_i, 𝒟); here, Retrieve(d_i, 𝒟) collects the most similar documents to d_i from 𝒟 using BM25 ranking.</p>
        <p>However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟. To improve the efficiency of the retrieval step, a subset ℬ ⊆ 𝒟 of the corpus is used, reducing the computational complexity of retrieval as proposed in [2]. In particular, ℬ contains a fixed number of documents uniformly sampled from 𝒟. To control the number of utilised documents, we operate via α, which regulates the fraction of ℬ. Hence, we use ℬ_α ⊆ ℬ where |ℬ_α| = ⌊α × |ℬ|⌋. This approach strategically serves as the retrieval source for constructing pre-training chunks:</p>
        <p>d_1 ∼ Uniform(ℬ_α),   d_{i+1} = Retrieve(d_i, ℬ_α).</p>
        <p>After retrieving a sequence of documents {d_i}, i = 1 … k, from ℬ_α for constructing a chunk, the buffer is refilled by sampling novel documents from 𝒟.</p>
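        <p>The paper does not provide an implementation; the minimal Python sketch below, assuming the third-party rank_bm25 package as a stand-in BM25 index (bm25_chunk_order and the whitespace tokenisation are illustrative), shows how such a retrieval chain over a buffer could be realised:</p>
        <preformat>
# Illustrative sketch of Bm25Chunk-style retrieval packing (not the authors'
# implementation): starting from a seed document, repeatedly retrieve the most
# similar unused document with BM25 and return the resulting document order.
# Assumes k does not exceed the buffer size.
from rank_bm25 import BM25Okapi

def bm25_chunk_order(corpus_texts, k):
    tokenised = [doc.split() for doc in corpus_texts]      # naive whitespace tokenisation
    index = BM25Okapi(tokenised)                           # BM25 index over the buffer
    used = {0}
    order = [0]                                            # seed document d_1
    while len(order) != k:
        scores = index.get_scores(tokenised[order[-1]])    # d_{i+1} = Retrieve(d_i, .)
        best = max((j for j in range(len(scores)) if j not in used),
                   key=lambda j: scores[j])
        used.add(best)
        order.append(best)
    return order                                           # document order for one chunk
</preformat>
        <p>The returned order can then be fed to a chunk constructor such as the pack_chunk sketch above.</p>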
      </sec>
      <sec id="sec-1a-2">
        <title>2.2. Masking Approaches</title>
        <p>The masking strategy is the other critical stage of language model pre-training, which defines how next-token prediction distributions are conditioned on the previous tokens in a provided sequence.</p>
        <p>Causal Masking In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk C = (w_1, …, w_|C|), the likelihood of C is given by:</p>
        <p>P(C) = ∏_{t=1…|C|} P(w_t | w_1, …, w_{t−1}),</p>
        <p>where P(w_t | w_1, …, w_{t−1}) is the probability of the token w_t given the previous tokens w_1, …, w_{t−1} in the chunk. During pre-training, causal masking indicates that, given a chunk C, the likelihood of each token in C is conditioned on all previous tokens, including those that stem from different documents.</p>
        <p>Intra-Document Causal Masking In intra-document causal masking, the probability of each token is influenced only by the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒟_α ⊆ 𝒟 where |𝒟_α| = ⌊α × |𝒟|⌋, we construct the chunks C as defined in §2.1. The probability of each token w_t belonging to document d_i is only conditioned on the previous tokens within d_i:</p>
        <p>P(C) = ∏_{i=1…k} ∏_{t=1…|d_i|} P(w_t^(i) | w_1^(i), …, w_{t−1}^(i)),   (4)</p>
        <p>where each d_i is sampled from 𝒟_α as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.</p>
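        <p>As an illustration of Equation (4) (a sketch, not the authors' code; intra_document_causal_mask and doc_ids are hypothetical names), the Boolean mask below allows a token to attend only to earlier positions of the same document; dropping the document-identity condition recovers the standard causal mask described above:</p>
        <preformat>
# Illustrative sketch of the attention mask behind Eq. (4): position t may
# attend to position s only if s comes no later than t AND both positions
# belong to the same document within the packed chunk.
def intra_document_causal_mask(doc_ids):
    # doc_ids[t] is the index of the document that token t belongs to,
    # e.g. [0, 0, 0, 1, 1, 2] for a chunk packing three documents.
    n = len(doc_ids)
    allowed = [[False] * n for _ in range(n)]
    for t in range(n):
        for s in range(t + 1):                        # causal: s comes no later than t
            allowed[t][s] = doc_ids[s] == doc_ids[t]  # intra-document constraint
    return allowed
</preformat>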
      </sec>
    </sec>
    <sec id="sec-1b">
      <title>3. Language Modeling Settings</title>
      <p>Models The implementation is based on GPT-2 [5]. We pre-train 124-million-parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and the model parameters described in Appendix A.</p>
      <p>Corpora &amp; Settings We combine three high-quality open-source corpora (statistics are reported in Table 4): C4, CulturaX, and Wikipedia. We construct the corpus 𝒟 by operating through the methods proposed in §2 on both the English and the Italian corpora, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor α that operates during the construction of both corpora.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <sec id="sec-2-1">
        <title>2.2. Masking Approaches</title>
        <p>To analyse the operation of proposed approaches, we
The masking strategy is the other critical stage of lan- evaluate the model perplexities (§4.1), in-context
learnguage model pre-training, which defines how next-token ing (§4.2), understanding (§4.3) and question-answering
prediction distributions are conditioned on further to- capabilities (§4.4) under diferent configurations.
kens in a provided sequence. 1The statistics are reported in Table 4</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Perplexity</title>
        <p>We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents, where each document is treated independently; (ii) models pre-trained with an equal quantity of data scaled by a factor α, with α in {0.1, 0.25, 0.5, 0.75}, and then evaluated on the same kind of held-out set. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination [6]), the second experiment analyses the impact of the amount of data used.</p>
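        <p>For clarity, a minimal sketch of how the per-document perplexity can be computed (illustrative; document_perplexity, average_ppl and score_fn are hypothetical names, not from the paper):</p>
        <preformat>
# Illustrative sketch of per-document perplexity: PPL is the exponential of
# the mean negative log-likelihood over the tokens of a document.
import math

def document_perplexity(logprobs):
    # logprobs: log-probability assigned by the model to each token of one document
    nll = -sum(logprobs) / len(logprobs)     # mean negative log-likelihood
    return math.exp(nll)

def average_ppl(heldout_docs, score_fn):
    # score_fn(doc) is a hypothetical helper returning per-token log-probabilities;
    # each held-out document is treated independently, as described above.
    ppls = [document_perplexity(score_fn(doc)) for doc in heldout_docs]
    return sum(ppls) / len(ppls)
</preformat>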
        <p>The impact of Sequence Composition Table 1 shows that Bm25Chunk achieves the lowest PPL among the three causal-masking models, yielding a lower average PPL than RandomChunk (by more than about 5 points in both settings) and UniChunk (by around 3.2 points in both settings). Increasing the correlation of documents in a sequence empowers the language modelling ability of the pre-trained models. Moreover, when considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL compared to the models trained via causal masking.</p>
        <p>[Table 1: average perplexity (PPL) of RandomChunk, UniChunk, Bm25Chunk and IntraDoc for the 256 and 512 context windows.]</p>
        <p>The role of Quantity Figure 2 shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when the amount of pre-training data is decreased. In fact, in both settings (Figure 2), the average PPL of RandomChunk and UniChunk decreases as the amount of pre-training data used grows. While intra-document causal masking performs similarly to Bm25Chunk in low-resource settings (red and green lines in Figure 2), increasing α for intra-document causal masking reduces the PPL less consistently. Finally, Bm25Chunk reaches stable performance even with α = 0.75.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. In-Context Learning</title>
        <p>Following Zhao et al. [2], we evaluate the in-context
learning abilities of the models using GLUE-X [7] (SST2,
CoLA and RTE) both in English and Italian.</p>
        <p>Table 2 reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 demonstrations for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models’ in-context learning abilities.</p>
        <p>In Figure 3, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk has on-par accuracy with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results across the different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the higher levels of distraction in both settings, which use arbitrary packing strategies.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Understanding &amp; Commonsense</title>
        <p>We evaluate the pre-trained models on natural language understanding and commonsense reasoning tasks (i.e., XSQuAD [8], XCOPA [9]), and on question-answering (i.e., MLQA [10]). It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing that it can operate on the in-context information provided in the input question.</p>
        <p>However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately empowers understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, and thereby adding distraction during pre-training, benefits in-context and understanding abilities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>The role of pre-training sampling is a strategic
component. We analyse the impact of sequencing by
pre4.3. Understanding &amp; Commonsense training several language models on multilingual corpora.
We evaluate the pre-trained models on natural lan- We showed that causal masking involves misleading
docguage understanding, commonsense reasoning tasks (i.e., uments that confound the pre-training of language
modXSQuAD [8], XCOPA [9]), and question-answering (i.e., els and impact the performance in downstream tasks.
MLQA [10]). It emerges that Bm25Chunk outperforms Hence, we find that improving sequence correlation in
RandomChunk and UniChunk in all tasks, confirming that pre-training chunks reduces potential distractions while
increasing the similarity of documents in pre-training improving the performance of language models without
chunks improve understanding abilities. Specifically, reducing pre-training eficiency. In the future, we will
Bm25Chunk obtains a significantly better accuracy on study whether these findings archive benefits in
fineMLQA, showing it can operate in-context information tuning pipelines [11, 12, 13, 14, 15, 16] as well.
provided in the input question.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Pre-training Corpora</title>
      <p>In our experiments, we use GPT-2 small, the 124-million-parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the models with 256 and 512 context-window sizes and pre-train the models on 20B tokens for 100,000 steps. We use the Adam optimiser with β1 = 0.90, β2 = 0.95, a weight decay of 0.1, and a cosine learning-rate scheduler. The peak learning rate is 3 × 10⁻⁴, decreasing to 3 × 10⁻⁵ at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM each.</p>
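      <p>As an illustration, a possible instantiation of this setup with the Hugging Face transformers GPT-2 classes is sketched below; the framework, the warmup length and the vocabulary size are assumptions, since the paper does not specify them:</p>
      <preformat>
# Sketch of the pre-training configuration described above, using Hugging Face
# `transformers` GPT-2 classes as one possible realisation (assumption: the
# paper does not state the framework; warmup steps and vocab size are guesses).
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_cosine_schedule_with_warmup

config = GPT2Config(n_layer=12, n_embd=768, n_head=12, n_positions=256)  # or 512
model = GPT2LMHeadModel(config)                       # roughly 124M parameters

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
# Plain cosine decay to 0 is used here for simplicity; the paper decays to 3e-5.
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=1000,       # assumption
                                            num_training_steps=100_000)  # 100,000 steps
</preformat>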
      <p>[Table 4: approximate number of documents and words for each pre-training subset of C4, CulturaX and Wikipedia.]</p>
    </sec>
    <sec id="sec-5">
      <title>B. Next Token Accuracy of</title>
    </sec>
    <sec id="sec-6">
      <title>Pre-Trained Language Models</title>
      <p>In addition to PPL, we report the next-token accuracy of the pre-trained language models in Table 5. Specifically, we define Acc as:</p>
      <p>Acc = (1 / N) ∑_{t=1…N} I(ŷ_t = y_t),</p>
      <p>where ŷ_t is the token predicted by the model at position t, y_t is the correct (ground-truth) token at position t, and I is the indicator function, which is 1 if ŷ_t = y_t and 0 otherwise.</p>
      <p>[Table: example bilingual prompts used for the evaluation, e.g., "Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017." with target answer "Barack Obama".]</p>
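      <p>A minimal sketch of this metric (illustrative names, not the authors' code):</p>
      <preformat>
# Illustrative sketch of the next-token accuracy in Appendix B: the fraction
# of positions where the greedy prediction matches the ground-truth token.
def next_token_accuracy(predicted_ids, target_ids):
    assert len(predicted_ids) == len(target_ids)
    correct = sum(1 for p, y in zip(predicted_ids, target_ids) if p == y)
    return correct / len(target_ids)
</preformat>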
    </sec>
    <sec id="sec-7">
      <title>D. In-context Learning</title>
      <p>performances English and</p>
    </sec>
    <sec id="sec-8">
      <title>Italian</title>
      <p>This section reports the results obtained on the tasks
introduced in Section 4.2. To conduct a more detailed analysis,
we have used the original (English) and Italian versions of
three tasks belonging to the GLUE family. We selected SST2,
CoLA, and RTE. The bilingual versions were taken from the
contribution previously proposed by Yang et al. [7].</p>
      <p>[Tables: in-context learning accuracy on SST2, CoLA and RTE in English and Italian.]</p>
    </sec>
    <sec id="sec-9">
      <title>E. Understanding and</title>
    </sec>
    <sec id="sec-10">
      <title>Commonsense performances</title>
    </sec>
    <sec id="sec-11">
      <title>English and Italian</title>
      <p>This section reports the results obtained on the tasks
introduced in Section 4.3. We have used the original (English)
and Italian versions of MLQA, XCOPA, and SQuAD to
conduct a more detailed analysis.</p>
      <p>[Tables: accuracy on MLQA and XCOPA in English and Italian.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          (Eds.),
          <source>Findings of the Association for Com</source>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          , Modeling eas- putational
          <source>Linguistics: ACL</source>
          <year>2023</year>
          , Association
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          learning, in: R. Mitkov, G. Angelova (Eds.),
          <source>Pro- 2023</source>
          , pp.
          <fpage>12731</fpage>
          -
          <lpage>12750</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>ceedings of the 14th International Conference on org/2023.findings-acl.806. doi: 10</source>
          .18653/v1/
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>Recent Advances in Natural Language Processing, findings-acl.806.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>INCOMA</given-names>
            <surname>Ltd</surname>
          </string-name>
          .,
          <string-name>
            <surname>Shoumen</surname>
            , Bulgaria, Varna, Bulgaria, [8]
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
          </string-name>
          , Squad:
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <year>2023</year>
          , pp.
          <fpage>937</fpage>
          -
          <lpage>948</lpage>
          . URL: https://aclanthology.org/ 100, 000+ questions for machine comprehension
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2023.ranlp-
          <volume>1</volume>
          .101. of text, in: J.
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Carreras</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Duh (Eds.), [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Staniszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tworkowski</surname>
          </string-name>
          ,
          <source>Proceedings of the 2016 Conference on Empirical</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>impact of sequence composition on language model</source>
          <year>2016</year>
          , Austin, Texas, USA, November 1-
          <issue>4</issue>
          ,
          <year>2016</year>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>pre-training</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402. Association for Computational Linguistics,
          <year>2016</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          13991. arXiv:
          <volume>2402</volume>
          .13991. pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          . URL: https://doi.org/10.18653/v1/ [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <fpage>d16</fpage>
          -
          <lpage>1264</lpage>
          . doi:
          <volume>10</volume>
          .18653/V1/D16-1264.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Majewska</surname>
          </string-name>
          , Q. Liu,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>document boundaries</article-title>
          ,
          <source>ArXiv abs/2310</source>
          .10638 (
          <year>2023</year>
          ).
          <article-title>dataset for causal commonsense reasoning</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          264172290.
          <source>ings of the 2020 Conference on Empirical Meth</source>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , The probabilis- ods
          <source>in Natural Language Processing (EMNLP)</source>
          , As-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>tic relevance framework: Bm25 and beyond, sociation for Computational Linguistics</article-title>
          , Online,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Found. Trends</given-names>
            <surname>Inf</surname>
          </string-name>
          .
          <source>Retr</source>
          .
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          . URL:
          <year>2020</year>
          , pp.
          <fpage>2362</fpage>
          -
          <lpage>2376</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          https://doi.org/10.1561/1500000019. doi:
          <volume>10</volume>
          .1561/ org/2020.emnlp-main.
          <volume>185</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          1500000019. emnlp-main.
          <volume>185</volume>
          . [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          , J. D. [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rinott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , H. Schwenk,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , I. Sutskever, tics, Online,
          <year>2020</year>
          , pp.
          <fpage>7315</fpage>
          -
          <lpage>7330</lpage>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners, aclanthology</article-title>
          .org/
          <year>2020</year>
          .acl-main.
          <volume>653</volume>
          . doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , v1/
          <year>2020</year>
          .acl-main.
          <volume>653</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information</source>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          , G. Pucci, Knowing knowledge: Epis-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
          <article-title>temological study of knowledge in transform-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Inc.</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings. ers,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ). URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          neurips.cc/paper_files/paper/2020/file/ www.mdpi.com/2076-3417/13/2/677. doi:
          <volume>10</volume>
          .3390/
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. app13020677.</source>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Onorati</surname>
          </string-name>
          , L. Ranaldi, [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Pucci, Does the English matter? elicit</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>zotto, Investigating the impact of data contamina- D</article-title>
          . Ataman (Ed.),
          <source>Proceedings of the 3rd Workshop</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <article-title>Findings of the Association for Computational Lin-</article-title>
          pore,
          <year>2023</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>183</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>guistics ACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational org/</article-title>
          <year>2023</year>
          .mrl-
          <volume>1</volume>
          .14. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .mrl-1.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Bangkok, Thailand and virtual meeting,
          <volume>14</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <year>2024</year>
          , pp.
          <fpage>13909</fpage>
          -
          <lpage>13920</lpage>
          . URL: https://aclanthology. [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          org/
          <year>2024</year>
          .findings-acl.
          <volume>827</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          . F.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>A tree-of-thoughts to broaden</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>findings-acl.827. multi-step reasoning across languages</article-title>
          , in: K. Duh, [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Qin,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.), Findings of the Associ-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE-X:</surname>
          </string-name>
          Eval- ation
          <source>for Computational Linguistics: NAACL</source>
          <year>2024</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <article-title>from an out-of-distribution generalization perspec-</article-title>
          ico
          <string-name>
            <surname>City</surname>
          </string-name>
          , Mexico,
          <year>2024</year>
          , pp.
          <fpage>1229</fpage>
          -
          <lpage>1241</lpage>
          . URL:
          <article-title>https: Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017</article-title>
          .
          <article-title>Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017</article-title>
          .
          <article-title>Barack Obama was the 44th President of the United States, serving two terms from 2009 to 2017</article-title>
          .
          <article-title>Barack Obama è stato il 44º Presidente degli Stati Uniti, in carica per due mandati dal 2009 al 2017</article-title>
          .
          <article-title>Who was the 44th President of the United States? Chi è stato il 44º Presidente degli Stati Uniti? Chi è stato il 44º Presidente degli Stati Uniti? Who was the 44th President of the United States? Target Answer Barack Obama Barack Obama</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>