<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">How far does the sequence of compositions impact Multilingual Pre-Training?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Leonardo</forename><surname>Ranaldi</surname></persName>
							<email>lranaldi@ed.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">University of Edinburgh</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giulia</forename><surname>Pucci</surname></persName>
							<email>g.pucci.24@abdn.uk</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computing Science</orgName>
								<orgName type="institution">University of Aberdeen</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><forename type="middle">Massimo</forename><surname>Zanzotto</surname></persName>
							<email>fabio.massimo.zanzotto@uniroma2.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Università degli Studi Roma &quot;Tor Vergata&quot;</orgName>
								<address>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">How far does the sequence of compositions impact Multilingual Pre-Training?</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D9B800C342C8F23200719B6BAEF933AD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Pre-training Methods</term>
					<term>Cross-lingual Generalisation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>An efficient strategy for pre-training language models is the concatenation of contiguous sequences of text of fixed length combined with causal masking, which estimates the probability of each token given its context. Yet earlier work suggests that this technique affects the performance of the model, as it might include misleading information from previous text sequences during pre-training. To fill this gap, intra-context and rank-based causal masking techniques have been proposed, in which the probability of each token is conditioned only on the previous tokens in the same document or in ranked sequences, avoiding misleading information from different contexts. However, the sequences provided by these techniques have been little explored, overlooking the opportunity to optimise their composition by manipulating the volume and heterogeneity of the sequences and improving unbalanced pre-training settings. In this paper, we demonstrate that organising text chunks based on a policy aligned with text similarity effectively improves pre-training, enhances the learning and cross-lingual generalisation capabilities of language models, maintains efficiency, and requires fewer instances.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large language models (LLMs) are pre-trained on huge amounts of documents by optimising a language modelling objective and show an intriguing ability to solve various downstream NLP tasks. Ranaldi et al. <ref type="bibr" target="#b0">[1]</ref>, in multilingual settings, and later Zhao et al. <ref type="bibr" target="#b1">[2]</ref> highlighted the importance of pre-training data quality, diversity and composition methodologies. Our research takes a step further by exploring the influence of pre-training sequence heterogeneity on cross-lingual generalisation, potentially leading to significant advancements in understanding LLMs' learning properties.</p><p>In decoder-only pre-training, instances are constructed via packing, which combines randomly sampled texts (i.e., documents) into a chunk that matches the size of the context window without using any selection policy. Causal masking then predicts the next token conditioned on the previous ones, including those from different documents (portions of non-contiguous texts) in the chunk. The ways to mitigate this arbitrary procedure are: (i) intra-document causal masking <ref type="bibr" target="#b2">[3]</ref>, where the likelihood of each token is conditioned on the previous tokens from the same document, and (ii) retrieval-based masking <ref type="bibr" target="#b1">[2]</ref>, where similar documents retrieved by a retrieval system condition the likelihood.</p><p>To study the role of heterogeneity and volume of samples in sequence composition strategies (i.e., packing and masking pipelines), we pre-train language models using different masking approaches (described in §2.2) and compare them with models pre-trained via traditional causal masking with different packing approaches, varying the amount and composition of the documents in the pre-training chunks. 
To study the impact on cross-lingual generalisation, we use cross-lingual settings (i.e., Italian-English). Complementing the foundational approaches proposed in <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, we operate on bilingual corpora. Hence, we analyse the results produced by a commonly used baseline method that randomly samples and packs documents (RandomChunk), a process that samples and packs documents from the same source based on their composition and origin (UniChunk), and an efficient retrieval-based packing method, which retrieves and packs related documents ( §2.1).</p><p>The experimental results indicate that operating via causal masking (RandomChunk) with arbitrary sequence patterns of documents leads to the inclusion of misleading information stemming from different contexts during pre-training ( §3), negatively impacting the performance of the models in downstream tasks ( §4). Instead, intra-document causal masking, which avoids these misleading phenomena during pre-training, significantly improves the models' performance and does not impact the runtime. Although intra-document causal masking performs well, it limits the ability of sequence composition to mix documents from different corpora (in our case, in different languages as well). As also revealed by Zhao et al. <ref type="bibr" target="#b1">[2]</ref>, this is partly solved by UniChunk's avoidance of packing documents from different distributions, which improves the performance of causal masking models in downstream tasks but still does not allow individual sequences to be selected. 
Hence, we use a retrieval-based packing method, which allows operating directly on sequences, improving cross-lingual models' language modelling, in-context learning and generative capabilities while using causal masking, thus paying a small cost for document sorting but achieving tangible results.</p><p>Our main findings can be summarised as follows: • By analysing different pre-training strategies in cross-lingual settings, we reveal that operating through causal masking while considering the order and sequence patterns of documents leads to significant improvements. In addition, retrieval-based techniques provide resilience and allow for the selection of pre-training sequences, guaranteeing heterogeneity and reducing data requirements ( §3). • We show important benefits for the in-context learning capabilities of downstream models. We observe that in low-resource settings, it is possible to achieve the same performance and, in some cases, cross-lingual generalisation (in our case, English-Italian) ( §4). • In conclusion, we show that the retrieval-based packing method, by allowing a flexible sequence composition process, yields tangible benefits for unbalanced cross-lingual learning while using less pre-training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Pre-Training Strategies</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Packing Approaches</head><p>Let 𝒟𝑠 represent a corpus, and let 𝒟 = ⋃︀ 𝑠 𝒟𝑠 denote the union of such corpora. Specifically, each corpus 𝒟𝑠 is a set of documents 𝒟𝑠 = {𝑑1, . . . , 𝑑 |𝒟𝑠| }, where each 𝑑𝑖 is defined as a sequence of tokens 𝑑𝑖 = (𝑥1, . . . , 𝑥 |𝑑 𝑖 | ). The packing strategy involves first selecting a set of documents {𝑑𝑖} 𝑛 𝑖=1 from 𝒟, and then packing them into a chunk 𝐶 with a fixed length |𝐶| = 𝐿. The documents {𝑑𝑖} 𝑛 𝑖=1 are concatenated by interleaving them with end-of-sentence ([eos]) tokens. Hence, 𝐶 is denoted as:</p><formula xml:id="formula_0">𝐶 = {𝑑𝑖 ⊕ [eos] | 𝑖 = 1 . . . 𝑛 − 1} ⊕ s(𝑑𝑛), (1)</formula><p>where [eos] is the end-of-sentence token, s() truncates the last document such that |𝐶| = 𝐿, and the content of the chunk 𝐶 is removed from the dataset 𝒟 to avoid sampling the same documents multiple times.</p><p>Following the strategies proposed in <ref type="bibr" target="#b1">[2]</ref>, we use three strategies to sample the documents {𝑑𝑖} 𝑛 𝑖=1 from the dataset 𝒟 for composing pre-training chunks.</p><p>In contrast to previous works, we use 𝛼 ∈ [0, 1] to control the fraction of the corpus used. Hence, we use 𝒮 ⊆ 𝒟 with |𝒮| = ⌊𝛼 × |𝒟|⌋.</p><p>We define the three strategies (Baseline, Sequence-based and Ranking-based) as follows:</p><p>Baseline The common baseline approach, called RandomChunk, in which documents 𝑑𝑖 ∈ 𝒟 are sampled uniformly at random from the entire pre-training corpus 𝒟:</p><formula xml:id="formula_1">(𝒟, 𝛼) = {︃ 𝑛 ⨁︁ 𝑖=1 𝑑𝑖 ⊕ [eos] | 𝑑𝑖 ∼ Uniform(𝒮) }︃ (2)</formula><p>where 𝒮 ⊆ 𝒟 and |𝒮| = ⌊𝛼 × |𝒟|⌋. As a result, in RandomChunk, a chunk can contain documents from different sources, as shown in Figure <ref type="figure">1</ref>.</p></div>
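The packing procedure in Eqs. (1)–(2) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: documents are assumed to be lists of tokens, and the function name and the literal "[eos]" string are hypothetical.

```python
import random

EOS = "[eos]"

def pack_random_chunk(corpus, chunk_len, alpha, rng=None):
    """RandomChunk: pack uniformly sampled documents into one fixed-length chunk.

    corpus: list of documents, each a list of tokens.
    alpha:  fraction of the corpus available for sampling (|S| = floor(alpha * |D|)).
    """
    rng = rng or random.Random(0)
    pool = rng.sample(corpus, int(alpha * len(corpus)))  # S, a subset of D
    chunk = []
    while pool and len(chunk) < chunk_len:
        doc = pool.pop(rng.randrange(len(pool)))  # d_i ~ Uniform(S)
        chunk.extend(doc + [EOS])                 # d_i followed by [eos]
    return chunk[:chunk_len]                      # s(): truncate so that |C| = L
```

Sampling without replacement from the pool mirrors the paper's removal of packed content from 𝒟.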
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sequence-based</head><p>The UniChunk approach is sequence-based and respects the sequences of the corpora. Hence, each chunk is composed of documents from a single source corpus 𝒟𝑠:</p><formula xml:id="formula_2">(𝒟𝑠, 𝛼) = {︃ 𝑛 ⨁︁ 𝑖=1 𝑑𝑖 ⊕ [eos] | 𝑑𝑖 ∼ Uniform(𝒮𝑠) }︃<label>(3)</label></formula><p>where 𝒮𝑠 ⊆ 𝒟𝑠, |𝒮𝑠| = ⌊𝛼 × |𝒟𝑠|⌋ and 𝒟𝑠 ⊆ 𝒟. This strategy avoids packing documents from different corpora and allows control over the amount of data utilised from each specific corpus, enhancing efficient usage of computational resources while preserving thematic coherence.</p></div>
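UniChunk differs from the baseline only in that sampling in Eq. (3) is restricted to a single source corpus 𝒟𝑠. A minimal sketch under the same assumptions as before (token-list documents; hypothetical names):

```python
import random

EOS = "[eos]"

def pack_uni_chunk(corpora, chunk_len, alpha, rng=None):
    """UniChunk: choose one source corpus D_s, then pack only its documents."""
    rng = rng or random.Random(0)
    source = rng.choice(sorted(corpora))             # one source per chunk
    docs = corpora[source]
    pool = rng.sample(docs, int(alpha * len(docs)))  # S_s, a subset of D_s
    chunk = []
    while pool and len(chunk) < chunk_len:
        chunk.extend(pool.pop(rng.randrange(len(pool))) + [EOS])
    return source, chunk[:chunk_len]                 # truncate so that |C| = L
```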
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ranking-based</head><p>To increase the relevance of documents in pre-training chunks, we use a retriever-based pipeline (BM25-based <ref type="bibr" target="#b3">[4]</ref>) to construct pre-training chunks, which we call Bm25Chunk. Hence, given a document 𝑑𝑖 ∈ 𝒟𝑠, a sequence of documents {𝑑𝑖} 𝑛 𝑖=1 is retrieved via 𝑑𝑖+1 = Retrieve(𝑑𝑖, 𝒟𝑠); here, Retrieve(𝑑𝑖, 𝒟𝑠) collects the most similar documents to 𝑑𝑖 from 𝒟𝑠 using BM25 ranking.</p><p>However, the retrieval process can be computationally heavy due to the size of the pre-training corpus 𝒟𝑠. To improve the efficiency of the retrieval step, a subset ℬ𝑠 ⊆ 𝒟𝑠 of the corpus 𝒟𝑠 is used, reducing the computational complexity of retrieval as proposed in <ref type="bibr" target="#b1">[2]</ref>.</p><p>In particular, ℬ𝑠 ⊆ 𝒟𝑠 contains 𝑘 documents uniformly sampled from 𝒟𝑠. To control the number of utilised documents, we operate via 𝛼, which regulates the fraction of 𝑘. Hence, we use ℬ𝛼 ⊆ ℬ𝑠 where |ℬ𝛼| = ⌊𝛼 × |ℬ𝑠|⌋.</p><p>This buffer serves as the retrieval source for constructing pre-training chunks:</p><p>𝑑1 ∼ Uniform(ℬ𝑠), 𝑑𝑖+1 = Retrieve(𝑑𝑖, ℬ𝛼).</p><p>After retrieving a sequence of documents {𝑑𝑖} 𝑛 𝑖=1 from ℬ𝛼 to construct a chunk, the buffer is refilled by sampling novel documents from 𝒟𝑠.</p></div>
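The retrieval chain 𝑑𝑖+1 = Retrieve(𝑑𝑖, ℬ𝛼) can be sketched as follows. For illustration, candidates are scored with a plain BM25 implementation over tokenised documents rather than an optimised retrieval system, the previous document itself acts as the query, and all names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenised doc against the query with the standard BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_chain(buffer, n):
    """d_1 from the buffer, then d_{i+1} = argmax BM25(d_i, buffer) among unused docs."""
    used = [0]  # index of d_1 (uniformly sampled upstream; here simply the first doc)
    for _ in range(n - 1):
        scores = bm25_scores(buffer[used[-1]], buffer)
        best = max((i for i in range(len(buffer)) if i not in used),
                   key=lambda i: scores[i])
        used.append(best)
    return [buffer[i] for i in used]
```

The retrieved sequence would then be packed exactly as in Eq. (1).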
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Masking Approaches</head><p>The masking strategy is the other critical stage of language model pre-training, defining how next-token prediction distributions are conditioned on preceding tokens in a given sequence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Causal Masking</head><p>In causal masking, each token in a sequence is predicted based on all previous tokens. Specifically, given a chunk 𝐶 = (𝑥1, . . . , 𝑥 |𝐶| ), the likelihood of 𝐶 is given by:</p><formula xml:id="formula_3">𝑃 (𝐶) = |𝐶| ∏︁ 𝑖=1 𝑃 (𝑥𝑖 | 𝑥1, . . . , 𝑥𝑖−1),</formula><p>where 𝑃 (𝑥𝑖 | 𝑥1, . . . , 𝑥𝑖−1) is the probability of the token 𝑥𝑖 given the previous tokens 𝑥1, . . . , 𝑥𝑖−1 in the chunk. During pre-training, causal masking means that, given a chunk 𝐶, the likelihood of each token in 𝐶 is conditioned on all previous tokens, including those that stem from different documents.</p></div>
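In implementations, this conditioning is usually realised as a lower-triangular attention mask. A minimal sketch with plain Python lists rather than tensors (the function name is hypothetical):

```python
def causal_mask(seq_len):
    """Standard causal mask: position i may attend to every position j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

Row 𝑖 marks the tokens that condition the prediction at position 𝑖, regardless of document boundaries inside the chunk.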
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intra-Document Causal Masking</head><p>In intra-document causal masking, the probability of each token is conditioned on the previous tokens within the same document and, consequently, the same context. Hence, using a fraction 𝒮 ⊆ 𝒟 where |𝒮| = ⌊𝛼 × |𝒟|⌋, we construct the chunks 𝐶 as defined in §2.1. The probability of each token 𝑑𝑖𝑗 belonging to document 𝑑𝑖 is only conditioned on the previous tokens within 𝑑𝑖:</p><formula xml:id="formula_4">𝑃 (𝐶) = 𝑛 ∏︁ 𝑖=1 |𝑑 𝑖 | ∏︁ 𝑗 𝑃 (︀ 𝑑𝑖𝑗 | 𝑑𝑖1, . . . , 𝑑 𝑖(𝑗−1) )︀ ,<label>(4)</label></formula><p>where each 𝑑𝑖 is sampled from 𝐶 as defined above. The models trained using this approach are called IntraDoc in the rest of the paper.</p></div>
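Intra-document masking additionally zeroes attention across document boundaries, making the causal mask block-diagonal. A sketch under the same assumptions as above (per-position document indices; hypothetical names):

```python
def intra_doc_mask(doc_ids):
    """Block-diagonal causal mask: token i attends to j <= i only when
    positions i and j belong to the same document in the packed chunk.

    doc_ids: document index of each position, e.g. [0, 0, 1, 1].
    """
    n = len(doc_ids)
    return [[1 if j <= i and doc_ids[i] == doc_ids[j] else 0
             for j in range(n)]
            for i in range(n)]
```

Compared with the plain causal mask, each document's tokens see only their own prefix, matching Eq. (4).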
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Language Modeling Settings</head><p>Models The implementation is based on GPT-2 <ref type="bibr" target="#b4">[5]</ref>. We pre-train 124-million-parameter models using context windows of 256 and 512 tokens. To observe the effect of different data compositions, we fix the vocabulary and model parameters described in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corpora &amp; Settings</head><p>We combine three high-quality open-source corpora: C4, CulturaX, and Wikipedia. We construct the corpus 𝒟 by operating through the methods proposed in §2 on both 𝒟𝐸𝑛 and 𝒟𝐼𝑡, and then we combine them. Moreover, to observe the impact of the quantity of pre-training instances, we use a scaling factor 𝛼 that operates during the construction of 𝒟𝐸𝑛 and 𝒟𝐼𝑡.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>To analyse the behaviour of the proposed approaches, we evaluate the models' perplexities ( §4.1), in-context learning ( §4.2), understanding ( §4.3) and question-answering capabilities ( §4.4) under different configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Perplexity</head><p>We compute the perplexity (PPL) in two different setups: (i) models pre-trained with an equal quantity of data and then evaluated on a held-out set of documents where each document is treated independently; (ii) models pre-trained with a quantity of data scaled by a factor 𝛼 ∈ {0.1, 0.25, 0.5, 0.75} and then evaluated on the same kind of held-out set. While the first configuration allows one to observe whether the proposed methods induce overfitting (data contamination <ref type="bibr" target="#b5">[6]</ref>), the second experiment analyses the impact of the amount of data used.</p><p>The impact of Sequence Composition Table <ref type="table" target="#tab_1">1</ref> shows that Bm25Chunk achieves the lowest PPL among the three causal masking models, yielding a lower average PPL compared to RandomChunk (by more than about 5 points in both settings) and UniChunk (by around 3.2 points in both settings). Increasing the correlation of documents in a sequence improves the language modelling ability of the pre-trained models. When considering models trained via intra-document causal masking, it emerges that IntraDoc achieves the lowest PPL compared to the models trained via causal masking. Generally, all methods obtain significantly lower PPLs on Wikipedia (particularly Bm25Chunk and IntraDoc). This phenomenon could imply that when the pre-training sources are very common (a lower PPL indicates better-known text), these texts are more influenced by documents with different contexts (misleading contexts), and the proposed strategies mitigate this problem.</p><p>The role of Quantity Figure <ref type="figure" target="#fig_0">2</ref> shows that Bm25Chunk consistently achieves a lower average PPL than the other approaches even when decreasing the amount of pre-training data. 
In fact, in both settings (Figure <ref type="figure" target="#fig_0">2</ref>), it can be observed that the average PPL of RandomChunk and UniChunk decreases as the amount of pre-training data used increases. While intra-document causal masking performs similarly to Bm25Chunk in resource-based settings (red line and green line in Figure <ref type="figure" target="#fig_0">2</ref>), increasing 𝛼 for intra-document causal masking reduces the PPL less consistently. Finally, it can be observed that Bm25Chunk reaches stable performance even with 𝛼 = 0.75. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">In-Context Learning</head><p>Following Zhao et al. <ref type="bibr" target="#b1">[2]</ref>, we evaluate the in-context learning abilities of the models using GLUE-X <ref type="bibr" target="#b6">[7]</ref> (SST2, CoLA and RTE) both in English and Italian.</p><p>Table <ref type="table" target="#tab_3">2</ref> reports the average in-context learning accuracy of the models in few-shot settings, using 15 demonstrations for the 256 model and 20 for the 512 model, respectively. Bm25Chunk yields a higher average accuracy than RandomChunk for 256 (+5.12%) and 512 (+1.55%). These results demonstrate that increasing the correlation of the documents in pre-training chunks improves the models' in-context learning abilities.</p><p>In Figure <ref type="figure" target="#fig_1">3</ref>, we report the average accuracy using different numbers of few-shot demonstrations. Bm25Chunk achieves accuracy on par with IntraDoc in the 256 setting; however, IntraDoc obtains a significantly higher accuracy than Bm25Chunk in the 512 setting. Finally, RandomChunk and UniChunk obtain comparable results using different context lengths, and they do not consistently improve accuracy when increasing the number of demonstrations. This might be due to the greater levels of distraction in both settings, which use arbitrary packing strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Evaluation results of natural language understanding, commonsense reasoning and QA tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Understanding &amp; Commonsense</head><p>We evaluate the pre-trained models on natural language understanding, commonsense reasoning tasks (i.e., XSQuAD <ref type="bibr" target="#b7">[8]</ref>, XCOPA <ref type="bibr" target="#b8">[9]</ref>), and question-answering (i.e., MLQA <ref type="bibr" target="#b9">[10]</ref>).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Multilinguality</head><p>To assess code-switching abilities, we experimented with cross-lingual input by operating with MLQA. We crossed the languages, delivering contexts in English and questions in Italian, and vice versa (Appendix C). Figure <ref type="figure" target="#fig_2">4</ref> shows that Bm25Chunk outperforms both RandomChunk and intra-document causal masking. At the same time, IntraDoc, as discussed in §4.3 for MLQA, outperforms Bm25Chunk. This result confirms that IntraDoc's performance is not only related to monolingual learning sequences but also to more complex dynamics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The role of pre-training sampling is strategic. We analyse the impact of sequencing by pre-training several language models on multilingual corpora. We showed that causal masking involves misleading documents that confound the pre-training of language models and impact the performance in downstream tasks. Hence, we find that improving sequence correlation in pre-training chunks reduces potential distractions while improving the performance of language models without reducing pre-training efficiency. In the future, we will study whether these findings achieve benefits in fine-tuning pipelines <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> as well.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Average perplexities when decreasing the training set.</figDesc><graphic coords="4,333.12,268.73,142.35,85.41" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Average in-context learning accuracy using different numbers of input demonstrations.</figDesc><graphic coords="5,89.29,242.52,203.36,101.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Evaluation results of MultiLingual Question Answering by providing cross-lingual input (en-it means context in English and question in Italian and vice versa as described in Appendix C).</figDesc><graphic coords="5,317.87,194.10,172.85,86.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Evaluation of perplexity on test set created by sampling the original pre-training corpora (Appendix D).</figDesc><table><row><cell>𝐿</cell><cell>Model</cell><cell>C4</cell><cell cols="2">CulturaX Wiki</cell><cell>Avg.</cell></row><row><cell></cell><cell>RandomChunk</cell><cell>20.12</cell><cell>19.61</cell><cell>9.89</cell><cell>16.5</cell></row><row><cell>256</cell><cell>UniChunk Bm25Chunk</cell><cell>18.83 14.96</cell><cell>15.65 15.07</cell><cell>8.56 5.23</cell><cell>14.3 11.4</cell></row><row><cell></cell><cell>IntraDoc</cell><cell>14.04</cell><cell>13.57</cell><cell>5.08</cell><cell>10.7</cell></row><row><cell></cell><cell>RandomChunk</cell><cell>19.32</cell><cell>18.76</cell><cell>9.55</cell><cell>15.9</cell></row><row><cell>512</cell><cell>UniChunk Bm25Chunk</cell><cell>18.22 13.85</cell><cell>15.11 13.27</cell><cell>7.89 5.02</cell><cell>13.4 10.7</cell></row><row><cell></cell><cell>IntraDoc</cell><cell>12.98</cell><cell>13.07</cell><cell>4.39</cell><cell>10.0</cell></row></table><note>𝐿 is the context window for pre-training (next-token accuracy in Appendix B).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Average In-context learning performance evaluated by text classification accuracy across three tasks. Accuracies for English and Italian are reported in Appendix E.</figDesc><table /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>It emerges that Bm25Chunk outperforms RandomChunk and UniChunk in all tasks, confirming that increasing the similarity of documents in pre-training chunks improves understanding abilities. Specifically, Bm25Chunk obtains a significantly better accuracy on MLQA, showing it can operate on in-context information provided in the input question. However, even though Bm25Chunk achieves solid performances, IntraDoc obtains the best average performance. This indicates that eliminating potential distractions from unrelated documents and learning each document separately improves understanding and generation abilities. This finding differs from the ideas in previous works, which suggested that pre-training with multiple documents in one context, adding distraction in context during pre-training, benefits in-context learning and understanding ability.</p></div>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Pre-training Setup</head><p>In our experiments, we use GPT-2 small, a 124-million-parameter model with 12 layers, a hidden size of 768, and 12 attention heads. We use a batch size of 0.5 million tokens for both the 256 and 512 context-window models and pre-train the models on 20B tokens for 100,000 steps. We use the Adam optimiser with 𝛽 1 = 0.90, 𝛽 2 = 0.95, a weight decay of 0.1, and a cosine learning rate scheduler. The peak learning rate is 3 × 10 −4 , decreasing to 3 × 10 −5 at the end. We perform the experiments using 16 Nvidia RTX A6000 GPUs with 48GB of VRAM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Subset</head><p># documents # words</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Size of pre-training corpora. For computational reasons, we produced equivalent samples for both English and Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Next Token Accuracy of Pre-Trained Language Models</head><p>In addition to PPL, we report the next-token accuracy of pre-trained language models in Table <ref type="table">5</ref>.</p><p>Specifically, we define Acc as:</p><formula>Acc = (1/𝑁) ∑︁ 𝑁 𝑖=1 I(𝑦 ^𝑖 = 𝑦 𝑖),</formula><p>where:</p><p>• 𝑁 is the total number of tokens in the test set.</p><p>• 𝑦 ^𝑖 is the token predicted by the model at position 𝑖.</p><p>• 𝑦 𝑖 is the correct (ground truth) token at position 𝑖.</p><p>• I is the indicator function, which is 1 if 𝑦 ^𝑖 = 𝑦 𝑖 and 0 otherwise. </p></div>
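This definition amounts to the mean of per-position matches between predicted and ground-truth tokens; a minimal sketch (the function name is hypothetical):

```python
def next_token_accuracy(pred, gold):
    """Acc = (1/N) * sum_i I(pred_i == gold_i) over the N test-set positions."""
    assert len(pred) == len(gold) and gold, "sequences must be non-empty and aligned"
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)
```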
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝐿</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. In-context Learning Performance in English and Italian</head><p>This section reports the results obtained on the tasks introduced in Section 4.2. To conduct a more detailed analysis, we have used the original (English) and Italian versions of three tasks belonging to the GLUE family. We selected SST2, CoLA, and RTE. The bilingual versions were taken from the contribution previously proposed by Yang et al. <ref type="bibr" target="#b6">[7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 8</head><p>In-context learning performance evaluated by text classification accuracy across three Italian tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Understanding and Commonsense Performance in English and Italian</head><p>This section reports the results obtained on the tasks introduced in Section 4.3. We have used the original (English) and Italian versions of MLQA, XCOPA, and SQuAD to conduct a more detailed analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 10</head><p>Evaluation results of natural language understanding, commonsense reasoning and QA tasks in Italian.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Modeling easiness for training transformers with curriculum learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.ranlp-1.101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Mitkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Angelova</surname></persName>
		</editor>
		<meeting>the 14th International Conference on Recent Advances in Natural Language Processing<address><addrLine>Shoumen, Bulgaria, Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>INCOMA Ltd</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="937" to="948" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Analysing the impact of sequence composition on language model pre-training</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Staniszewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tworkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Miłoś</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Minervini</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.13991" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.10638</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:264172290" />
		<title level="m">In-context pretraining: Language modeling beyond document boundaries</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The probabilistic relevance framework: BM25 and beyond</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
		<idno type="DOI">10.1561/1500000019</idno>
		<ptr target="https://doi.org/10.1561/1500000019" />
	</analytic>
	<monogr>
		<title level="j">Found. Trends Inf. Retr.</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="333" to="389" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Investigating the impact of data contamination of large language models in text-to-SQL translation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Onorati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giannone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Favalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Romagnoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-acl.827</idno>
		<ptr target="https://aclanthology.org/2024.findings-acl.827" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2024</title>
		<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="13909" to="13920" />
		</imprint>
	</monogr>
	<note>and virtual meeting</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.806</idno>
		<ptr target="https://aclanthology.org/2023.findings-acl.806" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="12731" to="12750" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">SQuAD: 100,000+ questions for machine comprehension of text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/D16-1264</idno>
		<ptr target="https://doi.org/10.18653/v1/D16-1264" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Carreras</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<meeting>the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016<address><addrLine>Austin, Texas, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">November 1-4, 2016</date>
			<biblScope unit="page" from="2383" to="2392" />
		</imprint>
	</monogr>
	<note>The Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">XCOPA: A multilingual dataset for causal commonsense reasoning</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Ponti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Glavaš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Majewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.185</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.185" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
		<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2362" to="2376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">MLQA: Evaluating cross-lingual extractive question answering</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oguz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rinott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.653</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.653" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7315" to="7330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Knowing knowledge: Epistemological study of knowledge in transformers</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="DOI">10.3390/app13020677</idno>
		<ptr target="https://www.mdpi.com/2076-3417/13/2/677" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Does the English matter? Elicit cross-lingual abilities of large language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.mrl-1.14</idno>
		<ptr target="https://aclanthology.org/2023.mrl-1.14" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Ataman</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Multi-lingual Representation Learning (MRL)<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="173" to="183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A tree-of-thoughts to broaden multi-step reasoning across languages</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-naacl.78</idno>
		<ptr target="https://aclanthology.org/2024.findings-naacl.78" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: NAACL 2024</title>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gomez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<meeting><address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1229" to="1241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Does the language matter? curriculum learning over neo-Latin languages</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.464" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="5212" to="5220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Aligning large and small language models via chain-of-thought reasoning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.eacl-long.109" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Purver</surname></persName>
		</editor>
		<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>St. Julian&apos;s, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1812" to="1827" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Self-refine instruction-tuning for aligning reasoning in language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.emnlp-main.139" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Al-Onaizan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y.-N</forename><surname>Chen</surname></persName>
		</editor>
		<meeting>the 2024 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Miami, Florida, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="2325" to="2347" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
