<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extending Translate-Train for ColBERT-X to African Language CLIR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugene Yang</string-name>
          <email>eugene.yang@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dawn J. Lawrie</string-name>
          <email>lawrie@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul McNamee</string-name>
          <email>mcnamee@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Mayfield</string-name>
          <email>mayfield@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Language Technology Center of Excellence, Johns Hopkins University</institution>, <addr-line>Baltimore, Maryland</addr-line>, <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the HLTCOE team's submission runs for the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.</p>
      </abstract>
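      <conference>
        <conf-name>FIRE'23: Forum for Information Retrieval Evaluation</conf-name>
      </conference>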
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        We use MS MARCO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] training triples with English queries and machine-translated African
language passages (Hausa, Somali, Swahili, and Yoruba) to perform Translate-Train for
ColBERT-X based on the out-of-the-box XLM-RoBERTa-Large model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (Run 4) and our MLM-fine-tuned
version (Run 6). We also compare with the ColBERT-X model trained with English MS MARCO,
i.e., English-Trained (Runs 3 and 5), to understand the quality and usefulness of the
machine-translated MS MARCO in the African languages. Finally, we experiment with a new technique,
JH POLO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], that uses large language models to generate English training queries drawn from
the retrieval collection to perform in-domain retrieval fine-tuning (Run 7).
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Machine Translation</title>
      <p>We used automated machine translation (MT) in two principal ways for the evaluation. First, we
used document translation to create English-language representations of the CIRAL document
collections, as this directly enables search using English queries. Second, we translated the MS
MARCO passages from English to the four African languages.</p>
      <p>
        Transformer-based models were trained using Amazon’s Sockeye v2 toolkit [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with training
data that came principally from the open-source repository OPUS [<xref ref-type="bibr" rid="ref11">11</xref>]. Preprocessing steps
included running the Moses tokenizer, removing duplicate lines, and learning subword
units using the subword-nmt toolkit. Case was retained. Notable hyperparameters include:
6 layers in both encoder and decoder; 512-dimensional embeddings; 8 attention heads;
2,048 hidden units per layer; 30,000 subword byte pair encoding (BPE) units, learned separately for the source
and target languages; a batch size of 4,096; and the Adam optimizer with an initial learning rate of
2 × 10<sup>−4</sup>.
      </p>
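      <p>As a concrete illustration, the minimal sketch below mirrors the preprocessing just described using the sacremoses and subword-nmt Python packages; the file names and the English side are illustrative, and the actual pipeline may have used the equivalent command-line tools instead.</p>
      <preformat>
# A sketch of the bitext preprocessing described above, using the sacremoses
# and subword-nmt Python packages; file names are illustrative.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

def tokenize_and_dedupe(raw_path, tok_path, lang):
    """Moses-tokenize, drop duplicate lines, and retain case."""
    seen = set()
    tokenizer = MosesTokenizer(lang=lang)
    with open(raw_path, encoding="utf-8") as fin, open(tok_path, "w", encoding="utf-8") as fout:
        for line in fin:
            tok = tokenizer.tokenize(line.strip(), return_str=True)
            if tok and tok not in seen:
                seen.add(tok)
                fout.write(tok + "\n")

tokenize_and_dedupe("train.raw.en", "train.tok.en", "en")

# Learn 30,000 BPE merges for this side of the bitext (the other side gets
# its own, separately learned codes).
with open("train.tok.en", encoding="utf-8") as fin, open("bpe.codes.en", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=30000)

# Segment text with the learned codes before handing it to sockeye.train.
with open("bpe.codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("Cross-language retrieval needs good translations ."))
      </preformat>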
      <sec id="sec-3-1">
        <title>2.1. Document Translation</title>
        <p>When the query language is known ahead of time, it is possible to translate documents into
the query language, effectively reducing the CLIR problem to a monolingual task. Of course,
the quality of automated machine translation can vary considerably, and some queries can
materially suffer if named entities or other essential query elements are mistranslated. When
languages have fewer resources, and when source and target languages differ in linguistic
typology, translation can be challenging.</p>
        <p>To increase the likelihood of producing better quality document translations we created
synthetic training bitext so that our neural machine translation models would have larger
quantities of data to work with. In recently published work, McNamee and Duh [<xref ref-type="bibr" rid="ref12">12</xref>] showed
that back-translation can be particularly efficacious in lower-resource settings and helps with
lexical coverage in the resulting translation system. For this setup, we first trained
English-to-Other models, and used these initial four models to back-translate 7 million sentences of
web-crawled English news. Then for each language these 7 million synthetic translations were
added to our human produced training data (i.e., bitext from OPUS) to then train the forward
models which were used to create English language translations of the four African document
collections.</p>
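        <p>For clarity, the back-translation data flow can be sketched as follows; en2xx_translate is a hypothetical helper standing in for decoding with one of the initial English-to-Other Sockeye models, and the assembled pairs feed the forward (African-language-to-English) model.</p>
        <preformat>
# A minimal sketch of the back-translation data flow described above.
# en2xx_translate is a hypothetical stand-in for decoding with a trained
# English-to-Other Sockeye model.

def build_forward_training_data(en_mono_sentences, en2xx_translate, opus_pairs):
    """Return (source, target) pairs for training the forward xx-to-en model.

    Synthetic pairs place machine-translated African-language text on the
    source side and the original English news sentence on the target side.
    """
    synthetic = [(en2xx_translate(en), en) for en in en_mono_sentences]
    # The 7 million synthetic pairs are combined with the human-produced
    # OPUS bitext before training the forward model.
    return list(opus_pairs) + synthetic
        </preformat>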
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Translating MS MARCO</title>
        <p>MS MARCO was created to support neural IR over English texts. To support the Translate-Train
approach for the cross language setting, we wanted to produce translations of MS MARCO into
Hausa, Somali, Swahili, and Yoruba. The original English dataset consists of 8,841,823 passages
containing 497 million words. Table 2 shows the quantity of training bitext, translation quality
scores on a commonly used benchmark, and the number of words in the translated MS MARCO
dataset, by language.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Training Pipeline</title>
      <p>Our full training pipeline for ColBERT-X starts from the pretrained XLM-RoBERTa Large model,
followed by masked language model (MLM) fine-tuning, retrieval fine-tuning with
Translate-Train, and finally, in-domain fine-tuning with JH POLO. This section describes each fine-tuning
step.</p>
      <sec id="sec-4-1">
        <title>3.1. Masked Language Model Fine-tuning</title>
        <p>
          Since XLM-RoBERTa [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] pretraining does not include Yoruba, we designed a fine-tuning step
to accommodate this absence. However, presenting only Yoruba text to the model during
fine-tuning risks catastrophic forgetting of the model's knowledge of other languages. Specifically, we would
like the language model to retain language knowledge related to the four African languages and
to the query language – English. Therefore, we present documents in Hausa, Somali, Swahili,
Yoruba, and English round-robin to perform masked language model fine-tuning. We used
Common Crawl documents in the AfriBERTa corpus [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for the four African languages and collected
additional English Common Crawl documents to match the genre.
        </p>
        <p>We fine-tune the model for 200,000 update steps using a learning rate of 1 × 10<sup>−5</sup> and a batch
size of 48 text sequences with a maximum length of 512 tokens each. We used four NVIDIA A100
GPUs to train the model. Fine-tuning took around 34 hours to complete.</p>
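        <p>A minimal sketch of this round-robin MLM fine-tuning with the Hugging Face transformers library is shown below; stream_documents is a hypothetical helper yielding raw documents for one language, and the single-GPU loop omits distributed-training details.</p>
        <preformat>
# A sketch of the round-robin MLM fine-tuning described above with Hugging
# Face transformers. stream_documents is a hypothetical helper yielding raw
# text documents for one language; distributed-training details are omitted.
import itertools
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One document stream per language; batches cycle through the streams
# round-robin so Yoruba text does not crowd out the other languages.
streams = {lang: stream_documents(lang) for lang in ("hau", "som", "swa", "yor", "eng")}
cycle = itertools.cycle(streams.values())

for step in range(200_000):
    texts = [next(next(cycle)) for _ in range(48)]  # batch of 48 sequences
    features = [tokenizer(t, truncation=True, max_length=512) for t in texts]
    batch = collator(features)  # applies dynamic masking and padding
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
        </preformat>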
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Retrieval Fine-tuning with Translate-Train</title>
        <p>
          To transform a multilingual language model into a CLIR ColBERT-X model, we fine-tuned
the language model using MS MARCO small training triples with the original English queries
and translated passages (Translate-Train) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We evaluate this Translate-Train with both the
pretrained XLM-RoBERTa model and our MLM-fine-tuned language model. The model is
trained with a contrastive cross-entropy loss over the positive and negative passages
of each training query. For comparison, we also fine-tuned the language model with English
MS MARCO without translation (English-Train).
        </p>
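        <p>The following sketch shows this pairwise objective in PyTorch: the positive and negative late-interaction (MaxSim) scores of each training triple are treated as a two-way classification problem scored with cross-entropy, with the positive passage always as class 0. This is a sketch of the standard ColBERT-style objective, not a verbatim excerpt of the training code.</p>
        <preformat>
# A sketch of the pairwise objective: the positive and negative MaxSim scores
# of each training triple form a two-way classification scored with
# cross-entropy (the positive passage is always class 0).
import torch
import torch.nn.functional as F

def pairwise_ce_loss(pos_scores, neg_scores):
    """pos_scores, neg_scores: (batch,) late-interaction relevance scores."""
    logits = torch.stack([pos_scores, neg_scores], dim=1)   # (batch, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive first
    return F.cross_entropy(logits, labels)
        </preformat>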
        <p>
          We fine-tune the language model with the retrieval objective for 200,000 update steps with a
learning rate of 5 × 10<sup>−6</sup> and a batch size of 64 triples (query, positive passage, and negative passage).
Following the ColBERT-X [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] training setup, we pad the queries to 32 tokens with
[MASK] tokens. Each ColBERT-X model is trained with eight NVIDIA V100 GPUs for around 50
hours. For the official submissions, we used the PLAID [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] implementation of ColBERT training.
However, after the submission, we discovered that the ColBERT-X implementation (https://github.com/hltcoe/ColBERT-X), which is
based on the ColBERT v1 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] codebase, provides a more stable and effective training process.
Thus, we also report a set of unofficial runs using this implementation.
        </p>
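        <p>A sketch of this query augmentation step is shown below; with XLM-RoBERTa, the tokenizer's mask token plays the role of BERT's [MASK], and the query marker tokens used by the actual ColBERT-X implementation are omitted for brevity.</p>
        <preformat>
# A sketch of ColBERT-style query augmentation: queries are padded to a fixed
# 32 tokens with the tokenizer's mask token, which the encoder can use for
# implicit query expansion. Marker tokens used by the real implementation
# are omitted here.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def encode_query_ids(query, query_maxlen=32):
    ids = tokenizer(query, truncation=True, max_length=query_maxlen)["input_ids"]
    ids = ids + [tokenizer.mask_token_id] * (query_maxlen - len(ids))
    return ids
        </preformat>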
        <p>
          For retrieval, we use the PLAID [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] retrieval implementation, which uses K-means clustering
and compression to approximate and accelerate retrieval. We compress each dimension of the
document token residual vectors down to one bit, resulting in a 128-bit residual representation for
each document token.
        </p>
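        <p>The following sketch illustrates the idea of one-bit residual compression with NumPy; PLAID's actual quantizer learns per-dimension bucket boundaries, so the sign-based binarization here is a simplification of the same idea.</p>
        <preformat>
# A sketch of one-bit residual compression with NumPy. PLAID's actual
# quantizer learns bucket boundaries per dimension; keeping only the sign of
# each residual dimension is a simplification of the same idea.
import numpy as np

def compress_residuals(residuals):
    """residuals: (num_tokens, 128) float array of token vectors minus centroids."""
    bits = (residuals > 0).astype(np.uint8)  # one bit per dimension
    return np.packbits(bits, axis=1)         # (num_tokens, 16) bytes = 128 bits

def decompress_residuals(packed, scale=1.0):
    bits = np.unpackbits(packed, axis=1).astype(np.float32)
    return (2.0 * bits - 1.0) * scale        # map bits back to +/- scale
        </preformat>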
      </sec>
      <sec id="sec-4-3">
        <title>3.3. JH POLO In-Domain Retrieval Fine-tuning</title>
        <p>Training data for the CIRAL languages is quite limited. One option for new training data is
Translate-Train: translating the documents of an existing retrieval training collection, such as
MS MARCO, into the target languages. However, machine translation for the CIRAL languages
was not particularly good at the time of the evaluation. Furthermore, there is no guarantee that
the documents of an existing evaluation collection will be a good match for those of the target
collection. Creating new training examples using the target collection itself for the documents
would eliminate these problems; documents would be naturally occurring, and would therefore
not exhibit “translationese.” And there would never be a mismatch between the genre or style
of the documents in the training collection and those in the target collection.</p>
        <fig id="fig1">
          <caption><p>Figure 1: The GPT-4 prompt used to generate JH POLO training queries; «first» and «second» are replaced with the two selected documents.</p></caption>
          <preformat>You must write questions for a news quiz to appear in the newspaper. A news quiz asks about
events in the news, NOT about news articles. Here are two articles that appeared in this week's
news: «first» «second» For each article give five factual news quiz English questions, one per
line with no extraneous words, that are answered by the events described in that document and
are not answered by the events described in the other document. The quiz questions must never
refer to individual news articles, or assume the quiz-taker has seen those articles. Precede the
first five with DOCA: and the second with DOCB:</preformat>
        </fig>
        <p>
          JH POLO [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a methodology for creating such training data. It relies on the existence of a
large generative language model that includes coverage of the target language. The process
begins by selecting two documents from the target collection that have some topic overlap. One
of the documents will end up as a relevant document in a training example, and the other will
become a non-relevant document in the same example. Selecting document pairs that are closer
in meaning will lead to harder negative examples in the training examples produced.
        </p>
        <p>Once the documents have been selected, the generative language model is prompted to create
a query for which the first document is relevant and the second document is not. This query,
and the two documents, are bundled to form a single new training example. This process can
be repeated to generate as many training examples as desired.</p>
        <p>We used the JH POLO methodology to create training data for the four CIRAL languages.
We used GPT-4 as the generative language model. While GPT-4 would occasionally complain
that it was unable to handle documents in one of the CIRAL languages, in almost all cases it
would willingly process the documents without being told what language they were written in.
In addition to allowing naturally-occurring documents in the training set, this approach hits
the sweet spot of most generative LLMs: producing short, English texts. Our prompt is shown
in Figure 1. The prompt accomplishes several things:</p>
        <list list-type="bullet">
          <list-item><p>It identifies the task as question answering.</p></list-item>
          <list-item><p>It attempts to focus the questions on the content of the news articles, not on the articles
themselves. It also tries to prevent the generated queries from assuming that the searcher
knows the content of any article a priori. GPT-4 had a difficult time conforming to these
requirements.</p></list-item>
          <list-item><p>It includes the text of the two documents or passages.</p></list-item>
          <list-item><p>It asks for ten training examples for each document pair, with each document serving as
the relevant document for five of those queries.</p></list-item>
          <list-item><p>It specifies how the output should be formatted.</p></list-item>
        </list>
        <p>To select document pairs, we first used each document with more than a fixed number of
characters (the “query document”) as a query over the other documents using a BM25 sparse
retrieval model. We considered each of the top twenty documents in the resulting ranked list (the
“candidate document”), not including the query document. We eliminated from consideration
any candidate document that met any of the following criteria:</p>
        <list list-type="bullet">
          <list-item><p>the ratio of the score of the candidate document to that of the query document was greater
than 0.65;</p></list-item>
          <list-item><p>the longest common substring between the query document and the candidate document
was more than 60% of the entire candidate document;</p></list-item>
          <list-item><p>fewer than twenty characters of the candidate document were not part of the longest
common substring; or</p></list-item>
          <list-item><p>the candidate document had fewer than 150 characters.</p></list-item>
        </list>
        <p>We selected for inclusion in the training collection the pairs that were not rejected by the above
criteria and that maximized the size of the training collection, given that no document was
allowed to be part of more than one pair.</p>
        <fig id="fig2">
          <caption><p>Figure 2: GPT-4 output for a completed prompt.</p></caption>
          <preformat>{
  "id": "chatcmpl-7ruApne33HFrrRiJB25563Ctcr3Ud",
  "object": "chat.completion",
  "created": 1693081671,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "DOCA:\n1. Who reflected on President Buhari's leadership in the recent public discourse?\n2. What is the criticism regarding the opposition's treatment raised in the recent debate?\n3. What were the financial implications mentioned in recent political critiques?\n4. What are the societal issues addressed by the speaker?\n5. Who made the prayer for wisdom and understanding at the end of the recent speech?\n\nDOCB:\n1. What were the tragic aspects of Yunusa and Ese's love story mentioned in recent reports?\n2. Who recently criticized some religious leaders for committing wrongdoings?\n3. What legal judgement was recently confirmed as punishment for an offender?\n4. What issue of child exploitation came to light recently?\n5. What phrase has been adopted by vocal sympathizers to describe the prevailing situation?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 742,
    "completion_tokens": 161,
    "total_tokens": 903
  }
}</preformat>
        </fig>
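        <p>These four rejection criteria can be expressed compactly in Python; the sketch below uses difflib for the longest common substring and assumes the BM25 scores are supplied by the sparse retrieval engine used for pair selection.</p>
        <preformat>
# A sketch of the four rejection criteria above; difflib supplies the longest
# common substring, and the BM25 scores are assumed to come from the sparse
# retrieval engine used for pair selection.
from difflib import SequenceMatcher

def reject_candidate(query_doc, cand_doc, cand_score, query_score):
    """Return True if the candidate document fails any of the four criteria."""
    match = SequenceMatcher(None, query_doc, cand_doc).find_longest_match(
        0, len(query_doc), 0, len(cand_doc))
    novel_chars = len(cand_doc) - match.size
    return (
        cand_score / query_score > 0.65       # scores too close: near-duplicate risk
        or match.size > 0.60 * len(cand_doc)  # shared substring covers over 60%
        or 20 > novel_chars                   # fewer than 20 characters are novel
        or 150 > len(cand_doc)                # candidate document too short
    )
        </preformat>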
        <p>Once the document pairs were selected, we embedded the text of each document in the GPT-4
prompt and ran the prompt. In most cases, GPT-4 successfully produced ten output
queries per prompt. Figure 2 shows the GPT-4 output for a completed prompt.</p>
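        <p>A sketch of issuing the prompt with the (legacy) OpenAI Python SDK is shown below; PROMPT_TEMPLATE is a hypothetical constant holding the Figure 1 text, and the model string matches the response object in Figure 2.</p>
        <preformat>
# A sketch of issuing the Figure 1 prompt with the legacy OpenAI Python SDK.
# PROMPT_TEMPLATE is a hypothetical constant holding the Figure 1 text with
# the «first» and «second» placeholders.
import openai

def generate_quiz_queries(doc_a, doc_b):
    prompt = PROMPT_TEMPLATE.replace("«first»", doc_a).replace("«second»", doc_b)
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",  # matches the response object shown in Figure 2
        messages=[{"role": "user", "content": prompt}],
    )
    # The content interleaves DOCA:/DOCB: blocks of five questions each.
    return response["choices"][0]["message"]["content"]
        </preformat>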
        <p>We applied two forms of automated quality control to the JH POLO outputs. First, because
GPT-4 had a difficult time omitting mention of the documents in its output queries and not
assuming the user knew anything about those documents, we eliminated any query that
contained any of the words <italic>articles</italic>, <italic>reports</italic>, <italic>speaker</italic>, or <italic>these</italic>. Second, to try to eliminate examples
where the relevant and non-relevant documents were too close together, we used an mMiniLM
cross-encoder (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) to compare the query to
each of the documents; we eliminated any example where the cross-encoder score (between
0 and 1) for the positive document was not at least 0.15 above the score of the non-relevant
document. The result was a collection of 48,459 training examples over 14,323 document pairs
in the four CIRAL languages combined.</p>
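        <p>A sketch of both quality-control filters using the sentence-transformers CrossEncoder, which by default maps single-label model outputs into (0, 1) with a sigmoid, might look as follows; the banned-word check here is a simple token-level approximation.</p>
        <preformat>
# A sketch of the two QC filters described above using sentence-transformers.
from sentence_transformers import CrossEncoder

# Single-label cross-encoders return sigmoid scores in (0, 1) by default.
model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

BANNED_WORDS = {"articles", "reports", "speaker", "these"}

def keep_example(query, pos_doc, neg_doc, margin=0.15):
    """Apply both QC filters: the banned-word list and the score margin."""
    if BANNED_WORDS.intersection(query.lower().split()):
        return False
    pos_score, neg_score = model.predict([(query, pos_doc), (query, neg_doc)])
    return pos_score - neg_score >= margin
        </preformat>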
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <sec id="sec-5-1">
        <title>4.1. Unofficial Runs</title>
        <p>Based on other experiments, we discovered that the PLAID training implementation (essentially
version 3 of the ColBERT implementation) leads to degraded performance in the resulting IR
model. We retrained the models using the original ColBERT-X implementation and present the
results in Table 4. Since these runs were produced after the submission deadline, they are
not part of the pooling assessments. Therefore, only around 50% to 60% of the top 20 retrieved
documents are judged. While treating unjudged documents as non-relevant is a common
assumption in IR evaluation, this also means that the results presented in Table 3 (and other
official submissions) and Table 4 are not perfectly comparable.</p>
        <p>Based on the results in Table 4, models trained with the ColBERT-X implementation seem to be
generally more effective. While the trend of the contribution provided by each training step is
less clear, Translate-Train without MLM still provides more effective models than English-Train,
except for Somali.</p>
        <p>However, based on this set of results, the benefit of the additional MLM fine-tuning step
is smaller. In fact, the knowledge in the AfriBERTa corpus and in the machine-translated MS
MARCO seems to be contradictory. While performing only Translate-Train or only MLM fine-tuning
still leads to similar effectiveness, doing both does not give us an additional advantage.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] K. Santhanam, O. Khattab, C. Potts, M. Zaharia, PLAID: An efficient engine for late interaction retrieval, in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, 2022, pp. 1747-1756.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39-48.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Nair, E. Yang, D. Lawrie, K. Duh, P. McNamee, K. Murray, J. Mayfield, D. W. Oard, Transfer learning approaches for building cross-language dense retrieval models, in: Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part I, Springer-Verlag, Berlin, Heidelberg, 2022, pp. 382-396. URL: https://doi.org/10.1007/978-3-030-99736-6_26.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Lawrie, S. MacAvaney, J. Mayfield, P. McNamee, D. W. Oard, L. Soldaini, E. Yang, Overview of the TREC 2022 NeuCLIR track, 2023. arXiv:2304.12367.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, M. Zaharia, ColBERTv2: Effective and efficient retrieval via lightweight late interaction, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3715-3734. URL: https://aclanthology.org/2022.naacl-main.272.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. Ogueji, Y. Zhu, J. Lin, Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages, in: Proceedings of the 1st Workshop on Multilingual Representation Learning, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 116-126. URL: https://aclanthology.org/2021.mrl-1.11.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, CoRR abs/1611.09268 (2016). URL: http://arxiv.org/abs/1611.09268. arXiv:1611.09268.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440-8451. URL: https://aclanthology.org/2020.acl-main.747.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Mayfield, E. Yang, D. Lawrie, S. Barham, O. Weller, M. Mason, S. Nair, S. Miller, Synthetic cross-language information retrieval training data, arXiv preprint arXiv:2305.00331 (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000-6010.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214-2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] P. McNamee, K. Duh, An extensive exploration of back-translation in 60 languages, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 8166-8183. URL: https://aclanthology.org/2023.findings-acl.518. doi:10.18653/v1/2023.findings-acl.518.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] E. Yang, S. Nair, R. Chandradevan, R. Iglesias-Flores, D. W. Oard, C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2507-2512. URL: https://doi.org/10.1145/3477495.3531886. doi:10.1145/3477495.3531886.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>