=Paper=
{{Paper
|id=Vol-3681/T2-2
|storemode=property
|title=Extending Translate-Train for ColBERT-X to African Language CLIR
|pdfUrl=https://ceur-ws.org/Vol-3681/T2-2.pdf
|volume=Vol-3681
|authors=Eugene Yang,Dawn J. Lawrie,Paul McNamee,James Mayfield
|dblpUrl=https://dblp.org/rec/conf/fire/YangLMM23
}}
==Extending Translate-Train for ColBERT-X to African Language CLIR==
Eugene Yang, Dawn J. Lawrie, Paul McNamee and James Mayfield
Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, Maryland, USA

Abstract

This paper describes the submission runs from the HLTCOE team for the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.

Keywords: ColBERT-X, Translate-Train, PLAID, CLIR, JH POLO

1. Introduction

The HLTCOE team participated in all four CLIR tasks in the CIRAL shared task. These tasks use English queries to search Hausa, Somali, Swahili, and Yoruba documents. Our systems primarily use PLAID [1], an implementation of the ColBERT [2] retrieval architecture, which encodes each token as a vector. Prior work has demonstrated success in translating training queries and passages into the CLIR target languages when training dense CLIR retrieval models [3]. This technique, named Translate-Train, has been evaluated on widely spoken languages, such as Chinese, Persian, and Italian, for which decent machine translation models are available. In the recent TREC 2022 NeuCLIR track [4], a ColBERT model called ColBERT-X, trained with Translate-Train, was the most effective end-to-end neural dense retrieval model. (The ColBERT-X runs in NeuCLIR were contributed by the track organizers and thus marked as manual runs; although unlikely, their performance might have been affected by knowledge accessible only to the organizers.) This paper documents our attempt to adapt the Translate-Train technique to African languages, where machine translation models have generally struggled.

As summarized in Table 1, we submitted seven runs for each CLIR task, including two runs that search machine-translated documents with BM25+RM3 (Run 1) and with an English ColBERT model (Run 2) [5]. Since Yoruba was not included in the pretraining of XLM-RoBERTa, we introduced an additional masked language model (MLM) fine-tuning step using the Afriberta Corpus [6] (https://huggingface.co/datasets/castorini/afriberta-corpus) to enhance the language model for the CIRAL tasks.
Table 1: Run name description (ET = English-Train, TT = Translate-Train).

Run | Doc MT | Ret. Model | MLM FT | Ret. FT | JHPolo FT
(1) dt.bm25-rm3 | ✓ | BM25+RM3 | — | — | —
(2) dt.plaid | ✓ | ColBERT | ✗ | ET | ✗
(3) plaid-xlmr.et | ✗ | ColBERT-X | ✗ | ET | ✗
(4) plaid-xlmr.tt | ✗ | ColBERT-X | ✗ | TT | ✗
(5) plaid-xlmr.mlmfine.et | ✗ | ColBERT-X | ✓ | ET | ✗
(6) plaid-xlmr.mlmfine.tt | ✗ | ColBERT-X | ✓ | TT | ✗
(7) plaid-xlmr.mlmfine.tt.jholo | ✗ | ColBERT-X | ✓ | TT | ✓

We use MS MARCO [7] training triples with English queries and machine-translated African-language passages (Hausa, Somali, Swahili, and Yoruba) to perform Translate-Train for ColBERT-X, based both on the out-of-the-box XLM-RoBERTa-Large model [8] (Run 4) and on our MLM-fine-tuned version (Run 6). We also compare with ColBERT-X models trained on English MS MARCO, i.e., English-Trained (Runs 3 and 5), to understand the quality and usefulness of the machine-translated MS MARCO in the African languages. Finally, we experiment with a new technique, JH POLO [9], that uses large language models to generate English training queries from the retrieval collection itself to perform in-domain retrieval fine-tuning (Run 7).

2. Machine Translation

We used automated machine translation (MT) in two principal ways for the evaluation. First, we used document translation to create English-language representations of the CIRAL document collections, as this directly enables search using English queries. Second, we translated the MS MARCO passages from English into the four African languages.

Transformer-based models were trained using Amazon's Sockeye v2 toolkit [10], with training data drawn principally from the open-source OPUS repository [11]. Preprocessing steps included running the Moses tokenizer, removing duplicate lines, and learning subword units with the subword-nmt toolkit. Case was retained. Notable hyperparameters include: 6 layers in both encoder and decoder; 512-dimensional embeddings; 8 attention heads; 2,048 hidden units per layer; 30,000 subword byte pair encoding (BPE) units, learned separately for the source and target languages; a batch size of 4,096; and the Adam optimizer with an initial learning rate of 2 × 10⁻⁴.
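As an illustration of the preprocessing steps listed above, the sketch below tokenizes one side of a bitext with the Moses tokenizer, drops duplicate lines, and learns and applies 30,000 BPE merges. It is a minimal sketch assuming the sacremoses and subword-nmt Python packages; the file names are hypothetical, duplicate-line removal is simplified to a per-file filter, and the paper's actual scripts are not shown.

```python
# Minimal preprocessing sketch (hypothetical file names, simplified deduplication).
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

def moses_tokenize(in_path, out_path, lang):
    """Moses-tokenize one side of the bitext, keeping case and dropping duplicate lines."""
    tok = MosesTokenizer(lang=lang)
    seen = set()
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = tok.tokenize(line.strip(), return_str=True)
            if line and line not in seen:
                seen.add(line)
                fout.write(line + "\n")

def bpe_segment(tok_path, codes_path, out_path, merges=30_000):
    """Learn 30k BPE merges on one language side and segment that side with them."""
    with open(tok_path, encoding="utf-8") as fin, open(codes_path, "w", encoding="utf-8") as fcodes:
        learn_bpe(fin, fcodes, merges)
    with open(codes_path, encoding="utf-8") as fcodes:
        bpe = BPE(fcodes)
    with open(tok_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(bpe.segment(line.strip()) + "\n")

# BPE models are learned separately per language side, e.g. for the English side:
moses_tokenize("train.en", "train.tok.en", lang="en")
bpe_segment("train.tok.en", "codes.en", "train.bpe.en")
```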
2.1. Document Translation

When the query language is known ahead of time, it is possible to translate the documents into the query language, effectively reducing the CLIR problem to a monolingual task. Of course, the quality of automated machine translation can vary considerably, and some queries can suffer materially if named entities or other essential query elements are mistranslated. When languages have fewer resources, and when source and target languages differ in linguistic typology, translation can be challenging.

To increase the likelihood of producing better-quality document translations, we created synthetic training bitext so that our neural machine translation models would have larger quantities of data to work with. In recently published work, McNamee and Duh [12] showed that back-translation can be particularly efficacious in lower-resource settings and helps with lexical coverage in the resulting translation system. For this setup we first trained English-to-Other models and used these initial four models to back-translate 7 million sentences of web-crawled English news. For each language, these 7 million synthetic translations were then added to our human-produced training data (i.e., bitext from OPUS) to train the forward models, which were used to create English-language translations of the four African document collections.

2.2. Translating MS MARCO

MS MARCO was created to support neural IR over English texts. To support the Translate-Train approach in the cross-language setting, we wanted to produce translations of MS MARCO into Hausa, Somali, Swahili, and Yoruba. The original English dataset consists of 8,841,823 passages containing 497 million words. Table 2 shows, by language, the quantity of training bitext, translation quality scores on a commonly used benchmark, and the number of words in the translated MS MARCO dataset.

Table 2: MS MARCO translations. Shown are: (a) the size of the training bitext, in sentences; (b) MT quality as BLEU (lower-cased sacrebleu) on the FLORES-101 test set; and (c) the size, in words (millions), of the resulting translation.

Language | Bitext size | FLORES-101 BLEU | Translation size
Hausa | 2.2M | 26.1 | 576M
Somali | 786k | 13.6 | 559M
Swahili | 9.9M | 37.7 | 502M
Yoruba | 1.4M | 5.5 | 672M

3. Training Pipeline

Our full training pipeline for ColBERT-X starts from the pretrained XLM-RoBERTa-Large model, followed by masked language model (MLM) fine-tuning, retrieval fine-tuning with Translate-Train, and, finally, in-domain fine-tuning with JH POLO. This section describes each fine-tuning step.

3.1. Masked Language Model Fine-tuning

Since XLM-RoBERTa [8] pretraining does not include Yoruba, we designed a fine-tuning step to accommodate this absence. However, presenting only Yoruba text to the model during fine-tuning risks catastrophic forgetting of other language knowledge. Specifically, we would like the language model to retain knowledge of the four African languages and of the query language, English. Therefore, we present documents in Hausa, Somali, Swahili, Yoruba, and English round-robin to perform masked language model fine-tuning. We used the Common Crawl documents in the Afriberta Corpus [6] for the four African languages and collected additional English Common Crawl documents to match the genre.

We fine-tune the model for 200,000 update steps using a learning rate of 1 × 10⁻⁵ and a batch size of 48 text sequences with a maximum length of 512 tokens each. We used four NVidia A100 GPUs to train the model; fine-tuning took around 34 hours to complete.
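A minimal sketch of this round-robin MLM fine-tuning step is shown below, assuming the Hugging Face transformers and datasets libraries. The dataset configuration and column names, the English Common Crawl file, and the output directory are assumptions, and the paper's actual training code is not shown.

```python
# Round-robin MLM fine-tuning sketch over five languages (hypothetical data handling).
from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

LANGS = ["hausa", "somali", "swahili", "yoruba", "english"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

def load_lang(lang):
    # AfriBERTa Common Crawl text for the African languages (config/column names assumed);
    # an English Common Crawl file is assumed to be prepared separately in the same format.
    ds = (load_dataset("castorini/afriberta-corpus", lang, split="train")
          if lang != "english"
          else load_dataset("text", data_files="english_cc.txt", split="train"))
    return ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                  batched=True, remove_columns=ds.column_names)

# Alternate examples from the five languages so no single language dominates the stream.
train_ds = interleave_datasets([load_lang(l) for l in LANGS], stopping_strategy="all_exhausted")

args = TrainingArguments(
    output_dir="xlmr-large-ciral-mlm",
    max_steps=200_000,
    learning_rate=1e-5,
    per_device_train_batch_size=12,  # 12 sequences x 4 GPUs = 48 sequences per update
    save_steps=20_000,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15))
trainer.train()
```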
3.2. Retrieval Fine-tuning with Translate-Train

To transform a multilingual language model into a CLIR ColBERT-X model, we fine-tuned the language model using the MS MARCO small training triples with the original English queries and translated passages (Translate-Train) [3]. We evaluate Translate-Train with both the pretrained XLM-RoBERTa model and our MLM-fine-tuned language model. The model is trained with a contrastive cross-entropy loss between the positive and negative passages of each training query. For comparison, we also fine-tuned the language model with English MS MARCO without translation (English-Train).

We fine-tune the language model with the retrieval objective for 200,000 update steps with a learning rate of 5 × 10⁻⁶ and a batch size of 64 triples (query, positive passage, and negative passage). Following the ColBERT-X [3] training setup, we pad the queries to 32 tokens with [MASK] tokens. Each ColBERT-X model is trained with eight NVidia V100 GPUs for around 50 hours. For the official submissions, we used the PLAID [1] implementation of ColBERT training. However, after the submission we discovered that the ColBERT-X implementation (https://github.com/hltcoe/ColBERT-X), which is based on the ColBERT v1 [2] codebase, provides a more stable and effective training process. We therefore also report a set of unofficial runs using this implementation.
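To make the training objective concrete, the simplified PyTorch sketch below shows ColBERT-style late-interaction (MaxSim) scoring and the cross-entropy loss over each triple's positive and negative passage. It is an illustration only, not the ColBERT-X or PLAID code; the encoder helpers and tensor shapes are assumptions.

```python
# Simplified sketch of MaxSim scoring and the per-triple contrastive cross-entropy loss.
import torch
import torch.nn.functional as F

def maxsim_score(q_emb, d_emb, d_mask):
    """q_emb: [B, Lq, D]; d_emb: [B, Ld, D]; d_mask: [B, Ld].
    Sum over query tokens of the maximum cosine similarity with any document token."""
    q_emb = F.normalize(q_emb, dim=-1)
    d_emb = F.normalize(d_emb, dim=-1)
    sim = torch.einsum("bqd,bkd->bqk", q_emb, d_emb)          # token-level similarities
    sim = sim.masked_fill(~d_mask.bool().unsqueeze(1), -1e4)  # ignore document padding
    return sim.max(dim=-1).values.sum(dim=-1)                 # MaxSim, summed over query tokens

def triple_loss(query, positive, negative, encode_query, encode_doc):
    """One step on a batch of (query, positive, negative) triples.
    `encode_query` is assumed to pad/encode queries to 32 tokens ([MASK] padding, per ColBERT-X);
    `encode_doc` is assumed to return (embeddings, mask). Neither helper is shown here."""
    q = encode_query(query)                   # [B, 32, D]
    pos_emb, pos_mask = encode_doc(positive)  # [B, Ld, D], [B, Ld]
    neg_emb, neg_mask = encode_doc(negative)
    scores = torch.stack([maxsim_score(q, pos_emb, pos_mask),
                          maxsim_score(q, neg_emb, neg_mask)], dim=1)  # [B, 2]
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)  # positive = index 0
    return F.cross_entropy(scores, labels)
```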
For retrieval, we use the PLAID [1] retrieval implementation, which uses k-means clustering and compression to approximate and accelerate retrieval. We compress each dimension of the document token residual vectors down to one bit, resulting in a 128-bit residual representation for each document token.

3.3. JH POLO In-Domain Retrieval Fine-tuning

Training data for the CIRAL languages is quite limited. One option for new training data is Translate-Train: translating the documents of an existing retrieval training collection, such as MS MARCO, into the target languages. However, machine translation for the CIRAL languages was not particularly good at the time of the evaluation. Furthermore, there is no guarantee that the documents of an existing evaluation collection will be a good match for those of the target collection. Creating new training examples using the target collection itself for the documents would eliminate these problems: the documents would be naturally occurring and would therefore not exhibit “translationese,” and there would never be a mismatch between the genre or style of the documents in the training collection and those in the target collection.

JH POLO [9] is a methodology for creating such training data. It relies on the existence of a large generative language model whose coverage includes the target language. The process begins by selecting two documents from the target collection that have some topical overlap. One of the documents will end up as the relevant document in a training example, and the other will become a non-relevant document in the same example. Selecting document pairs that are closer in meaning leads to harder negative examples. Once the documents have been selected, the generative language model is prompted to create a query for which the first document is relevant and the second document is not. This query and the two documents are bundled to form a single new training example. The process can be repeated to generate as many training examples as desired.

We used the JH POLO methodology to create training data for the four CIRAL languages, with GPT-4 as the generative language model. While GPT-4 would occasionally complain that it was unable to handle documents in one of the CIRAL languages, in almost all cases it would willingly process the documents without being told what language they were written in. In addition to allowing naturally occurring documents in the training set, this approach hits the sweet spot of most generative LLMs: producing short English texts. Our prompt is shown in Figure 1.

    You must write questions for a news quiz to appear in the newspaper. A news quiz
    asks about events in the news, NOT about news articles. Here are two articles that
    appeared in this week's news:

    «first»

    «second»

    For each article give five factual news quiz English questions, one per line with no
    extraneous words, that are answered by the events described in that document and are
    not answered by the events described in the other document. The quiz questions must
    never refer to individual news articles, or assume the quiz-taker has seen those
    articles. Precede the first five with DOCA: and the second with DOCB:

Figure 1: GPT-4 prompt used to create JH POLO training examples.

The prompt accomplishes several things:

• It identifies the task as question answering.
• It attempts to focus the questions on the content of the news articles, not on the articles themselves.
• It tries to prevent the generated queries from assuming that the searcher knows the content of any article a priori. GPT-4 had a difficult time conforming to these requirements.
• It includes the text of the two documents or passages.
• It asks for ten training examples for each document pair, with each document serving as the relevant document for five of those queries.
• It specifies how the output should be formatted.

To select document pairs, we first used each document with more than a fixed number of characters (the “query document”) as a query over the other documents using a BM25 sparse retrieval model. We considered each of the top twenty documents in the resulting ranked list (the “candidate documents”), not including the query document itself. We eliminated from consideration any candidate document that met any of the following criteria:

• the ratio of the score of the candidate document to that of the query document was greater than 0.65;
• the longest common substring between the query document and the candidate document was more than 60% of the entire candidate document;
• fewer than twenty characters of the candidate document were not part of the longest common substring; or
• the candidate document had fewer than 150 characters.

From the pairs not rejected by the above criteria, we selected pairs for inclusion so as to maximize the size of the training collection, subject to the constraint that no document was allowed to be part of more than one pair.

Once the document pairs were selected, we embedded the text of each document in the GPT-4 prompt and ran the prompt. In most cases, GPT-4 successfully produced ten output queries per prompt. Figure 2 shows the GPT-4 output for a completed prompt.

```json
{
  "id": "chatcmpl-7ruApne33HFrrRiJB25563Ctcr3Ud",
  "object": "chat.completion",
  "created": 1693081671,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "DOCA:\n1. Who reflected on President Buhari's leadership in the recent public discourse?\n2. What is the criticism regarding the opposition's treatment raised in the recent debate?\n3. What were the financial implications mentioned in recent political critiques?\n4. What are the societal issues addressed by the speaker?\n5. Who made the prayer for wisdom and understanding at the end of the recent speech?\n\nDOCB:\n1. What were the tragic aspects of Yunusa and Ese's love story mentioned in recent reports?\n2. Who recently criticized some religious leaders for committing wrongdoings?\n3. What legal judgement was recently confirmed as punishment for an offender?\n4. What issue of child exploitation came to light recently?\n5. What phrase has been adopted by vocal sympathizers to describe the prevailing situation?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 742,
    "completion_tokens": 161,
    "total_tokens": 903
  }
}
```

Figure 2: GPT-4 output used to create JH POLO training examples.

We applied two forms of automated quality control to the JH POLO outputs. First, because GPT-4 had a difficult time omitting mention of the documents in its output queries and not assuming the user knew anything about those documents, we eliminated any query that contained any of the words articles, reports, speaker, or these. Second, to try to eliminate examples where the relevant and non-relevant documents were too close together, we used an mMiniLM cross-encoder (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) to compare the query to each of the documents, and we eliminated any example where the cross-encoder score (between 0 and 1) of the positive document was not at least 0.15 above the score of the non-relevant document. The result was a collection of 48,459 training examples over 14,323 document pairs in the four CIRAL languages combined.
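A minimal sketch of these two filters is shown below, assuming the sentence-transformers CrossEncoder API; the function and variable names are illustrative rather than taken from the actual tooling.

```python
# Sketch of the JH POLO quality-control filters: banned-word check and cross-encoder margin.
import re
from sentence_transformers import CrossEncoder

BANNED = {"articles", "reports", "speaker", "these"}
MARGIN = 0.15

scorer = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def keep_example(query, positive_doc, negative_doc):
    """Return True if a generated (query, positive, negative) example passes both filters."""
    # Filter 1: drop queries that refer to the documents themselves.
    if any(word in BANNED for word in re.findall(r"[a-z]+", query.lower())):
        return False
    # Filter 2: the positive document must outscore the negative by at least 0.15.
    pos_score, neg_score = scorer.predict([(query, positive_doc), (query, negative_doc)])
    return pos_score >= neg_score + MARGIN

# Hypothetical usage over generated examples:
# filtered = [ex for ex in examples if keep_example(ex["query"], ex["pos"], ex["neg"])]
```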
4. Results

Table 3 summarizes the results of our official submissions. For all four languages, an English ColBERT model indexing the machine-translated documents (Run 2) provides the most effective retrieval in both nDCG@20 and R@100. For Somali, this model achieved the best results among all submissions. The submitted ColBERT-X models are not as effective as the monolingual English one.

Consistent with prior work, models trained with Translate-Train are more effective than those trained with English-Train. For Yoruba, the difference without MLM fine-tuning (Runs 3 and 4) is substantially larger, since Translate-Train is the only step that introduces Yoruba text to the model during training.

We observe that MLM fine-tuning is generally helpful even when followed by Translate-Train, which also conveys African-language knowledge to the model. The MLM-fine-tuned model followed by English-Train is about as effective as directly performing Translate-Train on the out-of-the-box XLM-RoBERTa-Large model. This observation aligns with prior work on continued language model fine-tuning for CLIR [13], where the authors found that an effective language model fine-tuning step can replace Translate-Train. Furthermore, performing Translate-Train on top of the MLM-fine-tuned language model leads to still better effectiveness, suggesting that all training steps contribute to the final retrieval effectiveness.

However, the in-domain JH POLO fine-tuning does not seem to be helpful. For Hausa, Somali, and Swahili, fine-tuning on an additional 1,000 JH POLO training examples degraded performance. For Yoruba, however, both nDCG@20 and R@100 improved after JH POLO fine-tuning. We hypothesize that this is due to the additional Yoruba text presented to the model: since the model has seen only a small amount of Yoruba text, any additional exposure to Yoruba is helpful.

Table 3: Official runs from the HLTCOE team, evaluated by the track organizers. Scores are listed per language in the order Hausa / Somali / Swahili / Yoruba (DT = document translation, ET = English-Train, TT = Translate-Train).

Run | MLM | Ret. FT | JH POLO | nDCG@20 (Ha / So / Sw / Yo) | R@100 (Ha / So / Sw / Yo)
Max | | | | 0.5700 / 0.5118 / 0.5232 / 0.5819 | 0.5902 / 0.6436 / 0.5956 / 0.8057
Median | | | | 0.2530 / 0.2445 / 0.2447 / 0.3138 | 0.3576 / 0.3083 / 0.3340 / 0.5037
Mean | | | | 0.2690 / 0.2403 / 0.2644 / 0.3115 | 0.3598 / 0.3265 / 0.3249 / 0.5091
(1) DT » BM25+RM3 | | | | 0.2015 / 0.2550 / 0.2178 / 0.3555 | 0.3359 / 0.4210 / 0.3340 / 0.6273
(2) DT » English ColBERT | | | | 0.4743 / 0.5118 / 0.4932 / 0.4793 | 0.5733 / 0.6436 / 0.5956 / 0.7240
ColBERT-X submission models:
(3) | ✗ | ET | ✗ | 0.3481 / 0.2915 / 0.3182 / 0.2627 | 0.5237 / 0.4332 / 0.4616 / 0.5176
(4) | ✗ | TT | ✗ | 0.3557 / 0.2878 / 0.3347 / 0.3522 | 0.5083 / 0.4373 / 0.4510 / 0.5784
(5) | ✓ | ET | ✗ | 0.3488 / 0.2760 / 0.3301 / 0.3804 | 0.5326 / 0.4260 / 0.4487 / 0.6950
(6) | ✓ | TT | ✗ | 0.4335 / 0.3366 / 0.4230 / 0.4189 | 0.5256 / 0.4534 / 0.4645 / 0.6394
(7) | ✓ | TT | ✓ | 0.3601 / 0.3117 / 0.4081 / 0.4297 | 0.4829 / 0.4277 / 0.4477 / 0.6748
Table 4: Unofficial runs, scored by the HLTCOE team using the qrels provided by the organizers. Scores are listed per language in the order Hausa / Somali / Swahili / Yoruba.

MLM | Ret. FT | JHPolo | nDCG@20 (Ha / So / Sw / Yo) | Judged@20 (Ha / So / Sw / Yo)
✗ | ET | ✗ | 0.3516 / 0.3080 / 0.3064 / 0.2534 | 0.5225 / 0.4763 / 0.5271 / 0.4995
✗ | TT | ✗ | 0.3722 / 0.2994 / 0.3766 / 0.3625 | 0.5925 / 0.5354 / 0.6165 / 0.6490
✓ | ET | ✗ | 0.3751 / 0.3018 / 0.3179 / 0.4097 | 0.5563 / 0.4848 / 0.5382 / 0.6475
✓ | TT | ✗ | 0.3450 / 0.2513 / 0.3093 / 0.3863 | 0.5562 / 0.4515 / 0.5124 / 0.6535
✓ | ET | ✓ | 0.3451 / 0.3069 / 0.3083 / 0.4105 | 0.5369 / 0.4914 / 0.5147 / 0.6665
✓ | TT | ✓ | 0.2957 / 0.2276 / 0.2878 / 0.4168 | 0.4406 / 0.3884 / 0.4682 / 0.6400

4.1. Unofficial Runs

Based on other experiments, we discovered that the PLAID training implementation (essentially version 3 of the ColBERT implementation) leads to degraded performance in the resulting IR model. We therefore retrained the models using the original ColBERT-X implementation and present the results in Table 4. Since these runs were produced after the submission deadline, they are not part of the pooling assessment; as a result, only around 50% to 60% of the top 20 retrieved documents are judged. While treating unjudged documents as non-relevant is a common assumption in IR evaluation, this also means that the results in Table 3 (and those of other official submissions) and Table 4 are not perfectly comparable.

Based on the results in Table 4, models trained with the ColBERT-X implementation are generally more effective. While the contribution of each training step is less clear in this setting, Translate-Train without MLM fine-tuning still yields more effective models than English-Train, except for Somali. However, the benefit of the additional MLM fine-tuning step is smaller in these results. In fact, the knowledge from the Afriberta Corpus and from the machine-translated MS MARCO appears to conflict: performing either Translate-Train or MLM fine-tuning alone leads to similar effectiveness, but doing both provides no additional advantage.
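For reference, scoring one of these runs against the organizer-provided qrels can be sketched as follows, assuming the ir_measures package and TREC-format qrels and run files; the file names are hypothetical.

```python
# Sketch of scoring a run with the organizer-provided qrels (hypothetical file names).
import ir_measures
from ir_measures import nDCG, R, Judged

qrels = list(ir_measures.read_trec_qrels("ciral-hausa.qrels"))
run = list(ir_measures.read_trec_run("plaid-xlmr.tt.hausa.trec"))

# Unjudged documents count as non-relevant for nDCG@20 and R@100;
# Judged@20 reports the fraction of the top 20 that was actually assessed.
print(ir_measures.calc_aggregate([nDCG@20, R@100, Judged@20], qrels, run))
```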
References

[1] K. Santhanam, O. Khattab, C. Potts, M. Zaharia, PLAID: An efficient engine for late interaction retrieval, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 1747–1756.
[2] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[3] S. Nair, E. Yang, D. Lawrie, K. Duh, P. McNamee, K. Murray, J. Mayfield, D. W. Oard, Transfer learning approaches for building cross-language dense retrieval models, in: Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I, Springer-Verlag, Berlin, Heidelberg, 2022, pp. 382–396. URL: https://doi.org/10.1007/978-3-030-99736-6_26.
[4] D. Lawrie, S. MacAvaney, J. Mayfield, P. McNamee, D. W. Oard, L. Soldaini, E. Yang, Overview of the TREC 2022 NeuCLIR track, 2023. arXiv:2304.12367.
[5] K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, M. Zaharia, ColBERTv2: Effective and efficient retrieval via lightweight late interaction, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3715–3734. URL: https://aclanthology.org/2022.naacl-main.272.
[6] K. Ogueji, Y. Zhu, J. Lin, Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages, in: Proceedings of the 1st Workshop on Multilingual Representation Learning, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 116–126. URL: https://aclanthology.org/2021.mrl-1.11.
[7] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, CoRR abs/1611.09268 (2016). URL: http://arxiv.org/abs/1611.09268. arXiv:1611.09268.
[8] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747.
[9] J. Mayfield, E. Yang, D. Lawrie, S. Barham, O. Weller, M. Mason, S. Nair, S. Miller, Synthetic cross-language information retrieval training data, arXiv preprint arXiv:2305.00331 (2023).
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.
[11] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
[12] P. McNamee, K. Duh, An extensive exploration of back-translation in 60 languages, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 8166–8183. URL: https://aclanthology.org/2023.findings-acl.518. doi:10.18653/v1/2023.findings-acl.518.
[13] E. Yang, S. Nair, R. Chandradevan, R. Iglesias-Flores, D. W. Oard, C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2507–2512. URL: https://doi.org/10.1145/3477495.3531886. doi:10.1145/3477495.3531886.