Overview of the CIRAL Track at FIRE 2023: Cross-lingual Information Retrieval for African Languages

Mofetoluwa Adeyemi1, Akintunde Oladipo1, Xinyu Crystina Zhang1, David Alfonso-Hermelo2, Mehdi Rezagholizadeh2, Boxing Chen2 and Jimmy Lin1
1 University of Waterloo
2 Huawei Noah's Ark Lab

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
moadeyem@uwaterloo.ca (M. Adeyemi); aooladipo@uwaterloo.ca (A. Oladipo); x978zhan@uwaterloo.ca (X. C. Zhang); david.alfonso.hermelo@huawei.com (D. Alfonso-Hermelo); mehdi.rezagholizadeh@huawei.com (M. Rezagholizadeh); boxing.chen@huawei.com (B. Chen); jimmylin@uwaterloo.ca (J. Lin)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
This paper provides an overview of the first CIRAL track at the Forum for Information Retrieval Evaluation 2023. The goal of CIRAL is to promote the research and evaluation of cross-lingual information retrieval for African languages. With the intent of curating a human-annotated test collection through community evaluations, our track entails retrieval between English and four African languages: Hausa, Somali, Swahili, and Yoruba. We discuss the cross-lingual information retrieval task, the curation of the test collection, participation, and evaluation results. We provide an analysis of the curated pools and compare the effectiveness of the submitted retrieval methods. The CIRAL track demonstrated and encouraged the research prospects that exist for CLIR in African languages, and we are hopeful about the directions it opens.

Keywords
Cross-lingual Information Retrieval, African Languages, Ad-hoc Retrieval, Passage Ranking, Community Evaluations

1. Introduction
Cross-lingual information retrieval (CLIR) is a specific category of multilingual retrieval in which documents are retrieved in a language different from that of the query. It plays an important role in accessing information that is mostly available in the document's language. Efforts in CLIR go as far back as the early 1990s, starting with the Text REtrieval Conference (TREC) [1], and have since extended to other evaluation forums including CLEF [2], FIRE [3], and NTCIR [4], which focus on European languages, Indian languages, and East Asian languages, respectively. Tracks dedicated to cross-lingual information retrieval at these forums, such as the NeuCLIR track [5] at TREC, are venues that promote the participation and evaluation of these groups of languages in CLIR. However, there is a lag in such research involvement for African languages. There are also few resources for studying CLIR in African languages, despite growing research efforts on these languages in cross-lingual information retrieval [6, 7, 8, 9]. AfriCLIRMatrix [10] is the first CLIR dataset in African languages; it is built synthetically from Wikipedia's structure and covers 15 African languages. Other collections, such as the large-scale CLIR collection of Sasaki et al. [11], CLIRMatrix [12], and the IARPA MATERIAL test collection [13], which was curated solely for low-resource languages, cover only a small number of African languages. The sparsity of resources, and the bid to promote participation in CLIR research for African languages, call for the construction of CIRAL, which stands for Cross-lingual Information Retrieval for African Languages.
The CIRAL track hosted at the Forum for Information Retrieval Evaluation (FIRE) focused on cross-lingual passage retrieval covering four African languages: Hausa, Somali, Swahili, and Yoruba, which are among the most widely spoken languages in Africa. Given the low-resource nature of African languages, even in widely used sources like Wikipedia, CIRAL's collection is built with articles from the indigenous news domain of the respective languages. Similar to the passage ranking task in TREC's Deep Learning track [14], relatively few queries (80 to 100 per language) are developed for the task. The queries and judgments are produced by native speakers, who took on the roles of query developers and relevance assessors. As is common practice in community evaluations, CIRAL also set out to curate a test collection by pooling submissions from track participants. In hosting CIRAL, we are interested in: 1) the effectiveness of indigenous textual data for CLIR in African languages, 2) a comparison of how well different retrieval methods perform in CLIR for African languages, and 3) the importance of retrieval and participation diversity. An overview of the curated collection, the query development, and the relevance assessment process is provided, and the results from relevance assessment demonstrate the effectiveness of retrieving relevant passages from indigenous sources. Participation in the track and submissions for the respective languages are also discussed, comparing the different retrieval methods employed in the task. Taking into consideration participation, submissions, and other factors, we examine the test collection curated from the task, which informs future decisions in community evaluations for African languages. We hope CIRAL fosters CLIR evaluation and research in African languages and in low-resource settings, and hence the development of retrieval systems that are well suited for such tasks. Details of the track are also available on the track website.1

1 https://ciralproject.github.io/

2. Task Description
The focus task at CIRAL was cross-lingual passage ranking between English and African languages. For this first edition, only four African languages were included: Hausa, Somali, Swahili, and Yoruba, selected according to the number of native speakers of the languages in East and West Africa. All four languages are written in Latin script, with two belonging to the Afro-Asiatic language family and the other two to the Niger-Congo family. Details of the languages are given in Table 1. We choose English as the pivot language as it is an official language in the countries where these African languages are spoken, with the exception of Somali, whose speakers lean more towards Arabic than English. Given English queries, participants are tasked with developing retrieval systems that return ranked passages in the African languages according to their estimated likelihood of relevance.

Table 1: Details of the African languages in the CIRAL task.
Language | Language Family | Region | # Speakers | Script
Hausa | Afro-Asiatic | West Africa | 77M | Latin
Somali | Afro-Asiatic | East Africa | 22M | Latin
Swahili | Niger-Congo | East Africa | 83M | Latin
Yoruba | Niger-Congo | West Africa | 55M | Latin

Table 2: Collection details for each language in CIRAL. The median and average number of tokens per passage give an idea of the distribution of passages in each language. The table also shows the number and some sources of the news articles collected for each language.
Language | # of Passages | Median Tokens per Passage | Avg Tokens per Passage | # News Articles | News Sources
Hausa | 715,355 | 144 | 135.29 | 240,883 | LegitNG, DailyTrust, VOA, Isyaku, etc.
Somali | 827,552 | 131 | 126.13 | 629,441 | VOA, Tuko, Risaala, Caasimada, etc.
Swahili | 949,013 | 129 | 126.71 | 146,669 | VOA, UN Swahili, MTanzania, etc.
Yoruba | 82,095 | 168 | 167.94 | 27,985 | Alaroye, VON, BBC, Asejere, etc.
Queries are formulated as natural language questions, and passages are judged using binary relevance: 0 for non-relevant and 1 for relevant. A relevant passage is defined as one that answers the question, whereas a non-relevant passage does not. To facilitate the development and evaluation of their retrieval systems, participants were provided with a training set comprising a sample of 10 queries for each language, their relevance judgments, and the passage collection for the languages. Considering the nature of the task, we evaluate for early precision and recall using metrics such as nDCG@20 and Recall@100, and participants were made aware of these metrics when developing their systems. For evaluation, the test set of queries was provided, and the submitted runs were manually judged to form query pools. Participants were also encouraged to rank their submitted runs in the order in which they preferred them to contribute to the pools. Details on the provided passage collection, the development of queries, and the pooling process are discussed in the following sections.

3. Passage Collection
CIRAL's passage collection is curated from indigenous news websites and blogs for each of the four languages. These sites serve as a source of local and international information and, as shown in Table 2, are a substantial source of text for their languages. The articles are collected using Otelemuye,2 a web scraping framework, and combined into monolingual document sets. The collected articles date from as early as was available on each website (the early 2000s for some languages) up until March 2023. Passages are generated from each set by chunking news articles at the sentence level using sliding-window segmentation [15]. To ensure natural discourse segments when chunking the articles, a stride of 3 sentences is used with a maximum of 6 sentences per window. The resulting passages are further filtered to remove those with fewer than 7 or more than 200 words. Table 2 shows the median and average number of tokens per passage in each language, providing more insight into their passage distributions. To ensure each passage is in its required language, we filter using the language's list of stopwords, thereby removing passages in a different language; a minimum of 3 to 5 stopwords was required to ascertain that a passage was in its African language. The resulting number of passages is shown in Table 2.

2 https://github.com/theyorubayesian/otelemuye

The curated passages are provided in JSONL files, with each line representing a JSON object describing a passage. Passages have the following fields:
• docid: Unique identifier.
• title: The headline of the news article from which the passage was obtained.
• text: The passage body.
• url: The link to the news article from which the passage was obtained.
The unique identifier (docid) for each passage is constructed programmatically in the format source#article_id#passage_id, providing information on the news website and the specific article in the monolingual set from which the passage was extracted. This is also helpful because a few news articles have no titles, leaving the corresponding passages with an empty title field.
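To make the collection construction concrete, the sketch below illustrates the sliding-window segmentation, the length and stopword filters, and the docid format described above. It is a minimal illustration rather than the exact Otelemuye/CIRAL pipeline: the sentence-splitting regex, the helper names (segment_article, keep_passage, to_jsonl_records), and the toy Swahili stopword list are our own assumptions, while the stride of 3 sentences, the 6-sentence window, the 7 to 200 word bounds, the minimum stopword count, and the source#article_id#passage_id docid format follow the description in the text.

```python
import json
import re

# Tiny illustrative subset of Swahili stopwords; the track used per-language stopword lists.
SWAHILI_STOPWORDS = {"na", "ya", "wa", "kwa", "ni", "za", "katika", "la", "hii", "kuwa"}

def segment_article(text, window=6, stride=3):
    """Sliding-window segmentation: windows of up to `window` sentences, advanced by `stride`."""
    # Naive sentence split on ., ! and ?; the actual pipeline may use a proper sentence splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return passages

def keep_passage(passage, stopwords, min_words=7, max_words=200, min_stopwords=3):
    """Length filter (7-200 words) plus the stopword-based language filter."""
    words = passage.lower().split()
    if not (min_words <= len(words) <= max_words):
        return False
    return sum(word in stopwords for word in words) >= min_stopwords

def to_jsonl_records(article_text, source, article_id, title, url, stopwords):
    """Emit passage records with docids of the form source#article_id#passage_id."""
    records = []
    for passage_id, passage in enumerate(segment_article(article_text)):
        if keep_passage(passage, stopwords):
            records.append({
                "docid": f"{source}#{article_id}#{passage_id}",
                "title": title,
                "text": passage,
                "url": url,
            })
    return records

# Toy repeated sentence standing in for a real news article.
article = "Habari ya leo ni kuhusu mkutano mkubwa wa viongozi. " * 4
for record in to_jsonl_records(article, "voa", 12, "Example headline", "https://example.com", SWAHILI_STOPWORDS):
    print(json.dumps(record, ensure_ascii=False))
```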
The passage collection files were made publicly available to participants in a Hugging Face dataset repository.3

3 https://huggingface.co/datasets/CIRAL/ciral-corpus

4. Query Development
Queries for the task are formulated as natural language questions, modelled after collections such as MS MARCO [16], which is used in TREC's Deep Learning track [14], and the MIRACL dataset [17], among others. Considering that the passage collection for a language was curated from its indigenous websites, the queries had to have topics either of interest to speakers of the language or with information that can easily be found in the language's news. We term these language/cultural-specific queries,4 which are a combination of queries with generic and indigenous topics depending on the language. The language-specific queries are developed as factoids to ensure answers are direct and unambiguous.

4 The generated queries also include some which are generic, but we term the queries language-specific because the news data also captures events which are mostly of interest to speakers of the language.

The process of query development involved native speakers generating questions with answers in the language's news. For this task, articles from the MasakhaNews dataset [18] are used as a source of inspiration for query formation. MasakhaNews is a news topic classification dataset covering 16 African languages. It serves as a good starting point given that the documents in the dataset have been classified into categories, namely business, entertainment, health, politics, religion, sport, and technology, providing a more direct approach for generating diverse queries. Using the same passage preprocessing implemented in generating the passage collection, articles in MasakhaNews are chunked into passages, but with an additional category field to jointly inspire queries. Query developers (interchangeably called annotators) are native speakers of the languages with reading and writing proficiency in both English and their respective African languages. To generate the queries, annotators are given the MasakhaNews passage snippets and tasked with generating questions that are inspired, but not answerable, by the snippet, to ensure good-quality queries. The questions are generated in the African language and then translated into English by the annotator. To ensure that generated queries had relevant passages, the annotators checked whether an inspired question had passages answering it in the CIRAL collection. This was done via a search interface developed as a Hugging Face space for each language using Spacerini.5 As shown in Figure 1, the annotators provide the query in its African language, its English translation, and the docid of the passage that inspired it; search was monolingual, i.e., using the query in its African language. Using a hybrid of BM25 [19] and AfriBERTa DPR indexes,6 the top 20 retrieved documents were annotated for relevance with selections of either true or false.

5 https://github.com/castorini/hf-spacerini
6 https://huggingface.co/castorini/afriberta-dpr-ptf-msmarco-ft-latin-mrtydi

Figure 1: The interface of the Hugging Face space used to search for relevant passages in the query development process. Annotators are able to select their names and input the in-language query, its English translation, and the docid of the passage that inspired it. This is the interface for the Somali space.

Relevance annotation was done as follows (a sketch of this kind of hybrid score fusion is shown after the list):
• Relevant (True): The annotator selected true if the passage answered the question or implied the answer without doubt.
• Non-relevant (False): The annotator selected false if the passage did not answer the question.
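The track description does not specify how the BM25 and AfriBERTa DPR scores were combined in the annotation interface, so the sketch below shows one common way such a hybrid could be realized: min-max normalization of each ranked list followed by linear interpolation, producing the top-20 candidates shown to annotators. The function names, the toy score dictionaries, and the interpolation weight alpha are illustrative assumptions, not the interface's actual implementation.

```python
def min_max_normalize(scores):
    """Scale a {docid: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {docid: 1.0 for docid in scores}
    return {docid: (s - lo) / (hi - lo) for docid, s in scores.items()}

def hybrid_fusion(sparse_scores, dense_scores, alpha=0.5, k=20):
    """Interpolate normalized sparse (BM25) and dense (DPR) scores and return the top-k docids.

    Documents missing from one ranked list contribute 0 from that component.
    """
    sparse = min_max_normalize(sparse_scores)
    dense = min_max_normalize(dense_scores)
    fused = {
        docid: alpha * dense.get(docid, 0.0) + (1 - alpha) * sparse.get(docid, 0.0)
        for docid in set(sparse) | set(dense)
    }
    return sorted(fused, key=fused.get, reverse=True)[:k]

# Toy scores a BM25 index and a DPR index might return for one in-language query.
bm25_hits = {"voa#12#0": 11.2, "voa#12#1": 9.8, "tuko#3#4": 7.5}
dpr_hits = {"voa#12#1": 78.1, "risaala#9#2": 75.4, "voa#12#0": 70.3}
print(hybrid_fusion(bm25_hits, dpr_hits, alpha=0.5, k=20))
```

Equal weighting of the two components is used here purely for illustration; in practice such a weight would typically be tuned on development queries.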
Instances where a passage gave a partial or incomplete answer to the question also occurred; depending on the level of incompleteness, the annotators judged such passages as non-relevant. Passages annotated as true in the interface were assigned a relevance of 1 and those annotated as false a relevance of 0. Queries retained and distributed in the task had at least 1 and no more than 15 relevant passages, to avoid queries that are too easy for the systems. Ambiguous or incomprehensible queries were also filtered out of the collection. A set of 10 queries for each language was first developed and released, along with the corresponding judgments, as training samples. Subsequently, the test queries, on which the pooling process was to be carried out, were released: 85 for Hausa, 100 for Somali, 85 for Swahili, and 100 for Yoruba, as presented in Table 3. Judgments obtained during the query development process are referred to as shallow, considering they are few. The number of shallow judgments obtained through the query development process for the test queries is also shown in Table 3; these judgments are retained in the pools formed during relevance assessment. The timelines at which each set was released, along with the run submission and result distribution dates, are provided in Table 4.

Table 3: Statistics of CIRAL's queries and judgments. 10 queries were released for each language as training samples along with their judgments.
Language | # Dev Queries | # Dev Judg. | # Test Queries | # Test Judg.
Hausa | 10 | 165 | 85 | 1,523
Somali | 10 | 187 | 100 | 1,728
Swahili | 10 | 196 | 85 | 1,656
Yoruba | 10 | 185 | 100 | 1,921

Table 4: Track timeline showing the release dates of datasets, the run submission deadline, and result distribution.
Date | Event
13th July 2023 | Hausa and Yoruba Training Data Released
6th Aug 2023 | Somali and Swahili Training Data Released
21st Aug 2023 | Test Data Released
10th Sep 2023 | Run Submission Deadline
26th Sep 2023 | Distribution of Results

5. Relevance Assessment
As is often practised in community evaluations, runs submitted for the test set are manually judged to form the test collection's qrels via pooling. A total of 84 submissions were made by the participating teams, 21 for each language. Using the ranked lists of runs provided by the teams, query pools were formed for each language; we provide details on the relevance assessment process and an analysis of the pools in this section.

5.1. Pooling Process
The top 3 ranked submissions from the participating teams contributed to the pooling process, with subsequent additions depending on available time and assessment resources. A total of 40 runs contributed to the pools across the four languages, with dense models making up most of the contributing runs. The pool depth for submissions was kept at a constant of k = 20; however, there was no restriction on the total size of the pools.

Table 5: The minimum and maximum pool size per query across the languages. Certain queries do not receive any more contributing passages to their pools and plateau at pool sizes of 40 to 60, while some queries have pool sizes of more than 100.
 | Hausa | Somali | Swahili | Yoruba
Minimum across queries | 47 | 53 | 58 | 43
Maximum across queries | 117 | 117 | 126 | 125
Total pool size | 7,288 | 9,094 | 8,079 | 8,311
Table 6: Statistics of relevant passages in the curated pools.
 | Hausa | Somali | Swahili | Yoruba
Minimum | 1 | 1 | 1 | 1
Maximum | 81 | 65 | 71 | 61
Mean | 24 | 21 | 28 | 14
Median | 20 | 19 | 29 | 10
Total | 1,918 | 2,030 | 2,386 | 1,397

Judgments were carried out by two assessors for each language, where an assessor judged the full pool of a given query; the test set was split into two distinct halves, each assigned to one assessor. Assessors provided judgments on a binary scale using the following descriptions:
• Relevant: The passage answers the query, or the answer can be very easily implied from the passage.
• Non-relevant: The passage does not answer the question at all, or is related to the question but does not answer it.
Relevant passages are given a judgment of 1 and non-relevant passages a judgment of 0.

5.2. Pool Analysis
The total pool size obtained for each language from the relevance assessment is presented in Table 5. This includes the shallow judgments obtained during query development, which were re-assessed during the relevance assessment phase. Across the languages, the minimum number of judgments per query ranges from 40 to 60, while some queries have over 120 judgments. Three queries in Hausa, 4 in Somali, 2 in Swahili, and 12 in Yoruba have pool sizes of fewer than 60 passages, indicating that the contributing runs retrieved similar sets of passages for these queries in their top 20 ranks. The runs that contributed to the pooling process also retrieved many relevant passages across the four languages, as seen in Table 6. However, certain queries were found to have no relevant passages and were discarded. This was a result of wrongly annotated passages from the query development phase, or of grammatical errors which affected retrieval results. This left Hausa with 80 test queries as opposed to the initial 85, and Somali with 99 as opposed to 100. There were also a few queries with just 1 relevant passage across the languages, with Yoruba having the most at 5 queries.

The large number of relevant passages obtained from the pooling process is a good indication that African indigenous websites are a great source for retrieval, especially when coupled with queries of interest to the language speakers, which can also include generic topics. Table 6 also indicates that a large number of relevant passages were obtained for certain queries. Considering the small number of runs that contributed to the pools, this raises the concern that more relevant passages might remain unjudged, especially for runs that did not contribute to the pooling process or are evaluated after the track. We analyse the number of queries with the highest tendency of having unjudged relevant passages using relevance densities. The relevance density of a query is its number of relevant passages divided by its pool size, and we adopt a rule of thumb that queries with relevance densities of 0.6 and higher very likely still have unjudged relevant passages. Figure 2 gives the distribution of relevance densities for each language, and we find that the number of queries with densities higher than 0.6 is less than 5 in every language. There is also a larger number of queries with densities between 0.4 and 0.6, with Swahili and Hausa having up to 22 to 25% of their queries in this range. Although this approach to analysing the completeness of the judgments is not exhaustive, it provides some insight into the number of queries in each language that would most likely have unjudged relevant passages from new systems.

Figure 2: Distribution of relevance densities among the queries in each language. Across the languages, queries with densities of 0.6 and above are relatively few, with Yoruba having none.
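The pool construction of Section 5.1 and the relevance-density analysis above reduce to a few lines of code. The sketch below is a minimal illustration, assuming runs are exchanged in TREC-style format (qid, Q0, docid, rank, score, tag) and that qrels are binary; the helper names and toy data are hypothetical, while the pool depth of 20 and the 0.6 density rule of thumb follow the text.

```python
from collections import defaultdict

def read_run(path):
    """Parse a TREC-style run file (qid Q0 docid rank score tag) into {qid: [docids by rank]}."""
    ranked = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, _, _ = line.split()
            ranked[qid].append((int(rank), docid))
    return {qid: [docid for _, docid in sorted(hits)] for qid, hits in ranked.items()}

def build_pools(runs, depth=20):
    """Union of the top-`depth` docids of every contributing run, per query."""
    pools = defaultdict(set)
    for run in runs:
        for qid, docids in run.items():
            pools[qid].update(docids[:depth])
    return pools

def relevance_density(qrels):
    """Relevance density per query: number of relevant passages divided by judged pool size."""
    return {qid: sum(judgments.values()) / len(judgments) for qid, judgments in qrels.items() if judgments}

# Toy ranked lists from two hypothetical contributing runs (real runs would come from read_run).
run_a = {"41": ["voa#12#0", "voa#12#1", "tuko#3#4"]}
run_b = {"41": ["voa#12#1", "risaala#9#2"]}
pools = build_pools([run_a, run_b], depth=20)

# Binary qrels after assessment: {qid: {docid: 0 or 1}}.
qrels = {"41": {"voa#12#0": 1, "voa#12#1": 1, "tuko#3#4": 0, "risaala#9#2": 1}}
densities = relevance_density(qrels)
likely_incomplete = [qid for qid, d in densities.items() if d >= 0.6]  # the 0.6 rule of thumb
print(dict(pools), densities, likely_incomplete)
```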
6. Results and Analysis
This section provides an overview of participants' submissions and the results obtained from evaluating the submitted runs on the pooled qrels. Results are also analysed at the query level to identify query difficulty, as well as the effectiveness of the submitted runs and model types. A total of 3 teams participated in the CIRAL track, with 84 runs submitted. Considering that cross-lingual passage ranking was the major task, participants were not given any specifications on the retrieval type to employ, and submissions comprised dense (52), reranking (20), hybrid (8), and sparse (4) methods. All submissions covered the four languages; hence there is an equal number of runs for each language. The retrieval methods employed by the participating teams are discussed in detail in their working notes.

6.1. Overall Results
We present summary statistics for all languages in Table 7, and the detailed results of all submitted runs in Tables 8, 9, 10 and 11. The nDCG@20, MRR@10, Recall@100, and MAP scores for each submission are reported, and the mean and maximum scores can be found in Table 7. The main metric in the task is nDCG@20; a cut-off of k = 20 is used considering that a decent number of queries had more than 10 relevant passages during query development.

Table 7: Mean and maximum scores across all runs.
Language | nDCG@20 Mean | nDCG@20 Max | MRR@10 Mean | MRR@10 Max | Recall@100 Mean | Recall@100 Max | MAP Mean | MAP Max
Hausa | 0.2690 | 0.5700 | 0.4230 | 0.6952 | 0.3598 | 0.5902 | 0.1624 | 0.3611
Somali | 0.2403 | 0.5118 | 0.4115 | 0.7102 | 0.3265 | 0.6436 | 0.1483 | 0.3567
Swahili | 0.2644 | 0.5232 | 0.4537 | 0.7222 | 0.3249 | 0.5956 | 0.1406 | 0.3117
Yoruba | 0.3115 | 0.5819 | 0.4486 | 0.6211 | 0.5091 | 0.8057 | 0.2135 | 0.4512

Dense models make up 62% of the submissions for each language and have the highest average scores across the metrics. Most submissions employ end-to-end cross-lingual retrieval, with a few document translation methods, represented as DT in the tables. However, the top 2 performing submissions across the languages employ document translation at one stage or another in their systems and have the highest scores on all metrics. The effectiveness of the model types is better visualized in Figure 3. Runs are ordered by their nDCG@20 scores, and though dense runs make up most of the top runs, there is variation in effectiveness across the dense models. The effectiveness of reranking methods also varies widely across the languages, with the exception of Yoruba, where reranking models have the top nDCG@20 scores, as seen in Figure 3a. Since there was no dedicated reranking task, submitted runs employ different first- and second-stage methods, which contributes to the varying output quality. However, the best reranking run outperformed the best dense run in every language except Somali. The submission pool has a very small number of hybrid and sparse runs, giving insufficient room for comparison of these model types on the task. The sparse run, however, outperforms some of the dense and reranking runs and achieves competitive nDCG scores, especially in Somali and Yoruba.

Figure 3: Distribution of nDCG@20 (a) and Recall@100 (b) among the various run types, ordered by nDCG@20 in both panels. Hatched bars represent runs that implement document translation at any stage in their methods; most submitted runs employ end-to-end CLIR retrieval.

Dense models achieve higher Recall@100 across all languages, as seen in Figure 3b. Maintaining the same ordering by nDCG@20, runs without a high nDCG@20 still retrieved many relevant passages in their top 100 candidates. With the exception of Yoruba and the best reranking model, reranking generally achieved lower Recall@100, with even the sparse run achieving a better score across the languages. These results indicate that many of the submitted systems surface relevant passages at deeper ranks; however, due to the nature of the task, we optimize for early ranking using nDCG@20.
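For reference, the metrics reported in this section can be computed with trec_eval-compatible tooling. The snippet below is a minimal sketch using the pytrec_eval library with toy in-memory qrels and run dictionaries; the docids and scores are made up, the assumption of TREC-formatted qrels and run files is ours, and recip_rank is computed without the @10 cutoff, so it only approximates the MRR@10 reported in the tables.

```python
import pytrec_eval

# Binary qrels ({qid: {docid: relevance}}) and a run ({qid: {docid: score}}); in practice these
# would be parsed from the distributed qrels and run files.
qrels = {
    "41": {"voa#12#0": 1, "voa#12#1": 0, "tuko#3#4": 1},
    "72": {"legitng#5#2": 1, "voa#7#0": 0},
}
run = {
    "41": {"voa#12#0": 14.2, "tuko#3#4": 11.0, "voa#12#1": 9.3},
    "72": {"voa#7#0": 10.1, "legitng#5#2": 8.4},
}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.20", "recip_rank", "recall.100", "map"}
)
per_query = evaluator.evaluate(run)  # per-query scores also underlie the boxplots in Section 6.2

def mean(metric):
    return sum(scores[metric] for scores in per_query.values()) / len(per_query)

print({m: round(mean(m), 4) for m in ("ndcg_cut_20", "recip_rank", "recall_100", "map")})
```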
6.2. Query-level Results
Figures 4, 5, 6 and 7 provide query-level effectiveness using nDCG@20, with queries ordered by their median scores across the evaluated runs. The median nDCG score for a good percentage of queries is greater than 0, indicating that most submissions do not perform too badly on individual queries across the languages. Certain queries, such as 41 in Hausa, also have quite a gap between the maximum score obtained and the scores of the rest of the runs, indicating that specific runs perform better on these queries compared to the others. The same can be said for queries like 81 in Swahili, where only a few runs identify the relevant passages of the query. This implies that these runs capture the semantics of the query, and such queries could boost the scores of systems that are able to retrieve their relevant documents. We also analyse query difficulty across the languages, as queries that are too easy or too difficult are not ideal for distinguishing systems' effectiveness. Examples are query 72 in Hausa and query 433 in Yoruba, where the median nDCG@20 score is 1.0 across the submitted systems, making them very easy queries and problematic for evaluation. There are also quite a number of difficult queries across the languages, with Somali having the most, where only a few outliers score higher than 0 nDCG@20. However, a good number of queries, such as 21 in Swahili and 161 in Somali, have a decent spread of scores and are ideal for evaluation.

7. Conclusion
The CIRAL track was held for the first time at the Forum for Information Retrieval Evaluation (FIRE) 2023, with the goal of promoting the research and evaluation of cross-lingual information retrieval for African languages. The task covered passage retrieval between English and four African languages, and test collections were curated for these languages via community evaluations. Submissions from the participating teams comprise mostly dense single-stage retrieval systems, and these make up most of the best-performing systems on the task. Some limitations faced this year include the small number of participants and the limited diversity of the submitted retrieval systems. Despite these limitations, we hope the CIRAL track evolves and that the curated collection matures into its most reliable and reusable version.

Acknowledgments
This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada. We would like to thank the Masakhane community7 for their contributions in the query development phase of the project. We also appreciate the Johns Hopkins University HLTCOE, organizers of the NeuCLIR track at TREC,8 for contributing the English translations of the passage collections to the track.

7 https://www.masakhane.io/
8 https://neuclir.github.io/

References
[1] P. Schäuble, P. Sheridan, Cross-language information retrieval (CLIR) track overview, NIST Special Publication SP (1998) 31–44.
[2] C. Peters, Information retrieval evaluation in a changing world: lessons learned from 20 years of CLEF (2019).
[3] P. Majumder, M. Mitra, D. Pal, A. Bandyopadhyay, S. Maiti, S. Pal, D. Modak, S. Sanyal, The FIRE 2008 evaluation exercise, ACM Transactions on Asian Language Information Processing (TALIP) 9 (2010) 1–24.
[4] N. Kando, K. Kuriyama, T. Nozue, K. Eguchi, H. Kato, S. Hidaka, Overview of IR tasks at the first NTCIR workshop, in: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, 1999, pp. 11–44.
[5] D. Lawrie, S. MacAvaney, J. Mayfield, P. McNamee, D. W. Oard, L. Soldaini, E. Yang, Overview of the TREC 2022 NeuCLIR track, arXiv preprint arXiv:2304.12367 (2023).
[6] X. Zhang, K. Ogueji, X. Ma, J. Lin, Toward best practices for training multilingual dense retrieval models, ACM Transactions on Information Systems 42 (2023) 1–33.
[7] M. Yarmohammadi, X. Ma, S. Hisamoto, M. Rahman, Y. Wang, H. Xu, D. Povey, P. Koehn, K. Duh, Robust document representations for cross-lingual information retrieval in low-resource settings, in: Proceedings of Machine Translation Summit XVII: Research Track, 2019, pp. 12–20.
[8] S. Nair, P. Galuscakova, D. W. Oard, Combining contextualized and non-contextualized query translations to improve CLIR, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1581–1584.
[9] L. Zhao, R. Zbib, Z. Jiang, D. Karakos, Z. Huang, Weakly supervised attentional model for low resource ad-hoc cross-lingual information retrieval, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), 2019, pp. 259–264.
[10] O. Ogundepo, X. Zhang, S. Sun, K. Duh, J. Lin, AfriCLIRMatrix: Enabling cross-lingual information retrieval for African languages, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 8721–8728.
[11] S. Sasaki, S. Sun, S. Schamoni, K. Duh, K. Inui, Cross-lingual learning-to-rank with shared representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 458–463.
[12] S. Sun, K. Duh, CLIRMatrix: A massively large collection of bilingual and multilingual datasets for cross-lingual information retrieval, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4160–4170.
[13] C. Rubino, Machine translation for English retrieval of information in any language (machine translation for English-based domain-appropriate triage of information in any language), in: Conferences of the Association for Machine Translation in the Americas: MT Users' Track, The Association for Machine Translation in the Americas, Austin, TX, USA, 2016, pp. 322–354.
[14] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 Deep Learning track, arXiv preprint arXiv:2003.07820 (2020).
[15] M. S. Tamber, R. Pradeep, J. Lin, Pre-processing matters! Improved Wikipedia corpora for open-domain question answering, in: Proceedings of the 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part III, Springer, 2023, pp. 163–176.
[16] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human-generated MAchine Reading COmprehension dataset (2016).
[17] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, J. Lin, Making a MIRACL: Multilingual information retrieval across a continuum of languages, arXiv preprint arXiv:2210.09984 (2022).
[18] D. I. Adelani, M. Masiak, I. A. Azime, J. O. Alabi, A. L. Tonja, C. Mwase, O. Ogundepo, B. F. Dossou, A. Oladipo, D. Nixdorf, et al., MasakhaNews: News topic classification for African languages, arXiv preprint arXiv:2304.09972 (2023).
[19] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.

Table 8: Hausa results.
Run | Team | End-to-End | Model Type | nDCG@20 | MRR@10 | MAP | Recall@100
bm25-dt-mT5-pft-rerank | h2oloo | DT | Reranking | 0.5700 | 0.6952 | 0.3610 | 0.5902
dt.plaid | HLTCOE | DT | Dense | 0.4743 | 0.5846 | 0.3088 | 0.5733
plaid-xlmr.mlmfine.tt | HLTCOE | ✓ | Dense | 0.4335 | 0.5625 | 0.2711 | 0.5256
plaid-xlmr.mlmfine.tt.jholo | HLTCOE | ✓ | Dense | 0.3601 | 0.4886 | 0.2372 | 0.4829
plaid-xlmr.tt | HLTCOE | ✓ | Dense | 0.3557 | 0.5693 | 0.2468 | 0.5083
plaid-xlmr.mlmfine.et | HLTCOE | ✓ | Dense | 0.3488 | 0.5524 | 0.2461 | 0.5326
plaid-xlmr.et | HLTCOE | ✓ | Dense | 0.3481 | 0.5153 | 0.2430 | 0.5237
bm25-dt-afrimt5-rerank | h2oloo | DT | Reranking | 0.3152 | 0.5306 | 0.1993 | 0.4186
afroxlmr_..._ft_ckpt2000 | Masakhane | ✓ | Dense | 0.2830 | 0.4288 | 0.1364 | 0.3732
afro_xlmr_..._sw_mrtydi_ft | Masakhane | ✓ | Dense | 0.276 | 0.4921 | 0.1566 | 0.4148
splade.bm25-mt5-rerank | h2oloo | ✓ | Reranking | 0.2530 | 0.4159 | 0.1384 | 0.2256
splade-mt5-rerank | h2oloo | ✓ | Reranking | 0.2530 | 0.4159 | 0.1384 | 0.2256
afroxlmr_base_..._ckpt1000 | Masakhane | ✓ | Dense | 0.2451 | 0.3908 | 0.1232 | 0.3576
afriberta_base_ckpt25k | Masakhane | ✓ | Dense | 0.2085 | 0.3567 | 0.1153 | 0.3342
dt.bm25-rm3 | HLTCOE | DT | Sparse | 0.2015 | 0.3571 | 0.1260 | 0.3359
splade-afrimt5-rerank | h2oloo | ✓ | Reranking | 0.1771 | 0.3687 | 0.095 | 0.1757
afriberta_base_..._mrtydi_ft | Masakhane | ✓ | Dense | 0.1417 | 0.2408 | 0.0738 | 0.2187
hybrid_afriberta_dpr_splade | h2oloo | ✓ | Hybrid | 0.1405 | 0.3143 | 0.0706 | 0.2567
hybrid_mdpr_msmarco_clir_splade | h2oloo | ✓ | Hybrid | 0.0895 | 0.2174 | 0.0446 | 0.1436
afriberta_base_..._sw_miracl | Masakhane | ✓ | Dense | 0.0895 | 0.1744 | 0.0418 | 0.2122
afriberta_..._sw_miracl_ft_ckpt100 | Masakhane | ✓ | Dense | 0.0846 | 0.2113 | 0.0376 | 0.1264
Table 9: Somali results.
Run | Team | End-to-End | Model Type | nDCG@20 | MRR@10 | MAP | Recall@100
dt.plaid | HLTCOE | DT | Dense | 0.5118 | 0.7102 | 0.3567 | 0.6436
bm25-dt-mT5-pft-rerank | h2oloo | DT | Reranking | 0.4935 | 0.6542 | 0.3387 | 0.5581
plaid-xlmr.mlmfine.tt | HLTCOE | ✓ | Dense | 0.3366 | 0.5414 | 0.2115 | 0.4534
plaid-xlmr.mlmfine.tt.jholo | HLTCOE | ✓ | Dense | 0.3117 | 0.5249 | 0.1892 | 0.4277
plaid-xlmr.et | HLTCOE | ✓ | Dense | 0.2915 | 0.5513 | 0.1881 | 0.4332
plaid-xlmr.tt | HLTCOE | ✓ | Dense | 0.2878 | 0.4783 | 0.1906 | 0.4373
plaid-xlmr.mlmfine.et | HLTCOE | ✓ | Dense | 0.2760 | 0.4609 | 0.1911 | 0.4260
bm25-dt-afrimt5-rerank | h2oloo | DT | Reranking | 0.2676 | 0.4680 | 0.1718 | 0.3234
dt.bm25-rm3 | HLTCOE | DT | Sparse | 0.2550 | 0.3978 | 0.1789 | 0.4210
splade.bm25-mt5-rerank | h2oloo | ✓ | Reranking | 0.2445 | 0.4440 | 0.1402 | 0.2372
splade-mt5-rerank | h2oloo | ✓ | Reranking | 0.2445 | 0.4440 | 0.1402 | 0.2372
afroxlmr_..._ft_ckpt2000 | Masakhane | ✓ | Dense | 0.2209 | 0.3846 | 0.1119 | 0.2975
afro_xlmr_..._sw_mrtydi_ft | Masakhane | ✓ | Dense | 0.2165 | 0.4004 | 0.1219 | 0.3337
afroxlmr_base_..._ckpt1000 | Masakhane | ✓ | Dense | 0.2067 | 0.3840 | 0.1087 | 0.3083
splade-afrimt5-rerank | h2oloo | ✓ | Reranking | 0.1647 | 0.3691 | 0.0862 | 0.1645
hybrid_afriberta_dpr_splade | h2oloo | ✓ | Hybrid | 0.1644 | 0.3191 | 0.0928 | 0.2467
afriberta_base_ckpt25k | Masakhane | ✓ | Dense | 0.1505 | 0.2845 | 0.0789 | 0.2459
hybrid_mdpr_msmarco_clir_splade | h2oloo | ✓ | Hybrid | 0.1198 | 0.2730 | 0.0725 | 0.1765
afriberta_base_..._mrtydi_ft | Masakhane | ✓ | Dense | 0.1034 | 0.2206 | 0.0436 | 0.1663
afriberta_base_..._sw_miracl | Masakhane | ✓ | Dense | 0.0965 | 0.1713 | 0.0546 | 0.1780
afriberta_..._sw_miracl_ft_ckpt100 | Masakhane | ✓ | Dense | 0.0830 | 0.1589 | 0.0453 | 0.1407

Table 10: Swahili results.
Run | Team | End-to-End | Model Type | nDCG@20 | MRR@10 | MAP | Recall@100
bm25-dt-mT5-pft-rerank | h2oloo | DT | Reranking | 0.5232 | 0.7222 | 0.3110 | 0.5473
dt.plaid | HLTCOE | DT | Dense | 0.5118 | 0.7102 | 0.3567 | 0.6436
plaid-xlmr.mlmfine.tt | HLTCOE | ✓ | Dense | 0.4230 | 0.6175 | 0.2500 | 0.4645
plaid-xlmr.mlmfine.tt.jholo | HLTCOE | ✓ | Dense | 0.4081 | 0.6002 | 0.2362 | 0.4477
plaid-xlmr.tt | HLTCOE | ✓ | Dense | 0.3347 | 0.5615 | 0.2041 | 0.4510
plaid-xlmr.mlmfine.et | HLTCOE | ✓ | Dense | 0.3301 | 0.5820 | 0.2093 | 0.4487
afroxlmr_..._ft_ckpt2000 | Masakhane | ✓ | Dense | 0.3215 | 0.4983 | 0.1378 | 0.3543
plaid-xlmr.et | HLTCOE | ✓ | Dense | 0.3182 | 0.5541 | 0.1977 | 0.4616
afroxlmr_base_..._ckpt1000 | Masakhane | ✓ | Dense | 0.3037 | 0.5190 | 0.1302 | 0.3585
bm25-dt-afrimt5-rerank | h2oloo | DT | Reranking | 0.2542 | 0.4729 | 0.1351 | 0.2899
afro_xlmr_..._sw_mrtydi_ft | Masakhane | ✓ | Dense | 0.2447 | 0.4621 | 0.1118 | 0.3499
splade.bm25-mt5-rerank | h2oloo | ✓ | Reranking | 0.2395 | 0.4764 | 0.1055 | 0.2105
splade-mt5-rerank | h2oloo | ✓ | Reranking | 0.2395 | 0.4764 | 0.1055 | 0.2105
dt.bm25-rm3 | HLTCOE | DT | Sparse | 0.2178 | 0.4172 | 0.1353 | 0.3340
afriberta_base_ckpt25k | Masakhane | ✓ | Dense | 0.1833 | 0.3968 | 0.0792 | 0.2486
afriberta_base_..._mrtydi_ft | Masakhane | ✓ | Dense | 0.1426 | 0.3143 | 0.0532 | 0.2207
splade-afrimt5-rerank | h2oloo | ✓ | Reranking | 0.1378 | 0.3043 | 0.0537 | 0.1215
hybrid_afriberta_dpr_splade | h2oloo | ✓ | Hybrid | 0.1277 | 0.3000 | 0.0565 | 0.1930
afriberta_base_..._sw_miracl | Masakhane | ✓ | Dense | 0.1109 | 0.2141 | 0.0463 | 0.1897
afriberta_..._sw_miracl_ft_ckpt100 | Masakhane | ✓ | Dense | 0.1068 | 0.2046 | 0.0395 | 0.1442
hybrid_mdpr_msmarco_clir_splade | h2oloo | ✓ | Hybrid | 0.0909 | 0.2338 | 0.0435 | 0.1807
Table 11: Yoruba results.
Run | Team | End-to-End | Model Type | nDCG@20 | MRR@10 | MAP | Recall@100
bm25-dt-mT5-pft-rerank | h2oloo | DT | Reranking | 0.5819 | 0.6211 | 0.4512 | 0.8057
dt.plaid | HLTCOE | DT | Dense | 0.4793 | 0.6036 | 0.3657 | 0.7240
plaid-xlmr.mlmfine.tt.jholo | HLTCOE | ✓ | Dense | 0.4297 | 0.5438 | 0.3044 | 0.6748
plaid-xlmr.mlmfine.tt | HLTCOE | ✓ | Dense | 0.4189 | 0.5434 | 0.2985 | 0.6394
bm25-dt-afrimt5-rerank | h2oloo | DT | Reranking | 0.4103 | 0.5014 | 0.3162 | 0.6377
splade.bm25-mt5-rerank | h2oloo | ✓ | Reranking | 0.4071 | 0.5822 | 0.2808 | 0.5037
splade-mt5-rerank | h2oloo | ✓ | Reranking | 0.4071 | 0.5822 | 0.2808 | 0.5037
plaid-xlmr.mlmfine.et | HLTCOE | ✓ | Dense | 0.3804 | 0.5520 | 0.2707 | 0.6950
dt.bm25-rm3 | HLTCOE | DT | Sparse | 0.3555 | 0.4909 | 0.2696 | 0.6273
plaid-xlmr.tt | HLTCOE | ✓ | Dense | 0.3522 | 0.4731 | 0.2505 | 0.5784
splade-afrimt5-rerank | h2oloo | ✓ | Reranking | 0.3138 | 0.4996 | 0.2130 | 0.4100
plaid-xlmr.et | HLTCOE | ✓ | Dense | 0.2627 | 0.4214 | 0.1863 | 0.5176
afriberta_base_ckpt25k | Masakhane | ✓ | Dense | 0.2357 | 0.3763 | 0.1423 | 0.4369
afro_xlmr_..._sw_mrtydi_ft | Masakhane | ✓ | Dense | 0.2296 | 0.4080 | 0.1435 | 0.4418
afroxlmr_..._ft_ckpt2000 | Masakhane | ✓ | Dense | 0.2210 | 0.4009 | 0.1186 | 0.377
hybrid_afriberta_dpr_splade | h2oloo | ✓ | Hybrid | 0.2015 | 0.3454 | 0.1205 | 0.3786
afroxlmr_base_..._ckpt1000 | Masakhane | ✓ | Dense | 0.1981 | 0.3500 | 0.1071 | 0.3947
afriberta_base_..._mrtydi_ft | Masakhane | ✓ | Dense | 0.1688 | 0.2999 | 0.0880 | 0.3278
hybrid_mdpr_msmarco_clir_splade | h2oloo | ✓ | Hybrid | 0.1642 | 0.2921 | 0.0968 | 0.3452
afriberta_base_..._sw_miracl | Masakhane | ✓ | Dense | 0.1634 | 0.2612 | 0.0922 | 0.3658
afriberta_..._sw_miracl_ft_ckpt100 | Masakhane | ✓ | Dense | 0.1594 | 0.2712 | 0.0865 | 0.3063

Figure 4: Boxplots showing nDCG@20 for Hausa queries.
Figure 5: Boxplots showing nDCG@20 for Somali queries.
Figure 6: Boxplots showing nDCG@20 for Swahili queries.
Figure 7: Boxplots showing nDCG@20 for Yoruba queries.