<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>gabriella.pasi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <kwd-group>
          <kwd>Mixture-of-Experts</kwd>
          <kwd>Representation Learning</kwd>
          <kwd>Dense Neural Retrievers</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication (DISCo)</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While Dense Retrieval Models (DRMs) have advanced Information Retrieval (IR), they often suffer from limited generalizability and robustness. Various studies address these limitations with representation learning techniques that leverage the Mixture-of-Experts (MoE) architecture. Unlike prior works in IR that integrate MoE within the Transformer layers of DRMs, we add a single MoE block (SB-MoE) after the output of the final Transformer layer. Our empirical evaluation investigates how SB-MoE compares, in terms of retrieval effectiveness, to standard model fine-tuning. Given MoE's sensitivity to its hyperparameters (i.e., the number of experts), we also investigate our model's performance under different expert configurations. Results show that SB-MoE particularly benefits lightweight DRMs, which consistently outperform their fine-tuned counterparts. For larger DRMs, SB-MoE requires more training data to deliver improved retrieval performance. Our code is available online at: https://anonymous.4open.science/r/DenseRetrievalMoE.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Investigating Mixture of Experts in Dense Retrieval⋆</title>
      <p>Efrosyni Sokli</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Dense Retrieval Models (DRMs) can capture the semantic context of queries and documents
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and often outperform sparse lexicon-based models such as BM25 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] across various IR tasks.
However, their dependence on large labeled datasets and limited cross-domain generalizability
often requires additional fine-tuning for robust adaptation to different tasks or domains. In this
paper, we investigate the effectiveness of an enhanced bi-encoder DRM architecture leveraging
Mixture-of-Experts (MoE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in various dense retrieval tasks. Unlike previous studies in IR
that integrate MoE within each Transformer layer [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], we apply a single MoE block (SB-MoE)
on the final output embeddings of the underlying DRM. SB-MoE is trained in an unsupervised
manner to automatically optimize each expert and dynamically aggregate their outputs, adapting
predictions to the input embeddings, i.e., the query and document representations produced by
the underlying DRM. We utilize two datasets of the BEIR collection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (i.e., Natural Questions
(NQ) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and HotpotQA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), and two of the Multi-Domain Benchmark by Bassani et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (i.e.,
Political Science (PS) and Computer Science (CS)), to empirically evaluate SB-MoE’s retrieval
efectiveness for open-domain Q&amp;A and domain-specific academic search.
      </p>
      <p>
        This work has the following contributions: (1) We introduce a modular MoE framework, SB-MoE,
which operates on the query and document embeddings produced by an underlying bi-encoder
DRM architecture; (2) We conduct an empirical evaluation using three DRMs (Contriever,
BERT, and TinyBERT) investigating SB-MoE’s retrieval performance and its sensitivity to
hyperparameters (i.e., the number of employed experts), compared to standard model fine-tuning
across four benchmarks.
⋆This is an extended abstract of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>Figure 1: Overview of the SB-MoE architecture. The query and document embeddings produced by the underlying DRM are each passed through a MoE layer, where a gating function weighs the outputs of experts 1 through n; a pooling step then aggregates the experts' outputs into the final output query and document embedding representations.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        DRMs often outperform lexicon-based models (e.g., BM25 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), since they can capture the
semantic context of queries and documents. They project both queries and documents in
a common dense vector space and score documents through similarity functions for a given
query [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. In this work, we leverage three DRMs. Contriever [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a state-of-the-art
BERT-based model that exploits contrastive learning, a Machine Learning technique that uses
pairs of positive and negative examples to learn meaningful and distinctive representations
of queries and documents. Besides BERT [15], we also use TinyBERT [16], which leverages
knowledge distillation [17] to transfer knowledge from its larger counterpart (BERT) to a tinier
version, reducing training times and computational expenses. DRMs often showcase continuous
adaptation needs, which can lead to low generalizability and robustness [18, 19]. The MoE
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] framework has been employed in various approaches to mitigate these issues. MoE can
handle multiple types of data and tasks [20, 21] and has been used in tasks such as classification
[22], and multi-lingual machine translation [23]. MoE has been employed for IR tasks such as
passage retrieval [
        <xref ref-type="bibr" rid="ref5">5, 24</xref>
        ], and Q&amp;A [
        <xref ref-type="bibr" rid="ref6">6, 25, 26</xref>
        ]. These approaches either integrate MoE blocks
into every layer of the Transformer model (substantially increasing the number of parameters)
or only partially leverage MoE by applying it solely to the query representation. In our work, we
apply a single MoE block to both query and document representations and train the obtained
architecture end-to-end for retrieval.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>SB-MoE builds upon a bi-encoder DRM architecture [27], which allows for independent encoding
of documents and queries to enhance scalability and to enable the computation of relevance scores
through a similarity function (e.g., cosine similarity). The proposed model’s architecture consists
of three parts (Figure 1): (1) the experts, operating on the query and document representations
produced by the underlying DRM; (2) the gating function, trained in an unsupervised manner
to indicate the most appropriate expert(s) for a given input; and (3) the pooling module, used
in the final stage to aggregate the experts’ representations and produce the final embedding to
be used for similarity estimation between the query and documents.</p>
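      <p>As a concrete illustration of the bi-encoder scoring step described above, the following sketch (in plain NumPy; function and variable names are ours, not taken from the paper's code) scores a set of independently encoded documents against a query embedding via cosine similarity:</p>

```python
import numpy as np

def cosine_scores(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Relevance scores between one query embedding and a matrix of document embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return d @ q  # one cosine similarity per document

# Toy example: three documents in a 4-dimensional embedding space.
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = np.array([
    [1.0, 0.0, 1.0, 0.0],   # identical direction -> score 1.0
    [0.0, 1.0, 0.0, 1.0],   # orthogonal          -> score 0.0
    [1.0, 1.0, 1.0, 1.0],
])
scores = cosine_scores(query, docs)
ranking = np.argsort(-scores)  # indices of documents, best first
```

      <p>Because documents are encoded independently of the query, their embeddings can be pre-computed and indexed, which is what gives the bi-encoder its scalability.</p>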
      <p>The experts receive as input the query or document embedding as produced by the underlying
DRM. The output is n modified representations, where n is the number of employed experts. The
gating function receives the same input and produces an n-dimensional vector, which indicates
the importance of each expert's contribution to the final query or document embedding. We
rely on noisy Top-1 gating, as proposed by Shazeer et al. [23], for training the gating function.
This approach ensures that SB-MoE can explore every expert during training, enhancing the
robustness of the model. During inference, the pooling module uses two different strategies. The
first one is Top-1 gating [28] (SB-MoETOP-1), which selects solely the output of the expert to which
the gating function assigned the highest score. The second strategy (SB-MoEALL) calculates
probability scores from the gating function's output vector through a softmax normalization
[29], and produces the final embedding as the weighted sum of all experts' outputs.</p>
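      <p>The single MoE block described above can be sketched as a minimal NumPy forward pass. The class name, weight shapes, and initialization below are our assumptions for illustration, not the paper's implementation, and the training-time gating noise of Shazeer et al. [23] is omitted; the sketch shows the two inference-time pooling strategies:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SBMoE:
    """Single MoE block applied to one output embedding (illustrative sketch)."""

    def __init__(self, dim, n_experts):
        # Each expert: down-projection to half the dimension, ReLU, up-projection back.
        self.down = [rng.normal(0.0, 0.02, (dim, dim // 2)) for _ in range(n_experts)]
        self.up = [rng.normal(0.0, 0.02, (dim // 2, dim)) for _ in range(n_experts)]
        self.gate = rng.normal(0.0, 0.02, (dim, n_experts))  # gating function weights

    def forward(self, x, mode="all"):
        weights = softmax(x @ self.gate)  # importance of each expert for this input
        outputs = np.stack([np.maximum(x @ d, 0.0) @ u
                            for d, u in zip(self.down, self.up)])  # (n_experts, dim)
        if mode == "top1":
            # SB-MoETOP-1: keep only the highest-scored expert's output.
            return outputs[int(np.argmax(weights))]
        # SB-MoEALL: softmax-weighted sum of all experts' outputs.
        return weights @ outputs

moe = SBMoE(dim=8, n_experts=3)
emb = rng.normal(size=8)
refined_all = moe.forward(emb, mode="all")    # weighted sum over all experts
refined_top1 = moe.forward(emb, mode="top1")  # single best expert
```

      <p>Both strategies return an embedding of the original dimension, so the refined query and document representations can be scored with the same similarity function as before.</p>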
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Analysis</title>
      <p>This section presents the empirical evaluation conducted to answer the following research
questions (RQs):</p>
      <p>RQ1 How does SB-MoE compare, in terms of effectiveness, to standard model fine-tuning?
RQ2 How does the number of experts impact the retrieval effectiveness of SB-MoE?</p>
      <sec id="sec-5-1">
        <title>4.1. Experimental Setup</title>
        <p>
          For RQ1, we employ 6 distinct experts across all models and datasets. For RQ2, we vary the
number of experts from 3 to 12 with a step of 3. This setup is based on previous studies [30, 31],
which suggest that a high number of experts does not always yield performance improvements [32],
and experiment with expert counts ranging from 2 to 8 [25, 24, 33]. We follow the architecture
proposed by Houlsby et al. [34], where each expert consists of a feed-forward network (FFN) with
a down-projection layer that reduces the input dimension by half, followed by an up-projection
FFN layer, which restores the vector dimension to the original embedding size. The gating
function includes a single hidden layer that reduces the input dimension by half, and an output
FFN layer with dimensionality equal to the number of experts. During training, we use a batch
size of 64. The learning rate is set to 10−6 for the underlying DRM and 10−4 for the experts.
TinyBERT is trained for 30 epochs across all datasets, while BERT and Contriever are trained
for 20 epochs due to resource constraints and longer training times, on all datasets except CS,
where they are trained for 10 epochs since the collection’s training queries are ∼3.5 times more
than the second largest collection used (PS). We reserve 5% of each training set for validation
and keep the checkpoint with the lowest validation loss. We set the random seed to 42 and
use contrastive loss [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] with a temperature of 0.05. For our evaluation, we use NDCG@10
and R@100, two metrics commonly used on BEIR, for comparability. Statistical significance
is assessed using two-sided paired Student's t-tests with Bonferroni multiple testing correction,
at a significance level of 0.05. We integrate SB-MoE into three different DRMs and compare its
retrieval effectiveness to that achieved by the underlying DRM, fine-tuned on the same training
data and hyperparameters. We refer to these baseline experiments as Fine-tuned.
        </p>
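        <p>The contrastive training objective with temperature 0.05 can be sketched as an InfoNCE-style loss over one positive document and a set of negatives. This is a simplified sketch, not the paper's code; the formulation followed in [14] also draws negatives from the batch, which we omit here:</p>

```python
import numpy as np

def contrastive_loss(q, d_pos, d_negs, tau=0.05):
    """InfoNCE-style loss for one query: push the positive document's similarity
    up relative to the negatives' (illustrative sketch)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Positive document sits at index 0; temperature sharpens the distribution.
    sims = np.array([cos(q, d_pos)] + [cos(q, d) for d in d_negs]) / tau
    sims -= sims.max()  # subtract max for numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])

q = np.array([1.0, 0.2, 0.0])
# Loss is low when the positive is close to the query...
good = contrastive_loss(q, d_pos=np.array([1.0, 0.1, 0.0]),
                        d_negs=[np.array([0.0, 1.0, 1.0])])
# ...and high when the labels are swapped.
bad = contrastive_loss(q, d_pos=np.array([0.0, 1.0, 1.0]),
                       d_negs=[np.array([1.0, 0.1, 0.0])])
```

        <p>A low temperature such as 0.05 makes the softmax sharply peaked, so the loss concentrates on hard negatives that score close to the positive.</p>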
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Results and Discussion</title>
        <p>RQ1. As shown in Table 1, SB-MoE consistently improves NDCG@10 and Recall@100, especially
for lightweight models. For example, on TinyBERT, SB-MoE leads to noticeable performance
gains in both metrics across all datasets, with a marked increase in HotpotQA, where SB-MoEALL
achieved an NDCG@10 score of .171 compared to .158 of the fine-tuned version. However, for
larger models like BERT and Contriever, the integration of SB-MoE had a marginal impact, with
similar or slightly worse retrieval performance compared to Fine-tuned. These results suggest
that in models already equipped with a substantial number of parameters, SB-MoE's advantages
may not be so prominent, potentially due to redundancy when additional experts are employed.
Therefore, the integration of SB-MoE particularly benefits lightweight models.</p>
        <p>Figure 2: Retrieval performance of the fine-tuned model and of SB-MoE configured with 3, 6, 9, and 12 experts.</p>
        <p>RQ2. As SB-MoE appears to benefit lightweight models most, we leverage TinyBERT to
understand the impact of the number of experts, configuring SB-MoE with 3, 6, 9, and 12
experts and evaluating across all datasets (Figure 2). Our findings show performance variations
for different expert counts across datasets, which can also lead to the maximization of different
performance measures: in the case of NQ, the employment of 12 experts maximizes NDCG@10,
while Recall@100 is maximized with 9 experts. Therefore, the number of employed experts is a
hyperparameter that requires tuning with respect to the domain and the addressed retrieval task.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>In this work, we integrate a single Mixture-of-Experts block (SB-MoE) into Dense Retrieval Models
(DRMs) and conduct an experimental investigation on its effectiveness in different dense retrieval
tasks. Results show that SB-MoE significantly enhances the retrieval performance of lightweight
DRMs, consistently improving NDCG@10 and R@100 across datasets. However, larger DRMs
only marginally benefit from SB-MoE, indicating that models with a higher parameter count need
dataset-specific optimization to see measurable gains. Our analysis reveals that the number of
employed experts is a key hyperparameter, which influences SB-MoE’s performance and requires
task and domain-specific calibration.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has received funding from the European Union's Horizon Europe research and innovation
programme under the Marie Skłodowska-Curie grant agreement No 101073307.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[14] G. Izacard et al., Unsupervised dense information retrieval with contrastive learning, Transactions on Machine
Learning Research (2022). URL: https://openreview.net/forum?id=jKN1pXi7b0.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[16] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, Q. Liu, TinyBERT:
Distilling BERT for natural language understanding, in: Findings of the Association
for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics,
Online, 2020, pp. 4163–4174. doi:10.18653/v1/2020.findings-emnlp.372.
[17] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, Y. Bengio, FitNets: Hints for
thin deep nets, arXiv (2014).
[18] Y. Liu, R. Zhang, J. Guo, M. de Rijke, Y. Fan, X. Cheng, Robust neural information retrieval:
An adversarial and out-of-distribution perspective, CoRR abs/2407.06992 (2024). URL:
https://doi.org/10.48550/arXiv.2407.06992. doi:10.48550/ARXIV.2407.06992. arXiv:2407.06992.
[19] G. Sidiropoulos, E. Kanoulas, Analysing the robustness of dual encoders for dense retrieval
against misspellings, in: E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper,
G. Kazai (Eds.), SIGIR '22: The 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, ACM, 2022,
pp. 2132–2136. URL: https://doi.org/10.1145/3477495.3531818. doi:10.1145/3477495.3531818.
[20] R. Collobert, S. Bengio, Y. Bengio, A parallel mixture of SVMs for very large scale problems,</p>
      <p>Advances in Neural Information Processing Systems 14 (2001).
[21] M. Li, M. Li, K. Xiong, J. Lin, Multi-task dense retrieval via model uncertainty fusion for
open-domain question answering, in: Findings of the Association for Computational
Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican
Republic, 2021, pp. 274–287. doi:10.18653/v1/2021.findings- emnlp.26.
[22] D. Eigen, M. Ranzato, I. Sutskever, Learning factored representations in a deep mixture of
experts, arXiv preprint arXiv:1312.4314 (2013).
[23] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously
large neural networks: The sparsely-gated mixture-of-experts layer, in: International
Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=B1ck
MDqlg.
[24] G. Ma, X. Wu, P. Wang, S. Hu, Cot-mote: exploring contextual masked
autoencoder pre-training with mixture-of-textual-experts for passage retrieval, arXiv preprint
arXiv:2304.10195 (2023).
[25] D. Dai, W.-J. Jiang, J. Zhang, W. Peng, Y. Lyu, Z. Sui, B. Chang, Y. Zhu, Mixture of
experts for biomedical question answering, in: Natural Language Processing and Chinese
Computing, 2022. URL: https://api.semanticscholar.org/CorpusID:248218762.
[26] P. Kasela, G. Pasi, R. Perego, N. Tonellotto, Desire-me: Domain-enhanced supervised
information retrieval using mixture-of-experts, in: Advances in Information Retrieval,
Springer Nature Switzerland, Cham, 2024, pp. 111–125.
[27] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese
BERTnetworks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3982–3992. doi:10.18653/v1/D19- 1410.
[28] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Y. Zhao, A. M. Dai, Z. Chen, Q. V.</p>
      <p>Le, J. Laudon, Mixture-of-experts with expert choice routing, in: Advances in Neural
Information Processing Systems 35: Annual Conference on Neural Information Processing
Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022,
2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/2f00ecd787b432c1d36f3de9800728e
b-Abstract-Conference.html.
[29] M. I. Jordan, R. A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm,
Neural Computation 6 (1994) 181–214. URL: https://doi.org/10.1162/neco.1994.6.2.181.
doi:10.1162/neco.1994.6.2.181.
[30] X. Li, S. He, J. Wu, Z. Yang, Y. Xu, Y. jun Jun, H. Liu, K. Liu, J. Zhao, Mode-cotd:
Chainof-thought distillation for complex reasoning tasks with mixture of decoupled lora-experts,
in: Proceedings of the 2024 Joint International Conference on Computational Linguistics,
Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy,
ELRA and ICCL, 2024, pp. 11475–11485. URL: https://aclanthology.org/2024.lrec-main.1003.
[31] T. Zadouri, A. Üstün, A. Ahmadian, B. Ermis, A. Locatelli, S. Hooker, Pushing mixture
of experts to the limit: Extremely parameter eficient moe for instruction tuning, in: The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria,
May 7-11, 2024, OpenReview.net, 2024.
[32] T. Chen, Z. Zhang, A. K. Jaiswal, S. Liu, Z. Wang, Sparse moe as the new dropout:
Scaling dense and self-slimmable transformers, in: The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net,
2023. URL: https://openreview.net/forum?id=w1hwFUb_81.
[33] Y. Wang, S. Agarwal, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, J. Gao, AdaMix:
Mixture-of-adaptations for parameter-eficient model tuning, in: Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing, Association
for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 5744–5760.
doi:10.18653/v1/2022.emnlp- main.388.
[34] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo,
M. Attariyan, S. Gelly, Parameter-eficient transfer learning for NLP, in: Proceedings of the
36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, PMLR, 2019, pp. 2790–2799. URL: https://proceedings.mlr.press/v97/houl
sby19a.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sokli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kasela</surname>
          </string-name>
          , G. Peikos, G. Pasi,
          <article-title>Investigating mixture of experts in dense retrieval</article-title>
          ,
          <source>CoRR abs/2412.11864</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2412.11864. doi:10.48550/ARXIV.2412.11864. arXiv:2412.11864.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>An introduction to neural information retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at TREC-3, NIST Special Publication SP
          <volume>109</volume>
          (
          <year>1995</year>
          )
          <fpage>109</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Nowlan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <source>Adaptive Mixtures of Local Experts, Neural Computation</source>
          <volume>3</volume>
          (
          <year>1991</year>
          )
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          . doi:10.1162/neco.1991.3.1.79.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , X. Cheng, Came:
          <article-title>Competitively learning a mixture-of-experts model for first-stage retrieval</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          . (
          <year>2024</year>
          ). doi:10.1145/3678880, Just Accepted.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Mixture-of-experts meets instruction tuning: A winning combination for large language models</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=6mLjDwYte5.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=wCu6T5xFjeJ.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Redfield</surname>
          </string-name>
          , M. Collins,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelcey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Natural Questions: A Benchmark for Question Answering Research</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>453</fpage>
          -
          <lpage>466</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00276. doi: 10.1162/tacl_a_00276.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>HotpotQA: A dataset for diverse, explainable multi-hop question answering</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>2369</fpage>
          -
          <lpage>2380</lpage>
          . doi: 10.18653/v1/D18-1259.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kasela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          ,
          <article-title>A multi-domain benchmark for personalized search evaluation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, CIKM '22</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>3822</fpage>
          -
          <lpage>3827</lpage>
          . doi: 10.1145/3511808.3557536.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Unsupervised corpus aware language model pre-training for dense passage retrieval</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>2843</fpage>
          -
          <lpage>2853</lpage>
          . doi: 10.18653/v1/2022.acl-long.203.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamalloo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Resources for brewing BEIR: Reproducible reference models and statistical analyses</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>1431</fpage>
          -
          <lpage>1440</lpage>
          . doi: 10.1145/3626772.3657862.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>COCO-DR: Combating the distribution shift in zero-shot dense retrieval with contrastive and distributionally robust learning</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>1462</fpage>
          -
          <lpage>1479</lpage>
          . doi: 10.18653/v1/2022.emnlp-main.95.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          , Unsu-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>