<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LexiSemIR: A Two-Stage Re-ranking Framework with BM25 and Zero-Shot Bi-Encoder</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Swati Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanusree Nath</string-name>
          <email>tanusreenath32@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vedika Gupta</string-name>
          <email>vedika.nit@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manjari Gupta</string-name>
          <email>manjari@bhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Banaras Hindu University</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jindal Global Business School, O.P. Jindal Global University</institution>
          ,
          <addr-line>Sonipat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Most standard Information Retrieval (IR) models primarily rely on keyword matching, which can be inadequate when a deeper contextual understanding is required. In such cases, it becomes essential to capture both lexical and semantic relationships between query-document pairs. To address this limitation, our team CodeWeavers proposes LexiSemIR, a two-stage re-ranking-based model developed for the CMIR-2025 (Code-Mixed Information Retrieval) shared task on Bengali-English code-mixed text. In the first stage, the top k documents are retrieved using a lexical bag-of-words model (BM25). These are then re-ranked in the second stage using a zero-shot bi-encoder, which computes semantic similarity between query and document embeddings. The proposed approach balances simplicity and performance, while minimizing trainable parameters due to its zero-shot design. LexiSemIR secured 3rd place in the CMIR-2025 shared task, achieving MAP = 0.1546 and P@5 = 0.38, thereby outperforming the BM25 baseline in early precision. The results highlight the model's ability to effectively combine lexical and semantic retrieval strategies for robust performance in code-mixed IR settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-mixed language</kwd>
        <kwd>Information retrieval</kwd>
        <kwd>BM25</kwd>
        <kwd>Bi-Encoder</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) is broadly defined as "finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections (usually stored on
computers)" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In modern times, IR finds applications in a variety of tasks, including web
search, question answering systems, personal assistants, chatbots, and digital libraries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There has
also been a surge of user-generated content in code-mixed languages in recent years, which has further
complicated the task of information retrieval [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Code-mixing refers to the act of "mixing two or more
languages in a single discourse" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While code-mixing may help a model understand multilingual
similarities, it can also hurt retrieval effectiveness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Most traditional IR systems
depend on lexical matching-based algorithms (BM25, Hiemstra-LM, etc.), which may prove
ineffective in instances where the same word has different meanings or where different words
have the same meaning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In such cases, a deeper semantic understanding
of the queries and documents is needed. To address this drawback, semantic-matching-based algorithms
(vector space models, neural networks, etc.) have been developed that focus on the meanings of tokens. To
further improve the performance of a single IR algorithm, two-stage retrieval systems have been developed;
since the 1990s, such systems have undergone constant improvement with the advent of
new methods and technologies. One recent study used a two-stage retrieval system combining
an adapted BM25 with a neural ranking model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Instead of the traditional two-stage retrieval system, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced a serverless three-stage retrieval
system with BM25, monoBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and duoBERT. Another study by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also developed a two-stage
retrieval system using BM25 and monoBERT encoder. Following these works, our study also aims to
build a two-stage retrieval model. However, in place of monoBERT in the second stage, we employ
MPNet in a zero-shot setting. MPNet is a good candidate as a second-stage re-ranker because
documents are often encoded using sentence embeddings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In addition to that, using zero-shot
mode to transfer knowledge from English to other languages is a popular approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This not
only makes the model faster and scalable, but also allows for independent encoding of queries and
documents via the bi-encoder architecture instead of the traditional cross-encoder. A similar study
by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] also employed BM25 and a SentenceTransformer-based encoder to build a two-stage retrieval
system. However, this study used fine-tuning to fit the SentenceTransformer on their data. Our proposed
approach uses SentenceTransformer in a zero-shot setting, which makes it faster and parameter-efficient,
while maintaining comparable performance in the given task.
      </p>
      <p>The overall contributions of our work are summarized below:
• We propose LexiSemIR, a two-stage re-ranking-based retrieval framework, combining a lexical
BM25 ranker with a zero-shot semantic bi-encoder to capture both the lexical and semantic meaning of
query-document pairs. The model balances performance with simplicity and parameter
efficiency. It ranked 3rd in the CMIR-2025 shared task on Bengali-English
code-mixed text, with a MAP score of 0.1546, an NDCG score of 0.2767, P@5 of 0.38, and P@10 of 0.2833.
• We conduct a detailed error analysis of the proposed model and provide insights into its
underlying biases and sources of error. In addition, we highlight query-specific variations
that directly affect retrieval quality. This insight provides actionable guidance for
query-expansion techniques in future work.
• A sensitivity analysis is performed across multiple values of the first-stage retrieval depth k,
demonstrating how retrieval depth influences performance stability across metrics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        Code-mixed information retrieval research has undergone significant improvement in recent years.
One of the major breakthroughs in this area is the creation of new corpora in low-resource languages.
Early studies like [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] introduced such a corpus of Hindi-English code-mixed social media data. Another
study by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduced a track for mixed-script information retrieval in Hindi at FIRE-2016.
Bengali, being such a low-resource language, has also started gaining attention from the research community. A study
by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] developed a Bengali information retrieval system by introducing a Bengali text corpus, using
advanced preprocessing techniques and applying TF-IDF with cosine similarity to retrieve relevant
answers to queries. Another study [18] experimented with prompt engineering and mathematical
modeling with GPT-3.5. Some studies, like [19] and [20], combined multiple languages (Hindi, Bengali
and Marathi). In the study by [19], several indexing and retrieval strategies were evaluated using
techniques like Divergence from Randomness variants, Okapi BM25, TF-IDF, and statistical language
models.
      </p>
      <p>Some of the early works also focused on cross-lingual information retrieval. A study [21] developed
a framework to retrieve English documents using Hindi and Bengali queries using machine translation
and automatic query generation. A similar study by [22] assembled a system to retrieve English
documents using Bengali, Hindi and Telugu queries. To achieve this, they used a combination of
bilingual dictionaries, suffix-stripping stemmers, transliteration and TF-IDF ranking.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section presents the detailed methodology of the proposed approach. The goal of this work is to
develop a Code-Mixed Information Retrieval (CMIR) system, focusing on code-mixed Bengali-English
queries and documents. Standard baseline information retrieval models mainly rely on lexicon-based
keyword matching (TF-IDF, BM25, PL2, InL2, Hiemstra-LM), completely discounting the semantic
meaning of natural text. To overcome this drawback, the LexiSemIR (Lexical + Semantic Information
Retrieval) model is proposed, which effectively balances lexical and semantic matching of
query-document pairs. By employing a two-stage retrieval process and re-ranking technique, the LexiSemIR
model ensures that both surface-level keyword overlap and deeper contextual understanding between
query-document pairs are captured. An overview of the model is given in Figure 1. The detailed
workflow of the model is described in the succeeding sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Setup</title>
        <p>The dataset consists of a set of queries Q and documents D. Both the queries and the documents are
provided in code-mixed Bengali. For the training phase, the relevance judgements for the queries are
available in a separate file, where each query is mapped to one or more relevant documents, with a
binary relevance score. Table 1 summarizes the dataset statistics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Design</title>
        <p>The retrieval is performed in two stages. In the first stage of the retrieval, the BM25 ranker is used to
retrieve the top 100 matching documents [23], followed by a bi-encoder-based re-ranking to retrieve
the top 10 amongst those.</p>
        <p>BM25. BM25 is a classic probabilistic bag-of-words model, used as the first-stage retriever. Given a query
Q with tokens {q1, q2, . . . , qn}, and a set of documents D = {d1, d2, . . . , dN}, where N is the total
number of documents in the corpus, BM25 calculates the relevance score of document d with respect
to query Q as:</p>
        <p>BM25(Q, d) = Σ (i = 1 to n) IDF(qi) · [ f(qi, d) · (k1 + 1) ] / [ f(qi, d) + k1 · (1 − b + b · |d| / avgdl) ]</p>
        <p>Here, k1 and b are two hyperparameters: k1 controls the term-frequency saturation, while b controls
document length normalization. f(qi, d) is the frequency of term qi in document d, avgdl is the
average document length in the collection, and IDF stands for the Inverse Document Frequency of a
term qi. It is calculated as:</p>
        <p>IDF(qi) = ln( (N − n(qi) + 0.5) / (n(qi) + 0.5) + 1 )</p>
        <p>In the above equation, n(qi) refers to the number of documents containing the term qi.</p>
        <p>Bi-encoder Unit: The documents retrieved in the first stage, along with the query, are passed to a
pre-trained sentence-transformer-based bi-encoder unit. The bi-encoder unit consists of the
all-mpnet-base-v2 Sentence-Transformer (based on MPNet [24]), which processes the query and the
documents in parallel. Given the initial query Q, let D′ = {d1, d2, . . . , d100} be the top 100 documents retrieved
in the first stage. If E : text → R^768 denotes the bi-encoder function that maps a piece of text to a
dense vector, then the query embedding emb_Q and the document embedding emb_d are calculated as
emb_Q = E(Q) ∈ R^768 and emb_d = E(d) ∈ R^768, respectively. The bi-encoder unit
encodes each document and the query to produce dense vectors that capture their semantic meaning.
The final relevance score s(Q, d) of each document with respect to the query is calculated by computing
the cosine similarity between their respective embedding vectors:</p>
        <p>s(Q, d) = cos(emb_Q, emb_d) = (emb_Q)ᵀ emb_d / ( ‖emb_Q‖₂ ‖emb_d‖₂ )</p>
        <p>This score is used to retrieve and align the final top 10 relevant documents with respect to query Q.</p>
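        <p>The re-ranking step reduces to a cosine-similarity sort over embeddings. A minimal sketch follows, using toy 3-dimensional vectors in place of the 768-dimensional vectors a sentence encoder such as all-mpnet-base-v2 would produce.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(query_emb, doc_embs, top_n=10):
    """Order first-stage candidates by cosine similarity to the query."""
    ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]),
                    reverse=True)
    return ranked[:top_n]

# Toy embeddings standing in for encoder outputs
q_emb = [1.0, 0.2, 0.0]
cand_embs = {"d1": [0.9, 0.1, 0.1],
             "d2": [0.0, 1.0, 0.0],
             "d3": [-1.0, 0.0, 0.3]}
top = rerank(q_emb, cand_embs, top_n=2)
```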
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Hyperparameters</title>
        <p>Hyperparameters are a set of configuration variables that control the model’s performance. By
configuring the hyperparameter settings, a model can be tuned for optimal performance. Table 2 summarizes
the different hyperparameters used in the LexiSemIR model along with their descriptions.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training and Testing</title>
        <p>For training, 20 queries are provided along with a relevance judgement file that maps each query to at least one document. The LexiSemIR model involves limited
tuning only in Stage 1 (BM25). Specifically, we adjusted BM25 hyperparameters using the training set
queries and relevance judgements. Stage 2 (the bi-encoder) was employed in a purely zero-shot setting
without any additional fine-tuning. Thus, while the overall framework is evaluated on both training
and test queries, only Stage 1 undergoes hyperparameter tuning. For testing, 30 queries are given. The
trained model is used to retrieve relevant documents for the test queries, the results of which are further
evaluated by the organizers to publish the final evaluation metrics.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our team CodeWeavers submitted two runs in the CMIR-2025 shared task. Run 1 corresponds to an
enhanced version of Hiemstra-LM with advanced query and document preprocessing. Run 2 corresponds
to the LexiSemIR model, which secured 3rd rank in the overall competition. The test metrics were
released by the organizers, as we did not have direct access to the relevance judgments. The reported evaluation
metrics on the test queries for both runs are listed in Table 3: Run 1 achieved MAP = 0.15627 and
NDCG = 0.341038, while Run 2 (LexiSemIR) achieved MAP = 0.154625 and NDCG = 0.276684.</p>
      <p>Since the relevance judgements for the test queries were not available, further analysis of the model
based on errors and sensitivity is conducted on the training set. Table 4 shows the performance metrics
of the model on the training queries: Run 1 achieved MAP = 0.4162 and NDCG = 0.4711, while
Run 2 (LexiSemIR) achieved MAP = 0.644196 and NDCG = 0.411855.</p>
      <sec id="sec-4-1">
        <title>4.1. Error Analysis</title>
        <p>This section presents the error analysis of the proposed model in the training phase. Figure 2 shows
the confusion matrix corresponding to the proposed LexiSemIR model. From the figure, it is evident
that a total of 67 out of 378 relevant documents were retrieved by the model. It can also be observed
that the false-negative (FN) rate is higher than the false-positive (FP) rate. The FN rate shows that a
considerable portion of relevant documents were not ranked in the top ten, indicating significant room for
improvement. The FP rate is also quite high, signalling that the model confidently
places irrelevant documents in the top ten.</p>
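        <p>The confusion-matrix counts can be derived directly from the retrieved top-10 lists and the relevance judgements; a minimal sketch on toy data:</p>

```python
def confusion_counts(retrieved_by_query, relevant_by_query):
    """Aggregate TP/FP/FN over all queries from top-k retrieval results."""
    tp = fp = fn = 0
    for q, retrieved in retrieved_by_query.items():
        relevant = relevant_by_query.get(q, set())
        hits = set(retrieved) & relevant
        tp += len(hits)                    # relevant docs that were retrieved
        fp += len(retrieved) - len(hits)   # irrelevant docs in the top-k
        fn += len(relevant) - len(hits)    # relevant docs that were missed
    return tp, fp, fn

# Toy retrieval results and judgements
retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
relevant = {"q1": {"d2", "d7"}, "q2": {"d9"}}
tp, fp, fn = confusion_counts(retrieved, relevant)
```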
        <p>Figure 3 shows the rank distribution of the 67 retrieved documents. A clear trend can be observed,
where the number of retrieved relevant documents decreases as the rank increases, showing that
the model successfully places the most relevant documents in the top ranks.</p>
        <p>For a more detailed error analysis, an evaluation of the retrieved documents based on each query is
presented in Table 5. It is observed from the table that queries 19 and 1 retrieved the highest number of
relevant documents, followed by queries 13 and 14. The worst performing queries are 2 and 12, with 0
retrieved documents. Queries 21 and 25 have the highest number of false negatives, while queries 2 and
12 have the highest number of false positives.</p>
        <p>A closer inspection reveals that the poor performance for queries 2 and 12 may be attributed to
their predominant use of Bengali tokens with minimal English code-mixing. Since all-mpnet-base-v2
is primarily trained on English text, it lacks robust multilingual alignment capabilities. Consequently,
Bengali-dominant inputs fall outside its pretraining distribution, leading to weaker semantic
representations. The absence of sufficient English context limits the model’s ability to generate meaningful
embeddings, thereby reducing retrieval effectiveness for such queries.</p>
        <p>Overall, this analysis shows that the model’s performance variation is highly query-specific.
Addressing this issue by improving the handling of queries with limited code-mixing could significantly
enhance the model’s ability to mitigate false negatives and improve generalization in future iterations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The proposed model secured third place in the overall CMIR-2025 competition, which indicates
its effectiveness. Although the model performed well on the dataset, it is
important to note that the query sets for the train and test phases were relatively small (20 and
30 queries, respectively), leaving room to validate the approach on larger datasets.
While the model showed comparable performance with the hyperparameter
settings mentioned in Section 3.3, an additional analysis based on the first-stage retrieval depth k has been
performed to assess the model’s sensitivity towards it. The other two hyperparameters (k1 and b)
are kept at their default values. Like the error analysis, the sensitivity analysis is
performed on the training queries due to the unavailability of the relevance judgements for test queries.</p>
      <sec id="sec-5-1">
        <title>5.1. Sensitivity Analysis</title>
        <p>This section examines the sensitivity of the model to the first-stage retrieval depth k. After initially
submitting the model tuned with k = 100, it was later evaluated with
seven other values of k to analyse its sensitivity, and the corresponding performance metrics on
the training data were observed. The performance metrics observed for the
different values of k are plotted in Figure 4.</p>
        <p>From the figure, it can be observed that the overall best value of k is 150, with the highest
values of both NDCG@10 and P@10. k = 25 had the highest MAP score, while k = 50 had the
highest value for P@5.</p>
        <p>It can also be observed that the MAP score is the least sensitive to changes in k. On the
other hand, P@5 shows drastic sensitivity when k changes from 25 to 50, where it rises
from the minimum to the maximum. After that, it remains fairly stable before dipping again at
k = 200. For both NDCG@10 and P@10, a gradual but uneven upward trend is observed as
k increases from 25 to 150. However, at k = 175, the values of these two metrics drop drastically.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study proposed a two-stage re-ranking-based framework as a part of the CMIR-2025 shared task in
code-mixed information retrieval [25], [26]. The model uses BM25 as the first-stage ranker to retrieve
the top 100 documents. These documents are then passed to the second stage, where a zero-shot
bi-encoder computes the query and document embeddings. A cosine similarity score between them
provides the relevance measure, which in turn is used to retrieve the top 10 final documents.</p>
      <p>The proposed LexiSemIR model, which represents the second run submitted for the task, secured
the third rank in the competition, demonstrating competitive performance. Although Run 1 achieved
slightly higher NDCG and MAP scores, Run 2 (LexiSemIR) showed improvements in P@5 and P@10,
indicating enhanced early precision at a modest cost in MAP and NDCG.</p>
      <p>This study also provides a detailed analysis of the model on the training data. A comprehensive error
analysis was conducted on the training queries, offering valuable insights into the model’s performance,
biases, and potential areas for improvement. In addition, a sensitivity analysis on the training queries
highlighted the performance trends with varying values of the first-stage retrieval depth k.</p>
      <p>Future work could focus on more advanced preprocessing of code-mixed data and the incorporation
of language-specific encoders to achieve a more accurate semantic understanding.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[18] A. Deroy, S. Maity, Retrievegpt: Merging prompts and mathematical models for enhanced
code-mixed information retrieval, arXiv preprint arXiv:2411.04752 (2024).
[19] J. Savoy, L. Dolamic, M. Akasereh, Information retrieval with hindi, bengali, and marathi
languages: Evaluation and analysis, in: Multilingual Information Access in South Asian Languages:
Second International Workshop, FIRE 2010, Gandhinagar, India, February 19-21, 2010 and Third
International Workshop, FIRE 2011, Bombay, India, December 2-4, 2011, Revised Selected Papers,
Springer, 2013, pp. 334–352.
[20] J. Leveling, G. J. Jones, Sub-word indexing and blind relevance feedback for english, bengali, hindi,
and marathi ir, ACM Transactions on Asian Language Information Processing (TALIP) 9 (2010)
1–30.
[21] D. Mandal, S. Dandapat, M. Gupta, P. Banerjee, S. Sarkar, Bengali and hindi to english
cross-language text retrieval under limited resources, in: CLEF (Working Notes), 2007.
[22] S. Bandyopadhyay, T. Mondal, S. K. Naskar, A. Ekbal, R. Haque, S. R. Godhavarthy, Bengali, hindi
and telugu to english ad-hoc bilingual task at clef 2007, in: Workshop of the Cross-Language
Evaluation Forum for European Languages, Springer, 2007, pp. 88–94.
[23] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., Okapi at TREC-3,
British Library Research and Development Department, 1995.
[24] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, Mpnet: Masked and permuted pre-training for language
understanding, Advances in Neural Information Processing Systems 33 (2020) 16857–16867.
[25] S. Chanda, K. Tewari, S. Pal, Overview of the cmir track at fire 2025: Code-mixed information
retrieval from social media data, in: FIRE ’25: Proceedings of the 17th Annual Meeting of the
Forum for Information Retrieval Evaluation, December 17-20, Varanasi, India, Association for
Computing Machinery (ACM), New York, NY, USA, 2025.
[26] S. Chanda, K. Tewari, S. Pal, Overview of the cmir track at fire 2025: Code-mixed information
retrieval from social media data, in: K. Ghosh, T. Mandl, S. Pal, S. Majumdar, A. Chakraborty
(Eds.), Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2025), December 17-20,
Varanasi, India, CEUR-WS.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , Introduction to information retrieval, volume
          <volume>39</volume>
          , Cambridge University Press Cambridge,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Hambarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <article-title>Information retrieval: recent advances and beyond</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>76581</fpage>
          -
          <lpage>76604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>The efect of stopword removal on information retrieval for code-mixed data obtained via social media</article-title>
          ,
          <source>SN Comput. Sci. 4</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data</article-title>
          ,
          <source>in: FIRE 2024 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          , p.
          <fpage>124</fpage>
          -
          <lpage>128</lpage>
          . URL: https://ceur-ws.org/Vol-4054/T2-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, Association for Computing Machinery</article-title>
          ,
          <year>2025</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , S.-w. Hwang,
          <article-title>Contrastivemix: overcoming code-mixing dilemma in cross-lingual transfer for information retrieval</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          (Volume
          <volume>2</volume>
          : Short Papers)
          ,
          <year>2024</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          , et al.,
          <article-title>Search engines: Information retrieval in practice</article-title>
          , volume
          <volume>520</volume>
          ,
          Addison-Wesley Reading
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaushish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vijayvargiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Optimized spoken query cross-lingual document retrieval using bm25 and neural re-ranking with adamw</article-title>
          ,
          <source>in: 2025 4th OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 5</source>
          .0, IEEE,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1910.14424</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1901.04085</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Serverless BM25 search and BERT reranking</article-title>
          ,
          <source>in: DESIRES</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sannigrahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van Genabith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>España-Bonet</surname>
          </string-name>
          ,
          <article-title>Are the best multilingual document embeddings simply based on sentence embeddings?</article-title>
          ,
          <source>arXiv preprint arXiv:2304.14796</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Litschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <article-title>Boosting zero-shot cross-lingual retrieval by training on artificially code-switched data</article-title>
          ,
          <source>arXiv preprint arXiv:2305.05295</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W. A. G.</given-names>
            <surname>Kodri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fitriadi</surname>
          </string-name>
          ,
          <article-title>Fine-hybrid: Integration of BM25 and fine-tuned SBERT to enhance search relevance</article-title>
          ,
          <source>Teknika</source>
          <volume>14</volume>
          (
          <year>2025</year>
          )
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets</article-title>
          ,
          <source>Computación y Sistemas</source>
          <volume>20</volume>
          (
          <year>2016</year>
          )
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>Overview of the mixed script information retrieval (MSIR) at FIRE-2016</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kowsher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Bengali information retrieval system (BIRS)</article-title>
          ,
          <source>International Journal on Natural Language Computing (IJNLC)</source>
          <volume>8</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>