<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Romanised to Relevant: Multistage Information Retrieval for Code-Mixed Multilingual Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lakshay Sawhney</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumit Goswami</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Defence Research and Development Organisation (DRDO)</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Thapar Institute of Engineering and Technology</institution>
          ,
          <addr-line>Patiala, Punjab</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>The emergence of multilingual communication on social media platforms such as Facebook, Twitter and WhatsApp poses a compelling challenge for information retrieval in code-mixed conditions within the field of Artificial Intelligence, particularly Natural Language Processing. In this paper, we address the problem of Code-Mixed Information Retrieval for English-Bengali text with special attention to informal user-generated content in Facebook posts using a corpus of 107,900 documents. In contrast to previous research, which is mostly limited to either a monolingual or well-formatted corpus, our model addresses the compound problems of transliteration, inconsistent orthography and syntactic ambiguity inherent to code-mixed communication. We propose a multistage retrieval and reranking pipeline that combines lexical retrieval (BM25), dense semantic retrieval (E5) and cross-encoder reranking techniques (MiniLM), further optimised using a meta-learner based on XGBoost. Our approach achieves a peak NDCG@10 score of 99.68% during model development and validation. The final system, submitted as Team Defense_NLP, secured 2nd position internationally at the CMIR 2025 Data Competition, demonstrating the robustness of the proposed methodology across diverse evaluation settings. We further present novel insights into preprocessing failures, cross-encoder limitations (XLM-R) and score fusion strategies. Overall, the proposed architecture establishes a new benchmark for information retrieval in defence-relevant multilingual and code-mixed contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-Mixed IR</kwd>
        <kwd>English-Bengali Retrieval</kwd>
        <kwd>Cross-Encoder</kwd>
        <kwd>MiniLM</kwd>
        <kwd>Meta-Learner</kwd>
        <kwd>BM25</kwd>
        <kwd>E5</kwd>
        <kwd>XGBoost</kwd>
        <kwd>DRDO</kwd>
        <kwd>GPT-3.5</kwd>
        <kwd>Social Media Corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The problem addressed in this work arises from the rapid proliferation of multilingual and transliterated
communication on social-media platforms such as Facebook, Twitter and WhatsApp, which has posed
significant challenges to traditional Information Retrieval (IR) systems. A notable linguistic phenomenon
in South Asia is the widespread use of code-mixed Bengali-English text, written primarily in romanized
form without adherence to any orthographic standard. These messages, often semi-structured and
informal, resist tokenization, segmentation and traditional keyword-based retrieval approaches.</p>
      <p>The motivation behind this study stems from its national relevance in defence-oriented contexts.
Organisations such as the Defence Research and Development Organisation (DRDO) increasingly
require robust tools to extract meaningful signals from publicly available communication in Indian
languages, especially in regions where roman-script multilingualism is dominant. English–Bengali,
being one of the most widely spoken bilingual pairs in eastern India, presents a unique set of challenges
which include non-standard spellings, mixed vocabulary, lack of grammar formalism and semantic drift
across languages.</p>
      <p>
        However, despite growing interest in code-mixed IR, a clear research gap remains in building
lightweight and deployable systems for bilingual social-media retrieval. Recent work in this area,
including the FIRE 2024 shared task on Code-Mixed Information Retrieval [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], focuses specifically
on social-media datasets involving English–Bengali transliterated queries. While prompting-based
large language models (LLMs), such as GPT-3.5 Turbo [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], achieved state-of-the-art performance in
that shared task, such approaches are computationally expensive, require careful prompt engineering,
and are often unsuitable for real-time or constrained deployments in institutional and defence-oriented
settings. Our work proposes an alternative approach that is lightweight, interpretable, and tunable,
without sacrificing retrieval accuracy.
      </p>
      <p>
        To address this research gap, the contributions of this paper are structured around a multi-stage
hybrid retrieval and reranking pipeline that combines both lexical and semantic signals with a final
score-fusion mechanism:
• A lexical retriever (BM25) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to ensure high-recall matches based on surface-level term overlap.
• A dense retriever (E5) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to capture semantic relationships in noisy, romanized queries.
• A cross-encoder reranker (MiniLM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], fine-tuned on domain-specific triplets.
• Finally, a meta-learning module (XGBoost) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to dynamically fuse relevance signals from different
stages.
      </p>
      <p>Our pipeline was evaluated on a code-mixed English-Bengali social media corpus modeled after FIRE
shared task data, achieving a peak NDCG@10 of 99.68%, significantly outperforming all previously
reported systems under comparable conditions.</p>
      <p>
        In addition to the overall performance gains, our study also offers novel empirical insights into several
underexplored aspects of code-mixed IR:
• Phonetic normalization, stemming, and lemmatization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], while generally helpful in monolingual NLP, were found to degrade performance in noisy code-mixed settings;
• Heavier rerankers such as XLM-R [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and contrastive training pipelines [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] failed to generalize
well due to overfitting and inefficiency;
• Surprisingly, a lightweight cross-encoder like MiniLM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], when coupled with score fusion,
consistently outperformed larger multilingual transformers;
• Finally, our meta-learner approach, using XGBoost [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], contributed significant improvements
over static fusion or single-stage reranking.
      </p>
      <p>These findings provide not only a practical and deployable system, but also a deeper understanding
of the failure modes and design choices that shape real-world code-mixed IR performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art</title>
      <p>In the context of our study, the task focuses on retrieving relevant social-media posts written in informal,
code-mixed English–Bengali text. The goal of this section is to summarize existing retrieval approaches
that address similar challenges of transliteration, spelling inconsistency, and multilingual noise.</p>
      <sec id="sec-2-1">
        <title>2.1. Lexical Retrieval Methods</title>
        <p>
          Traditional IR systems rely on lexical overlap using models like BM25. The FIRE 2021 CMIR task [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
defined the first formal evaluation framework for code-mixed retrieval and evaluated term-matching
methods over romanized English-Bengali queries. These methods compute a score function
BM25(q, d) ∝ ∑_{t ∈ q ∩ d} IDF(t) · TF(t, d) / |d|,
which assumes lexical alignment, an assumption that breaks under noisy, informal spelling (e.g., “kemon”
vs “kaemon” vs “kemne”).
        </p>
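        <p>The BM25 ranking above can be sketched in a few lines. This is an illustrative, self-contained implementation using the common non-negative IDF variant (a +1 inside the logarithm); it is not the evaluation framework’s code, and the toy romanized documents are invented.</p>
        <preformat>
```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25.

    Uses the common non-negative IDF variant (a +1 inside the log);
    real systems would use an inverted index rather than this loop."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

# toy romanized code-mixed documents (invented for illustration)
docs = [["kemon", "acho", "bhai"],
        ["asian", "institute", "gastroenterology"],
        ["kemon", "lagche", "movie"]]
print(bm25_scores(["kemon", "acho"], docs))
```
        </preformat>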
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dense &amp; Semantic Retrieval</title>
        <p>
          To overcome lexical mismatch, dense embedding methods such as SBERT [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and E5 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have been
employed, where queries and documents are mapped into a shared semantic space:
s_dense(q, d) = ⟨E_q(q), E_d(d)⟩,
where E_q and E_d are transformer-based encoders. However, these methods struggle in transliterated
domains unless pre-trained on multilingual or romanized corpora.
        </p>
        <p>
          E5, introduced by Wang et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], improves over SBERT by using an instruction-tuned transformer
to separate query and document encoders, but its zero-shot application to code-mixed queries is still
limited by embedding drift and lack of language-specific fine-tuning.
        </p>
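        <p>The shared-space retrieval idea can be illustrated with a stub encoder. The trigram-hashing encoder below is a hypothetical stand-in for SBERT or E5, chosen only to show how dot-product scoring tolerates spelling drift such as “kemon” vs “kaemon”; it is not a real dense retriever.</p>
        <preformat>
```python
import hashlib
import math

def toy_encode(text, dim=256):
    """Hypothetical stand-in encoder: hashes character trigrams into a
    unit vector. A real system would call SBERT or E5 here; this stub
    only illustrates shared-space dot-product retrieval."""
    v = [0.0] * dim
    s = f"  {text.lower()}  "
    for i in range(len(s) - 2):
        h = int(hashlib.md5(s[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dense_score(q, d):
    # dot product of unit vectors, i.e. cosine similarity of E(q) and E(d)
    return sum(a * b for a, b in zip(toy_encode(q), toy_encode(d)))

# spelling drift ("kemon" vs "kaemon") still shares many trigrams
print(dense_score("kemon acho", "kaemon acho"))
print(dense_score("kemon acho", "train schedule dhaka"))
```
        </preformat>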
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Cross-Encoder Reranking</title>
        <p>
          Cross-encoders, such as BERT [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], MiniLM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and XLM-R [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], compute relevance scores by modeling
the query-document pair jointly:
        </p>
        <p>s_cross(q, d) = MLP([CLS; q; d])</p>
        <p>
          This setup allows for richer interaction, but scales poorly with corpus size. The FIRE 2024 winning
systems [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] used GPT-3.5 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for joint representations, but required prompt templates and incurred high
latency and inference cost, rendering them impractical for deployments such as defence systems or embedded
devices.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Preprocessing &amp; Code-Mixed Language Handling</title>
        <p>
          Past work such as Chanda et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Mandal et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] emphasized normalization and language
identification as preprocessing steps. However, phonetic normalization and stemming [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] were only
partially effective due to the unstable orthographic structure of roman-script Bengali, which lacks a
standardized grapheme-phoneme mapping [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>Our findings challenge this practice by empirically showing that conventional NLP preprocessing
pipelines f : q ↦→ q′ actually introduce noise in code-mixed IR tasks when q is not native-script-safe.</p>
        <p>As evident from Table 1, each existing approach suffers from key limitations in handling noisy,
transliterated code-mixed queries. Our proposed pipeline systematically addresses all these weaknesses
through a multi-stage design that incorporates lexical recall, semantic embedding, and learned score
fusion to deliver high top-k accuracy.</p>
        <p>
          In addition to the five core works described in detail, we also reviewed several other influential papers
that shaped our model design and understanding of the CMIR domain [
          <xref ref-type="bibr" rid="ref15">15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27</xref>
          ]. These works span dense retriever techniques, reranker training methods, multi-stage
ranking architectures, and learning-to-rank strategies.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Fusion and Multi-Stage Retrieval</title>
        <p>Some recent works have attempted simple score fusion,
s_hybrid = α · s_lex + (1 − α) · s_dense,
but lacked deeper reranking or learning-to-rank stages. Our work builds on this by introducing a
meta-reranker</p>
        <p>
          f_meta : R³ → R
that learns optimal fusion via XGBoost. Related literature on Learn-to-Rank [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and score injection
methods [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] informs this design, though few papers apply them to code-mixed domains.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Problem Definition</title>
      <p>We evaluated our retrieval system on the code-mixed FIRE CMIR-2025 shared task dataset [28, 29],
comprising a corpus of 107,900 documents and 20 unique queries. The corpus is derived from
real-world Facebook posts written in informal English–Bengali code-mixed language, exhibiting spelling
inconsistency, phonetic variance and morphological noise. All queries in the training set are exclusively
written in romanized script, with no native Bengali script tokens present.</p>
      <p>On average, each document contains 12.64 tokens, while queries average 40.20 tokens, indicating high
lexical sparsity and a wide semantic span. To standardize token representation and enable uniform input
encoding, we transliterate all native-script Bengali tokens into romanized form during preprocessing
(discussed in Section 6) to generalise the model.</p>
      <p>The dataset includes binary relevance judgments for each query–document pair. Let Q =
{q1, q2, . . . , qm} be the set of input queries, D = {d1, d2, . . . , dn} the document corpus, and rel :
Q × D → {0, 1} the binary ground-truth function indicating relevance.</p>
      <p>The goal of the Information Retrieval task is to learn a scoring function:</p>
      <p>rank : Q × D → R,
that induces a total ordering over documents for any given query, such that the top-k documents
selected by rank(q, ·) maximize agreement with rel(q, ·) under standard evaluation metrics such as
NDCG@10.</p>
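      <p>NDCG@10, the metric used throughout, can be computed as follows for binary relevance; this is a straightforward sketch of the standard definition, with the example ranking invented for illustration.</p>
      <preformat>
```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k for binary relevance labels.

    ranked_rels lists rel(q, d) for documents in the order induced by
    the learned scoring function rank(q, .)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

# a ranking that places two of three relevant documents near the top
print(ndcg_at_k([1, 0, 1, 1, 0, 0, 0, 0, 0, 0]))
```
      </preformat>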
      <p>Table 2 shows representative user queries from the CMIR dataset, demonstrating the informal,
code-mixed, and phonetically inconsistent structure of real-world search inputs. One such query reads:
“hi all amar akta urgent information er prayojon apnara keu ki asian institute of
gastroenterology te consult koriyechen amar mother in law ke okhane niye jete chai
for consultation”</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>We designed a multi-stage Information Retrieval (IR) pipeline optimized for code-mixed information
retrieval. Formally, our system defines a scoring function</p>
      <p>rank : Q × D → R,
where Q is the set of queries and D the corpus, to maximise relevance agreement under NDCG@10.</p>
      <sec id="sec-4-1">
        <title>4.1. Lexical + Semantic Baseline</title>
        <p>Our baseline model combines two components, BM25 and SBERT, as a lexical retriever and semantic reranker,
respectively, combined via score fusion:
• BM25 Retrieval: BM25 ranks a document d ∈ D against a query q ∈ Q using the scoring function
BM25(q, d) = ∑_{t ∈ q} log((N − n_t + 0.5) / (n_t + 0.5)) · tf_{t,d} · (k1 + 1) / (tf_{t,d} + k1 · (1 − b + b · |d| / avgdl)),
where N is the corpus size, n_t is the number of documents containing term t, tf_{t,d} is the frequency
of t in d, and k1 = 1.2, b = 0.75.</p>
        <p>BM25 is robust for exact token matching but lacks semantic understanding, especially in noisy,
code-mixed settings.
• SBERT Reranker: Each query-document pair (q, d) in the BM25 top-k is encoded using a
Sentence-BERT model to obtain contextual embeddings
v_q = SBERT(q), v_d = SBERT(d).
The semantic relevance is computed via cosine similarity:
s_SBERT(q, d) = cos(v_q, v_d) = (v_q · v_d) / (‖v_q‖ · ‖v_d‖).
This reranking captures cross-lingual and transliterated semantics that BM25 alone cannot.
• Score Fusion: To jointly leverage lexical and semantic signals, we apply weighted score fusion:
s_hybrid(q, d) = α · s_BM25(q, d) + (1 − α) · s_SBERT(q, d),
where α ∈ [0, 1] balances surface-form matching and semantic similarity.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Hybrid Retrieval</title>
        <sec id="sec-4-2-1">
          <title>E5-Mistral</title>
          <p>To further strengthen semantic generalization, we ensemble BM25 with a stronger dense retriever, E5-Mistral:
s_hybrid(q, d) = α · s_BM25(q, d) + (1 − α) · s_E5(q, d),
where s_E5 denotes cosine similarity between E5 embeddings of q and d, and α ∈ [0, 1] is tuned empirically.</p>
          <p>As evident from Fig. 1, we find α = 0.6 optimal, balancing E5’s semantic
generalization with BM25’s precision.</p>
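          <p>The weighted fusion and the α sweep behind Fig. 1 can be sketched as follows. The min-max normalisation of the two score scales and the per-document score values are our illustrative assumptions, not details taken from the paper.</p>
          <preformat>
```python
def minmax(xs):
    """Rescale scores to [0, 1] so lexical and dense scales are
    comparable (the normalisation choice is an illustrative assumption)."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in xs]

def hybrid(bm25, e5, alpha):
    # s_hybrid(q, d) = alpha * s_BM25(q, d) + (1 - alpha) * s_E5(q, d)
    b, e = minmax(bm25), minmax(e5)
    return [alpha * x + (1 - alpha) * y for x, y in zip(b, e)]

# invented per-document scores for one query, mimicking the alpha sweep
bm25_raw = [7.1, 3.2, 5.5, 0.4]
e5_raw = [0.62, 0.81, 0.55, 0.20]
for alpha in (0.0, 0.4, 0.6, 1.0):
    fused = hybrid(bm25_raw, e5_raw, alpha)
    order = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
    print(alpha, order)
```
          </preformat>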
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Reranking via Cross Encoder</title>
        <p>The top-100 documents from hybrid retrieval are reranked using a MiniLM CrossEncoder fine-tuned on
CMIR relevance pairs using margin ranking loss:
ℒ_rank = ∑_{(q, d⁺, d⁻)} max(0, m − f(q, d⁺) + f(q, d⁻)),
where f is the CrossEncoder score and m = 1 is the margin.</p>
        <p>XLM-R Large was also evaluated but consistently underperformed due to domain mismatch and
limited training data, showing degraded generalization as reported in Table 3.</p>
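        <p>The margin ranking loss can be sketched directly from its definition. The token-overlap scorer below is a hypothetical stand-in for the fine-tuned MiniLM CrossEncoder f, and the triplet is invented for illustration.</p>
        <preformat>
```python
def margin_ranking_loss(triples, score, m=1.0):
    """Sum over (q, d_pos, d_neg) of max(0, m - f(q, d_pos) + f(q, d_neg)).
    `score` is a hypothetical stand-in for the fine-tuned MiniLM
    CrossEncoder f; here we use plain token overlap."""
    return sum(max(0.0, m - score(q, dp) + score(q, dn)) for q, dp, dn in triples)

def overlap(q, d):
    # fraction of query tokens present in the document
    qs, ds = set(q.split()), set(d.split())
    return len(qs.intersection(ds)) / max(len(qs), 1)

triples = [("kemon acho", "kemon acho bhai", "train schedule dhaka")]
print(margin_ranking_loss(triples, overlap))
```
        </preformat>
A well-separated triplet drives the hinge term to zero; swapping the positive and negative documents makes the loss positive.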
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Meta-Learner for Reranking via Dynamic Score Fusion</title>
        <p>
          Inspired by prior work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we treat the fusion of BM25, E5 and MiniLM scores as a feature vector
and learn an optimal scoring function (as explained in detail in Algorithm 1):
s_{q,d} = [s_BM25, s_E5, s_MiniLM] ∈ R³
        </p>
        <p>f_meta : R³ → R</p>
        <p>We compare Logistic Regression (LR) and XGBoost. LR fails to capture non-linear dependencies and
underperforms. XGBoost, trained on rankwise relevance as supervision, generalizes best and produces the
final prediction score.</p>
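        <p>The meta-learner’s feature fusion can be illustrated with a minimal stand-in. We substitute a tiny gradient-descent linear model for XGBoost so the sketch runs without external dependencies (the actual system learns boosted trees over the same feature vectors), and all scores and labels below are invented.</p>
        <preformat>
```python
# Stand-in for the XGBoost meta-learner f_meta mapping R^3 to R: a tiny
# gradient-descent linear model keeps the sketch dependency-free; the
# actual system learns boosted trees over the same feature vectors.
def fit_linear(X, y, lr=0.1, epochs=500):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            err = pred - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w

# feature vector s_{q,d} = [s_BM25, s_E5, s_MiniLM]; scores/labels invented
X = [[0.9, 0.8, 0.95], [0.2, 0.3, 0.10], [0.7, 0.6, 0.80], [0.1, 0.2, 0.05]]
y = [1.0, 0.0, 1.0, 0.0]
w = fit_linear(X, y)

def meta(s):
    # fused relevance score for one query-document feature vector
    return sum(wj * xj for wj, xj in zip(w, s))

print(meta([0.8, 0.7, 0.9]), meta([0.1, 0.1, 0.1]))
```
        </preformat>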
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Experimental Setup</title>
        <p>This section details the experimental environment used for training, evaluation and inference. It
discusses the hardware configuration, key hyperparameter settings for each component of the pipeline
and the overall computational cost associated with model fine-tuning and retrieval.</p>
        <p>All experiments were executed using Google Colab T4 GPUs and Colab CPUs across multiple sessions.
Lightweight retrieval stages (BM25 and E5 inference) were run on CPU, while MiniLM fine-tuning and
reranking were executed on GPU.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The proposed pipeline integrates multiple retrieval and reranking stages to optimize top-k ranking
performance on code-mixed Bengali-English queries. We employ a hybrid retriever that linearly fuses
BM25 and E5-Mistral dense embeddings using an empirically tuned weight of α = 0.6, followed by
reranking using a MiniLM-based CrossEncoder fine-tuned with margin ranking loss. To further refine
ranking fidelity, we incorporate a meta-learning layer using XGBoost that fuses lexical, semantic and
reranker scores for final ordering.</p>
      <p>This multi-stage architecture is intentionally lightweight yet expressive, enabling high retrieval
quality even under limited supervision and noisy code-mixed data. The complete retrieval and reranking
pipeline is illustrated in Fig. 2.</p>
      <p>Tables 6 and 7 report the final evaluation metrics on the training and test sets, respectively. While
the meta-ranker achieves near-oracle performance on training data, test-time results show that MiniLM
provides stronger ranking quality, whereas the meta-learner yields higher precision at deeper cutoffs.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The previous sections presented the methodology, experimental setup and quantitative results of our
multi-stage retrieval pipeline. In this section, we shift focus from raw performance metrics to critical
analysis of the system’s behavior across different components and experimental conditions. Specifically,
we examine three key aspects: (i) the role of preprocessing and normalization strategies and why certain
conventional NLP techniques failed in code-mixed environments, (ii) the limitations of advanced
model-level experiments such as XLM-R reranking, contrastive training, and joint fine-tuning, and (iii) the
discrepancy observed between training and test evaluations, which highlights important considerations
for generalization in real-world deployments. Together, these analyses reveal not only the strengths of
our approach but also the boundary conditions under which it may falter, thereby offering a deeper
understanding of design choices for code-mixed information retrieval.</p>
      <sec id="sec-6-1">
        <title>6.1. Preprocessing Experiments and Insights</title>
        <p>Preprocessing formed a critical part of our investigation, as conventional NLP pipelines are often
assumed to improve retrieval but may behave unpredictably in noisy, code-mixed settings. To evaluate
this systematically, we designed a series of ablation experiments covering both basic cleaning steps and
advanced normalization techniques. The following subsections present the setup, techniques explored,
empirical results, and failure cases, followed by the key insights that guided our final preprocessing
pipeline.
6.1.1. Motivation and Experimental Setup
We conducted extensive preprocessing ablation studies on the BM25 baseline to assess whether
conventional preprocessing techniques that are typically beneficial in monolingual IR translate well to
informal, multilingual, and transliterated data. All preprocessing was applied on a training set of 20
English–Bengali code-mixed queries (as given in Section 3). Evaluation was based solely on lexical
retrieval performance (BM25), as semantic retrievers like E5 or SBERT rely on richer token semantics
and may be negatively impacted by aggressive text reduction (e.g., stopword removal).
6.1.2. Techniques Used
We categorized preprocessing into two tiers:
• Tier 1 (Basic Cleaning): this included Lowercasing, Punctuation Removal, English Stopword</p>
        <p>Removal, Bengali Stopword Removal and Combined Stopword Removal [30].
• Tier 2 (Advanced Normalisation): this included Stemming, Lemmatization, Phonetic
Normalisation, Named Entity Recognition and Preservation and Combined Tier-2 Pipeline.</p>
        <p>All variants were benchmarked against a lowercase-only baseline.
6.1.3. Results and Observations
The results of the preprocessing techniques are shown in Table 8.
6.1.4. Key Insights
• Best Performing Pipeline: Lowercase + Combined Stopwords (English + Bengali) removal.
• Worst Performing Pipeline: Combined Tier-2 preprocessing (which included all advanced
normalizations).
• Techniques that Reduced Performance: Stemming/Lemmatization, Phonetic Normalization,
and NER &amp; NEP.
6.1.5. Preprocessing Failure Analysis and Novel Insights
• Stemming / Lemmatization
– Failure Reason: These techniques distorted syntactic and morphological variants into
unnatural roots, often collapsing dissimilar terms or breaking lexical overlaps (e.g., “running”
and “runway” both reduced to “run”).
– Novel Insight: Lexical retrievers like BM25 benefit more from surface-form diversity than
stemmed uniformity in informal, morphologically diverse code-mixed corpora.
• Phonetic Normalization
– Failure Reason: Over-aggressive collapsing of transliterated terms led to false matches or
mismatches (e.g., “balcony”, “balconi”, “balkauni”), reducing exact string overlap.
– Novel Insight: In noisy, user-generated code-mixed text, naive phonetic normalization
increases variance and hurts sparse lexical models like BM25.
• NER &amp; NEP
– Failure Reason: Replacing entities (e.g., “Dr Gurava Reddy”) removed context-rich surface
forms, breaking exact lexical matches.
– Novel Insight: Entity abstraction oversimplifies code-mixed queries where informative
context and redundancy often carry retrieval-relevant signal.
6.1.6. Transliteration of Native Script
To generalize the model for test queries written in native Bengali script, transliteration [31] was added
in the preprocessing pipeline to convert the native script into Romanized Bengali on which the model is
trained. This step significantly improves performance in the case of queries containing native Bengali
script.
6.1.7. Summary of Findings
Our experiments reveal that standard NLP preprocessing is not directly transferable to code-mixed
retrieval tasks. Excessive normalization can discard subtle cues inherent in code-mixed discourse.
Lexical models benefit from minimal yet targeted cleaning: lowercasing, stopword removal in both
languages, and language-aware transliteration. Importantly, these insights are most relevant to sparse
lexical methods like BM25 and were not tested on dense retrievers, which rely on semantic completeness
of input text.</p>
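        <p>The best-performing Tier-1 pipeline (lowercasing plus combined English and Bengali stopword removal) can be sketched as follows; the stopword sets here are tiny illustrative samples, not the lists used in the experiments.</p>
        <preformat>
```python
import re
import string

# Tiny illustrative stopword samples; the experiments used fuller lists.
EN_STOP = {"the", "is", "a", "of", "for", "to", "all"}
BN_STOP = {"amar", "ki", "ke", "er", "keu"}   # romanized Bengali function words

def clean(text):
    """Best-performing Tier-1 pipeline: lowercase, strip punctuation,
    then drop combined English + Bengali stopwords."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return [t for t in text.split() if t not in EN_STOP and t not in BN_STOP]

print(clean("Hi all, amar akta urgent information er prayojon!"))
```
        </preformat>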
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Model-Level Experiments and Limitations</title>
        <p>In this section, we analyze three experimental approaches that underperformed in our code-mixed IR
task: (i) XLM-R Large reranking, (ii) contrastive pairwise fine-tuning, and (iii) joint retriever-reranker
training. While each technique has shown promise in prior literature, their failure in our context reveals
critical caveats and usage boundaries.
6.2.1. XLM-R Reranking Failure
• What we did: We fine-tuned an XLM-R Large CrossEncoder on &lt;query, document, label&gt;
triplets using margin ranking loss for reranking the top-100 retrieved candidates.
• Reasons for Failure: Despite its multilingual pretraining, XLM-R failed to adapt well to noisy,
transliterated, code-mixed English–Bengali inputs. Its deeper architecture (560M parameters)
likely overfit the small training set (25 queries × 150 triplets), resulting in poor generalization.</p>
        <p>Moreover, its reliance on native-script embeddings did not align with romanized inputs.
• When Not to Use: Avoid deploying XLM-R on small code-mixed datasets with transliterated
text and informal grammar, especially when compute is constrained or generalization is critical.
6.2.2. Contrastive Training Failure
• What we did: We attempted pairwise contrastive fine-tuning of MiniLM CrossEncoder using
triplet loss, with randomly sampled &lt;query, positive, negative&gt; triples from the hybrid
top-100 results.
• Reasons for Failure: The contrastive signal collapsed due to noisy negatives: BM25+E5 hybrid
results often returned false positives or borderline irrelevants, which blurred the distinction
between positive and negative pairs. As a result, the margin loss could not learn meaningful
ranking boundaries.
• When Not to Use: Avoid contrastive fine-tuning on small or noisy retrieval sets, especially when
hard negatives are unavailable or unverified. Sampling from top-100 candidates can introduce
semantically ambiguous pairs that hinder training.
6.2.3. Joint Fine-Tuning Failure
• What we did: We jointly fine-tuned a dual-encoder retriever (E5) and a MiniLM reranker using
score fusion supervision from existing relevance annotations and model scores.
• Reasons for Failure: The dual loss signal was unstable: the retriever gradients interfered with the
reranker’s learning trajectory. Given the small query set and sparse labels, the retriever’s updates
dominated and distorted the reranker’s focus on local query-document semantics. Furthermore,
fine-tuning both components simultaneously reduced interpretability of individual contributions.
• When Not to Use: Joint fine-tuning should be avoided in low-resource or high-variance scenarios,
where label noise and signal conflict between stages can reduce modular robustness and impair
final reranking quality.
6.2.4. Final Insights
Across all three techniques, a common theme emerged: complex architectures and coupled training
objectives underperform in noisy, small-scale, code-mixed environments. Simpler models (like
fine-tuned MiniLM CrossEncoder) with clean supervision and modular pipelines yielded significantly higher
top-k accuracy. This highlights the importance of task-appropriate modeling over brute-force scaling.
As visualized in Fig. 3, MiniLM’s lightweight design provides a clear advantage in both recall and
ranking quality, confirming that model simplicity yields better generalization in code-mixed retrieval.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Training–Testing Evaluation Gap</title>
        <p>While our model achieved state-of-the-art performance on the training set, a noticeable gap was observed
when evaluated on test queries. This discrepancy is not unique to our system but is symptomatic of
code-mixed IR tasks, where the linguistic noise and query diversity in test data often exceed what can
be captured in a limited training distribution. Notably, the system still placed 2nd overall at the CMIR
2025 Data Competition, indicating it is not merely overfitted but captures transferable retrieval patterns.
Although the gap appears numerically large, it reflects extreme data imbalance rather than model
instability. With only twenty labeled training queries, variance across folds is statistically expected; a
paired analysis of query-level NDCG confirmed consistent rank ordering across models, supporting the
reliability of the observed trends.
6.3.1. Reasons for the gap
• Data scarcity:
• Domain mismatch:
• Supervision noise:
– Only ∼20 labeled queries were available in the training set.
– As a result, fine-tuned components such as MiniLM and XGBoost inevitably learned patterns
biased toward the training distribution, limiting their ability to generalize.
– Test queries exhibited greater spelling irregularity, transliterated variants unseen during
training, and occasional native-script forms.
– Our transliteration step partially alleviated this issue but could not capture all
out-ofdistribution cases.
– Negative examples generated for contrastive fine-tuning often contained borderline or
ambiguous cases.
– This blurred the decision boundary between relevant and non-relevant pairs, making it
harder for rerankers to learn sharp distinctions.
6.3.2. Strengths that stabilised performance
• Modular design (BM25 + E5 + MiniLM + fusion):
– The multi-stage pipeline provided complementary signals that compensated when a single stage underperformed.
– This modularity prevented catastrophic collapse and ensured more stable performance under noisy test conditions.
• XGBoost meta-learner (score fusion):
– Learned dynamic weights across lexical, dense, and reranker stages.
– Reduced overfitting to the training distribution and improved consistency across diverse test queries.
• Transliteration pipeline:
– Directly boosted recall on queries containing Bengali-script tokens.
– Offered a robustness advantage that was absent in most baseline systems.</p>
        <p>
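The actual meta-learner is a trained XGBoost model over stage scores; as a stdlib-only stand-in, the sketch below min-max normalizes each stage's scores and applies a convex weight vector. All scores and weights here are invented for illustration.

```python
def minmax(scores):
    """Min-max normalize one stage's scores so stages become comparable."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(s - lo) / span if span else 0.0 for s in scores]

def fuse(stage_scores, weights):
    """Convex combination of normalized per-stage scores for each candidate."""
    cols = [minmax(s) for s in stage_scores]
    return [sum(w * col[i] for w, col in zip(weights, cols))
            for i in range(len(cols[0]))]

# Hypothetical raw scores for five candidates from the three stages.
bm25   = [12.1, 8.4, 7.9, 3.2, 1.0]
dense  = [0.81, 0.79, 0.88, 0.40, 0.35]
rerank = [2.3, 1.1, 2.9, -0.5, -1.2]
fused = fuse([bm25, dense, rerank], weights=(0.3, 0.3, 0.4))
best = max(range(len(fused)), key=fused.__getitem__)  # index of top-ranked doc
```

A learned fuser replaces the fixed weight vector with per-query tree ensembles, but the normalization step is what keeps lexical and dense scores on a common scale in either case.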
6.3.3. Future directions
• More labeled queries:
– Semi-supervised techniques such as pseudo-labeling and active learning can expand the training set.
– This would reduce variance and improve model generalization.
• Hard-negative mining:
– Incorporating more informative negatives in contrastive fine-tuning can sharpen decision boundaries.
– This helps rerankers distinguish between borderline relevant and non-relevant pairs.
• Indic-aware pretraining:
– Cross-lingual or Indic-script corpora could improve robustness to native-script and mixed-script queries.
– Especially useful for unseen transliterated or noisy forms.
• Spelling-variant generation and entity aliases:
– Expanding the query/document space with surface-form variants would enhance test-time coverage.
– Reduces brittleness to creative spellings and inconsistent named entities.
• Per-query dynamic fusion:
– Calibrating retriever–reranker weights based on query-level features or uncertainty.
– Ensures better handling of out-of-distribution (OOD) cases.</p>
        <p>
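The spelling-variant direction can be prototyped with a handful of rewrite rules, as in the sketch below; the rules and tokens are illustrative examples of common romanization alternations, not a validated inventory.

```python
# Illustrative surface-form variant generator for romanized tokens.
# These rewrite rules are example alternations only (assumed, not exhaustive).
RULES = [("ph", "f"), ("aa", "a"), ("ee", "i"), ("w", "v"), ("sh", "s")]

def variants(token):
    """Return the token plus single-rule rewrites in both directions."""
    out = {token}
    for a, b in RULES:
        if a in token:
            out.add(token.replace(a, b))
        if b in token:
            out.add(token.replace(b, a))
    return sorted(out)
```

Indexing or query-expanding with such variant sets is one cheap way to recover matches lost to inconsistent romanization, at the cost of some added noise.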
6.3.4. Takeaway
• Despite the visible train–test gap, the stability of the modular pipeline under noisy, code-mixed conditions, corroborated by the 2nd-place finish, shows that the system generalizes beyond its limited training distribution even where surface indicators suggest overfitting.
• Rather than memorizing the training queries, it captures transferable retrieval behavior, with clear avenues for further generalization and practical lessons for real-world code-mixed IR systems.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we presented a multi-stage retrieval and reranking pipeline tailored for code-mixed
English–Bengali queries. By integrating lexical (BM25), semantic (E5), and cross-encoder (MiniLM)
signals, and further refining them through an XGBoost-based meta-learner, our system achieved
state-of-the-art training performance and secured the 2nd position in the FIRE CMIR-2025 data competition.</p>
      <p>The study highlighted several key insights. First, conventional preprocessing methods such as
stemming, lemmatization and phonetic normalization, though effective in monolingual IR, degraded
performance in noisy code-mixed contexts. Minimal, targeted cleaning combined with transliteration
of native-script tokens proved far more reliable. Second, large multilingual models like XLM-R and
joint fine-tuning strategies failed to generalize under low-resource, noisy conditions, underscoring
the importance of lightweight yet modular architectures. Third, the analysis of training–testing
discrepancies revealed limitations due to data scarcity, domain mismatch, and supervision noise, while
also confirming the stabilizing role of our modular design, transliteration pipeline, and dynamic score
fusion.</p>
      <p>Overall, our findings reinforce that simplicity, modularity, and language-aware preprocessing can
outperform brute-force scaling in code-mixed IR. The proposed architecture not only sets a strong
benchmark but also offers transferable lessons for multilingual and defence-relevant applications.
Future work will focus on expanding labeled data through semi-supervised techniques, incorporating
Indic-aware pretraining and exploring per-query adaptive fusion to further enhance generalization.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to refine grammar, improve wording,
and generate preliminary scaffolds for certain sections. After using this tool, the authors reviewed,
edited and verified the content to ensure accuracy. The authors take full responsibility for the final text.
</p>
      <p>[16] N. Dandekar, Pointwise vs pairwise vs listwise learning to rank, Medium, 2022. URL: https://medium.com/@nikhilbd/pointwise-vs-pairwise-vs-listwise-learning-to-rank-80a8fe8fadfd.
[17] G. de Souza P. Moreira, et al., Enhancing q&amp;a text retrieval with ranking models: Benchmarking, fine-tuning and deploying rerankers for rag, arXiv preprint, 2024. URL: https://arxiv.org/abs/2409.07691.
[18] X. Ren, et al., Rocketqa-v2: A joint training method for dense passage retrieval and passage reranking, 2021. URL: https://www.researchgate.net/publication/355219310_RocketQAv2_A_Joint_Training_Method_for_Dense_Passage_Retrieval_and_Passage_Re-ranking.
[19] P. Lewis, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems (NeurIPS), 2020. URL: https://arxiv.org/abs/2005.11401.
[20] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with bert, arXiv preprint, 2019. URL: https://arxiv.org/abs/1910.14424.
[21] J. Guo, et al., A deep look into neural ranking models for information retrieval, Information Processing &amp; Management 57 (2020) 102067. URL: https://doi.org/10.1016/j.ipm.2019.102067. doi:10.1016/j.ipm.2019.102067.
[22] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of ir techniques, ACM Transactions on Information Systems 20 (2002) 422–446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.
[23] A. Singh, et al., Low-resource neural ir for indic languages, in: Proceedings of IC2SDT, 2022.
[24] R. Bhatt, et al., Efficient keyword-based search in regional news archives, in: Proceedings of IC2SDT, 2023.
[25] S. Sharma, R. Tiwari, Domain adaptation for code-mixed ir, in: Proceedings of IC2SDT, 2024.
[26] S. Goswami, et al., Anomaly detection in social media streams using semi-supervised lda, DRDO Journal, 2020.
[27] S. Goswami, et al., Language-agnostic entity recognition for military surveillance, in: Proceedings of IC2SDT, 2023.
[28] S. Chanda, K. Tewari, S. Pal, Findings of the code-mixed information retrieval from social media data (cmir) shared task at fire 2025, Forum for Information Retrieval Evaluation (Working Notes), CEUR-WS.org, 2025.
[29] S. Chanda, K. Tewari, S. Pal, Overview of the cmir track at fire 2025: Code-mixed information retrieval from social media data, in: Proceedings of the 17th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE 2025), Association for Computing Machinery (ACM), 2025.
[30] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Computer Science 4 (2023) 20. doi:10.1007/s42979-023-01942-7.
[31] M. Kumbhar, K. Thakre, Language identification and transliteration approaches for code-mixed text, Journal of Engineering Science and Technology Review 17 (2024) 63–70. URL: https://doi.org/10.25103/jestr.171.09. doi:10.25103/jestr.171.09.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE</article-title>
          <year>2024</year>
          ),
          <article-title>Association for Computing Machinery</article-title>
          (ACM),
          <year>2024</year>
          . URL: https://doi.org/10.1145/3734947.3735670. doi:10.1145/3734947.3735670.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data</article-title>
          ,
          <source>in: FIRE 2024 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          , p.
          <fpage>124</fpage>
          -
          <lpage>128</lpage>
          . URL: https://ceur-ws.org/Vol-4054/T2-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Exploring chatgpt for next-generation information retrieval: Opportunities and challenges</article-title>
          ,
          <source>Web Intelligence</source>
          <volume>22</volume>
          (
          <year>2024</year>
          )
          <fpage>31</fpage>
          -
          <lpage>44</lpage>
          . URL: https://doi.org/10.3233/WEB-230363. doi:10.3233/WEB-230363.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at trec-3, in:
          <source>Proceedings of the Third Text REtrieval Conference (TREC-3)</source>
          , NIST,
          <year>1995</year>
          . URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/okapi_trec3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Text embeddings by weakly-supervised contrastive pre-training, arXiv preprint</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2212.03533.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Minilm:
          <article-title>Deep self-attention distillation for task-agnostic compression of pre-trained transformers</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2002.10957.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          , et al.,
          <article-title>Learning to rank using gradient descent</article-title>
          ,
          <source>in: Proceedings of the International Conference on Machine Learning (ICML)</source>
          ,
          <year>2005</year>
          . URL: https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-gradient-descent/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nanmaran</surname>
          </string-name>
          ,
          <article-title>Normalization of transliterated words in code-mixed data using seq2seq model &amp; levenshtein distance</article-title>
          ,
          <source>in: Proceedings of the EMNLP Workshop on Noisy User-generated Text (W-NUT)</source>
          ,
          <year>2018</year>
          . URL: https://aclanthology.org/W18-6107.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          , et al.,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2020</year>
          . URL: https://aclanthology.org/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          , et al.,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          . URL: https://aclanthology.org/2020.emnlp-main.550.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval (cmir) at fire 2021</article-title>
          ,
          <source>in: Proceedings of FIRE 2021 Working Notes</source>
          , volume
          <volume>3159</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-3159/T1-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the EMNLP/IJCNLP</source>
          ,
          <year>2019</year>
          . URL: https://aclanthology.org/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Askari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abolghasemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <article-title>Injecting the bm25 score as text improves bert-based re-rankers, arXiv preprint</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2301.09728.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Doval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <article-title>On the performance of phonetic algorithms in microtext normalization</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>113</volume>
          (
          <year>2018</year>
          )
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          . URL: https://doi.org/10.1016/j.eswa.2018.07.016. doi:10.1016/j.eswa.2018.07.016.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Malviya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Mst-r: Multi-stage tuning for retrieval systems and metric evaluation, arXiv preprint</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2412.10313.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>