<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model Fusion for Bridging Linguistic Variability in Bengali-English Code-Mixed Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachana Nagaraju</string-name>
          <email>rachananagaraju20@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosahalli Lakshmaiah Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore, Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Code-mixed text, where lexical elements and grammatical features from multiple languages appear within the same utterance, is highly prevalent in multilingual societies. In the Indian context, users frequently express themselves in their native languages, but usually in a combination of native and Roman scripts, often interspersed with English. This phenomenon poses significant challenges for both language identification and Information Retrieval (IR) due to the lack of standardization, spelling variations, and informal usage. To address these challenges, the Code-Mixed Information Retrieval (CMIR)-2025 shared task at the Forum for Information Retrieval Evaluation (FIRE) 2025 invites researchers to design and develop models capable of retrieving relevant answers from Bengali-English code-mixed text. The task involves building retrieval systems capable of identifying relevant responses to natural language queries in Bengali-English code-mixed text, with evaluation conducted on a held-out Test set using standard IR metrics. In this paper, we - team MUCS - describe our proposed model submitted to the CMIR-2025 shared task, which employs a fusion of traditional retrieval models - Best Matching 25 (BM25), Dirichlet Language Model (DirichletLM), and Query Likelihood Model - Hiemstra_LM (HiemstraLM) - combined using Reciprocal Rank Fusion (RRF) to retrieve relevant answers from Bengali-English code-mixed text written in Roman script. Our experimental results illustrate that this fusion-based retrieval approach improves effectiveness across multiple evaluation metrics, achieving a Mean Average Precision (MAP) of 0.211792, normalized Discounted Cumulative Gain (nDCG) of 0.485517, Precision at cutoff 5 (P@5) of 0.42, and P@10 of 0.30, thereby securing 1st place in the shared task. These results highlight the effectiveness of model fusion techniques like RRF for robust retrieval in noisy, informal, and multilingual online environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-Mixed Information Retrieval</kwd>
        <kwd>Romanization</kwd>
        <kwd>Transliteration</kwd>
        <kwd>Language Identification</kwd>
        <kwd>Bengali-English</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Fusion Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Code-mixing, where lexical elements and grammatical features from multiple languages appear within
the same utterance, is a pervasive phenomenon in multilingual societies. In India, this practice is
particularly widespread on social media platforms, where users often communicate in their native
languages but usually employ a combination of native and Roman scripts, frequently interspersed
with English. Such informal and non-standardized writing introduces challenges for Natural Language
Processing (NLP) tasks, especially language identification and IR. The lack of orthographic consistency,
frequent spelling variations, and transliteration errors make it difficult to accurately retrieve relevant
content from code-mixed corpora [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In recent years, code-mixed text has drawn attention in a
variety of NLP tasks, including language identification, part-of-speech tagging, machine translation,
and sentiment analysis. For instance, identifying language boundaries within a sentence is non-trivial
when tokens are transliterated and phonetically ambiguous. Similarly, sentiment analysis in code-mixed
social media posts shows degraded performance compared to monolingual settings, due to noisy
and highly variable input [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These challenges motivate the need for task-specific resources and robust
computational models capable of handling the inherent diversity in code-mixed text.
      </p>
      <p>
        IR in code-mixed settings introduces an additional layer of difficulty. Unlike structured NLP tasks,
IR in the code-mixed domain requires effective matching between queries and documents, both of which
may contain inconsistent transliterations, hybrid grammar, or irregular spellings. Prior studies have
demonstrated that indexing strategies and normalization techniques can improve performance in
such scenarios [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. For example, experiments at SIGIR reported that clustered indexing improved
retrievability of code-mixed content compared to unified indexing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while FIRE 2014 shared task
showed that transliteration normalization combined with sub-word indexing could substantially boost
retrieval effectiveness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These findings illustrate the importance of tailoring IR approaches specifically
to noisy, code-mixed environments. This problem is particularly relevant in real-world contexts such
as migrant communities on platforms like Facebook and WhatsApp, where users share experiences,
seek advice, and exchange critical information. During the COVID-19 pandemic, for example, code-mixed
conversations in online groups were a crucial source of localized guidance on health policies, mobility
restrictions, and access to resources. However, the lack of standardized scripts made it difficult for users
and retrieval systems alike to locate relevant past information efficiently. To address these challenges, the
CMIR-2025 shared task (https://cmir-iitbhu.github.io/cmir/index.html) at FIRE 2025 invites researchers to design and develop IR models, focusing on
the retrieval of relevant answers from Bengali-English code-mixed text written in Roman script [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
The task required participating systems to process natural language queries in code-mixed form and
retrieve relevant documents at the sentence or post level. This setup provided a realistic benchmark for
testing the robustness of IR methods on noisy and heterogeneous user-generated content.
      </p>
      <p>In this paper, we describe a fusion of classical IR models - BM25, HiemstraLM, and DirichletLM - to
retrieve relevant answers from Bengali-English code-mixed text. The rationale behind this design is to
exploit the complementary strengths of different retrieval functions, thus reducing query-document
mismatch and improving overall effectiveness on code-mixed data. Experimental results demonstrate
that our fusion-based pipeline significantly outperformed individual retrieval models. On the Test
set provided by the organizers, our system achieved MAP of 0.211792, nDCG of 0.485517, P@5
of 0.42, and P@10 of 0.30, attaining 1st rank in the CMIR-2025 shared task. These results confirm
the importance of integrating multiple retrieval strategies when tackling the unique challenges posed
by code-mixed text, and highlight the potential of traditional IR methods, when carefully adapted, to
address modern CMIR problems. Our contributions are as follows:
• We develop and evaluate a classical retrieval pipeline tailored for Bengali-English code-mixed
text using a combination of BM25, DirichletLM, and HiemstraLM.
• We apply RRF to combine the outputs of these diverse models, demonstrating that fusion mitigates
the individual models' weaknesses and improves ranking quality on noisy, informal inputs.
• We conduct a comparative analysis of eight retrieval models under code-mixed-compatible
indexing strategies, highlighting the critical role of pre-processing configuration for code-mixed
IR.</p>
      <p>The remainder of this paper details the related works (Section 2), methodology (Section 3),
experiments, results, and implications of our approach (Section 4), and a declaration on generative AI (Section
5), followed by the conclusion and future works (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Prior work in CMIR has explored challenges in handling multilingual and mixed-script queries,
particularly in informal social media contexts. The application of Large Language Models (LLMs) and
prompt-based retrieval to noisy, informal text has seen significant advances in recent years.
RetrieveGPT - a notable work proposed by Sun et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] integrates LLM prompting with traditional IR
models for CMIR. Their experiments showed improvements of 8–10% in precision compared to dense
retrievers, demonstrating strong contextual adaptation, although the method remains computationally
expensive and sensitive to prompt design. Chakma and Das [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced a Hindi–English code-mixed
tweet corpus and evaluated classical IR models such as BM25 and Term Frequency-Inverse Document
Frequency (TF-IDF), reporting MAP scores of 0.18 and 0.15, respectively, highlighting the difficulty
of handling transliteration noise, inconsistent spellings, and informal usage in CMIR.
      </p>
      <p>
        Mandal and Nanmaran [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tackled the normalization challenge in transliterated text by combining
a seq2seq model with Levenshtein-distance correction. Their approach achieved 90.3% accuracy in
recovering canonical forms, thereby improving query-document alignment. However, it struggled with
longer sequences and ambiguous contexts, showing that normalization, while useful, is not a complete
solution for retrieval. Bhat et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] investigated supervised learning for mixed-script query labeling
during FIRE 2014 task, using SVMs and decision trees combined with edit-distance based query expansion
and sub-word indexing. Their system achieved reasonable retrieval performance, but the reliance on
shallow features limited its robustness. Ganguly et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] further enhanced mixed-script retrieval
with rule-based fuzzy normalization, which improved recall but depended heavily on handcrafted rules,
making it difficult to generalize.
      </p>
      <p>
        Jain and Pal [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] advanced mixed-script IR research by using CRF-based token classification along
with DFR-based back-transliteration indexing. Their system achieved an nDCG@10 score of 0.716,
illustrating that combining language identification with probabilistic retrieval can substantially improve
effectiveness. Ghosh et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed CRF-based token labeling for query words and obtained
weighted F-measures of around 0.75, although their system struggled with rare tokens and
out-of-vocabulary cases. Chanda and Pal [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigated the role of stopword removal in Bengali–English
CMIR and showed that a corpus-specific stopword list improved MAP from 0.134 to 0.155 (a relative gain
of 16%). However, they cautioned that aggressive removal could also discard function words that carry
semantic weight in code-mixed contexts. Li et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] introduced CoIR,
a benchmark for code IR models across multiple domains. They demonstrated that dense retrievers
degrade by 10–20% under domain and script variation, underscoring their brittleness for code-mixed
scenarios. Dai et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed the Cocktail benchmark, which integrates LLM-generated documents.
Their results revealed that neural models may rank based on stylistic patterns rather than semantic
relevance, raising concerns for retrieval in noisy and mixed environments.
      </p>
      <p>Together, these works highlight the evolution of CMIR research, from normalization and supervised
token classification to modern hybrid and LLM-driven retrieval pipelines. While each approach offers
specific strengths, such as improved recall from normalization or strong tagging accuracy in CRF models,
they also reveal limitations in scalability, robustness, and domain transfer. Building on these insights, our
system combines classical IR models (BM25, DirichletLM, and HiemstraLM) through fusion strategies,
achieving robust performance on Bengali–English code-mixed queries in the CMIR-2025 shared task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The CMIR-2025 shared task requires retrieving relevant answers to natural language queries written in
Romanized Bengali text mixed with English. The primary challenges lie in handling noisy transliteration,
spelling variation, and the lack of a standardized vocabulary. To address these issues, we (team MUCS)
designed a fusion of classical IR models (BM25, HiemstraLM, and DirichletLM), with an emphasis on
leveraging complementary strengths through their fusion. The overall architecture of our proposed
retrieval framework is illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>The dataset provided for CMIR-2025 consists of a baseline corpus in TREC format, along with a training
set of Queries and corresponding Relevance judgments (QRels). The documents originate from social
media platforms, exhibiting high variation in spelling, word order, and use of Roman script for Bengali
words. Queries are posed as natural language questions, and documents are labeled as relevant if they
contain valid answers. We used the training queries and QRels for model development and validation,
while using test queries for final evaluation. The overall statistics of the dataset are summarized in Table 1
and a few query samples are shown in Table 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fusion-Based Retrieval Model</title>
        <p>
          We employed the PyTerrier (https://pypi.org/project/python-terrier/) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] framework for indexing and retrieval. The corpus is indexed using
the TRECCollectionIndexer without stemming or stopword removal in order to preserve all possible
lexical signals, as stopword lists for code-mixed text remain unreliable and risk discarding semantically
meaningful words. Each document is stored with its associated identifier, and queries are parsed into
the same format for compatibility with retrieval models. Our proposed model fuses the following
traditional IR models based on their ranks:
• Best Matching 25 (BM25): A probabilistic ranking model widely used in IR due to its robustness across
domains. It ranks documents based on term frequency, document length, and inverse document
frequency.
• Query Likelihood Model - Hiemstra Language Model (HiemstraLM): A foundational language
modeling approach for IR which treats each document as a probabilistic language model and ranks
documents by the likelihood that they would generate the user's query.
        </p>
        <p>
          • Dirichlet Language Model (DirichletLM): A language modeling approach that estimates the likelihood of
generating a query from a document, with smoothing applied to handle unseen terms.
Each retrieval model produces an independent ranked list of documents for each query. To construct
the final output, we adopted Reciprocal Rank Fusion (RRF) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. RRF assigns scores to documents
based on the inverse of their ranks across multiple models, giving more weight to documents ranked
highly by several systems. This approach improves ranking robustness by combining the strengths of
lexical overlap (BM25), smoothed query probabilities (DirichletLM), and statistically resilient modeling
(HiemstraLM). The fusion strategy reduces reliance on any single model and improves effectiveness,
especially for queries with transliteration variants, informal spellings, or partial term overlap.
        </p>
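The fusion step above can be sketched in a few lines of Python. This is a minimal illustration of Reciprocal Rank Fusion, assuming the smoothing constant k = 60 from the original RRF formulation; the toy ranked lists stand in for the three models' outputs and are not from the CMIR-2025 data:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over each model's ranking.

    rankings: list of ranked lists of document ids, one per retrieval model.
    k: smoothing constant; 60 follows the original RRF paper.
    Returns document ids sorted by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists standing in for BM25, DirichletLM, and HiemstraLM output.
bm25_run = ["d1", "d2", "d3"]
dirichlet_run = ["d2", "d1", "d4"]
hiemstra_run = ["d2", "d3", "d1"]
fused = rrf_fuse([bm25_run, dirichlet_run, hiemstra_run])
# "d2" is ranked highly by all three models, so it rises to the top.
```

Documents ranked highly by several models accumulate the largest fused scores, which is how RRF rewards cross-model agreement without needing to calibrate the models' raw scores against each other.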
        <p>In addition to the fused model, we also evaluated other standard IR models to compare performance:
• TF-IDF: A vector space model where terms are weighted by their frequency in a document and
their inverse frequency in the collection. It provides a strong, fast lexical baseline.
• PL2 and DPH: Divergence From Randomness (DFR) models that measure how much a term’s
frequency in a document diverges from randomness. PL2 is based on Poisson statistics; DPH
combines probabilistic term modeling with normalized term frequency.
• DLH13: A DFR variant that uses term risk to model document scores, suitable for retrieval
scenarios where document length and term frequency vary significantly.
• IFB2: A DFR model that adjusts for the randomness of term distributions using inverse document
frequency and information gain.</p>
        <p>These models cover diverse retrieval paradigms — from lexical matching to probabilistic and
distribution-based scoring — which makes them well-suited for evaluation in noisy, code-mixed
environments.</p>
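For intuition on the lexical-matching side of these models, the BM25 scoring function can be written out directly. This is a toy, self-contained sketch of the Okapi BM25 formula (using one common IDF variant), not PyTerrier's implementation; the Romanized tokens below are made up for illustration:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25.

    corpus: list of tokenized documents, used for IDF and average length.
    k1 controls term-frequency saturation; b controls length normalization.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # a term absent from the collection contributes nothing
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

# Hypothetical Romanized Bengali-English tokens.
corpus = [["bus", "kothay", "pabo"], ["train", "ticket", "ache"], ["bus", "ache", "ki"]]
query = ["bus", "ache"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# The third document matches both query terms and scores highest.
```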
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>
        We evaluated several retrieval models for Bengali-English code-mixed IR. All models are implemented
using PyTerrier with standard parameters. Evaluation is performed using standard IR metrics: MAP,
nDCG, Reciprocal Rank (RR) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], P@5, and P@10. MAP is used as the primary metric for
evaluation, while nDCG and precision scores provide additional insight into early ranking quality. RR is
particularly useful for understanding how well the system ranks a relevant document at the top of the
result list [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. In CMIR, where exact matches are affected by the spelling variation and transliteration
errors of code-mixed text, RR becomes important for measuring how frequently at least one relevant
document appears in the top ranks. Unlike MAP or nDCG, which consider the ranking of all relevant
documents, RR focuses on the rank of the first correct result. A higher RR indicates that users can find at
least one relevant result quickly, which is crucial for user satisfaction in noisy, informal text
retrieval scenarios.
      </p>
      <sec id="sec-4-0-1">
        <title>4.0.1. Baseline Indexing Configuration</title>
        <p>The organizers provided a baseline using PyTerrier's default indexing configuration. This includes
English stopword removal, Porter stemming, and standard TREC parser settings. Although effective
for standard English corpora, this configuration reduces retrieval effectiveness in code-mixed settings,
where transliterated Bengali words such as kotha, ache, and ki may be incorrectly stemmed or
removed. Table 3 shows the performance of various models using the baseline configuration. Among
individual models, HiemstraLM outperforms the others with the highest MAP and nDCG scores.</p>
      </sec>
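To make these metrics concrete, RR and P@k can be computed as below; the ranking and relevance judgments shown are hypothetical, not drawn from the CMIR-2025 data:

```python
def reciprocal_rank(ranking, relevant):
    """RR: inverse of the rank of the first relevant document (0 if none)."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranking, relevant, k):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

ranking = ["d7", "d2", "d9", "d4", "d1"]  # hypothetical system output
relevant = {"d2", "d4"}                   # hypothetical QRels for this query
rr = reciprocal_rank(ranking, relevant)   # first relevant document at rank 2
p_at_5 = precision_at_k(ranking, relevant, 5)
```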
      <sec id="sec-4-1">
        <title>4.1. Models with Improved Indexing</title>
        <p>To improve vocabulary coverage and matching accuracy in a Romanized, code-mixed context, we
modified the indexing strategy by disabling both stemming and stopword removal. The corpus is indexed
using pt.TRECCollectionIndexer with stemmer=None, stopwords=None, and overwrite=True.
This configuration retains all tokens, including informal and transliterated terms, and better preserves
lexical overlap between noisy code-mixed queries and documents. Table 4 presents the performance
of individual retrieval models under this improved indexing setup. HiemstraLM again achieved the
highest MAP among single models. A fusion model combining BM25, DirichletLM, and HiemstraLM
using RRF achieves the best nDCG and P@10, confirming the benefit of combining multiple ranking
signals.</p>
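A minimal sketch of this configuration in PyTerrier follows. The index path and corpus file are hypothetical placeholders, and the API shown follows python-terrier 0.x (pt.init() and pt.BatchRetrieve are deprecated aliases in newer releases), so treat it as an outline of the setup rather than a drop-in script:

```python
import pyterrier as pt

if not pt.started():
    pt.init()

# Index the TREC-format corpus with stemming and stopword removal disabled,
# so informal and transliterated tokens survive intact.
indexer = pt.TRECCollectionIndexer(
    "./cmir_index",      # hypothetical index location
    stemmer=None,
    stopwords=None,
    overwrite=True,
)
index_ref = indexer.index(["./corpus/collection.trec"])  # hypothetical corpus file

# The three classical models over the same index, later fused with RRF.
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")
dirichlet = pt.BatchRetrieve(index_ref, wmodel="DirichletLM")
hiemstra = pt.BatchRetrieve(index_ref, wmodel="Hiemstra_LM")
```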
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Result Analysis</title>
        <p>Improved indexing significantly enhances retrieval quality across all models, confirming that preserving
full lexical forms benefits performance in code-mixed settings. HiemstraLM achieves the highest MAP
and P@5, while IFB2 is effective at P@10. DirichletLM performs reliably across most metrics, benefiting
from smoothing under data sparsity. The fusion model outperforms all individual models in terms of
nDCG and P@10. Combining BM25, DirichletLM, and HiemstraLM using RRF allows the system to
benefit from different retrieval strategies. RRF promotes documents ranked highly across models, which
improves both stability and coverage, particularly for noisy, mixed-language input. Figure 2 shows the
ranks of the participating teams in the shared task, revealing that team MUCS achieved the
top position based on MAP, demonstrating the effectiveness of the proposed retrieval pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Declaration on Generative AI</title>
      <p>In the course of preparing this paper, we made limited use of a generative AI assistant to support the
writing process. The tool was used primarily to help with language refinement, structuring of sections,
and ensuring consistency in LaTeX formatting. All technical content, experimental design, model
implementation, and results were conceived, executed, and validated entirely by the authors. The AI
assistant did not generate novel research ideas, nor did it influence the reported findings. Its role was
strictly supportive, comparable to using grammar checkers or typesetting tools, and every piece of
content included in this manuscript was critically reviewed and approved by the authors.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, we presented our approach for the CMIR-2025 shared task, focusing on the retrieval of
relevant information from Bengali–English code-mixed queries written in Roman script. Our team
MUCS designed a fusion model leveraging traditional retrieval models. The proposed fusion-based model,
combining BM25, HiemstraLM, and DirichletLM, achieved the highest performance with a MAP score of 0.2117,
nDCG of 0.4855, P@5 of 0.42, and P@10 of 0.30. These results confirm the effectiveness of leveraging complementary
retrieval models to address the challenges of noisy, code-mixed social media text. Incorporating neural
ranking models and techniques for handling spelling variations and transliteration more effectively
may further improve model performance. Furthermore, integrating dense retrievers and
hybrid approaches may enhance retrieval on complex code-mixed queries. While
our current system demonstrates robustness, these directions could provide additional improvements
for real-world multilingual and informal text retrieval scenarios.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
          </string-name>
          ,
          <article-title>Cmir: A Corpus for Evaluation of Code-Mixed Information Retrieval of Hindi-English Tweets</article-title>
          ,
          <source>Computación y Sistemas</source>
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>657</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nanmaran</surname>
          </string-name>
          ,
          <article-title>Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model &amp; Levenshtein Distance</article-title>
          , arXiv preprint arXiv:
          <year>1805</year>
          .
          <volume>08701</volume>
          (
          <year>2018</year>
          ). URL: https: //arxiv.org/abs/
          <year>1805</year>
          .08701.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Pal,</surname>
          </string-name>
          <article-title>The effect of stopword removal on information retrieval for code-mixed data obtained via social media</article-title>
          ,
          <source>SN Comput. Sci. 4</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1007/s42979-023
          <article-title>-01942-7</article-title>
          . doi:
          <volume>10</volume>
          .1007/s42979-023-01942-7.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Solorio</surname>
          </string-name>
          , Retrievability of Code-Mixed Microblogs,
          <source>in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>973</fpage>
          -
          <lpage>976</lpage>
          . doi:
          <volume>10</volume>
          .1145/2911451.2914736.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>U.</given-names>
            <surname>Barman</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Foster</surname>
          </string-name>
          ,
          <article-title>Mixed-Script Query Labeling Using Supervised Learning and Ad Hoc Retrieval Using Sub-Word Indexing</article-title>
          ,
          <source>in: Proceedings of FIRE</source>
          <year>2014</year>
          :
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , volume
          <volume>1331</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>47</lpage>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>1331</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <article-title>Mixed-Script Query Labeling Using Supervised Learning and Ad Hoc Retrieval Using Sub-Word Indexing</article-title>
          , in: Working Notes of FIRE 2014 -
          <article-title>Forum for Information Retrieval Evaluation, Bangalore</article-title>
          , India,
          <year>2014</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          . URL: https://www2.isical.ac.in/~fire/ working-notes/
          <year>2014</year>
          /MSR/FIRE2014_BITS-Lipyantran.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the Shared Task on Code-Mixed Information Retrieval from Social Media Data, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, Association for Computing Machinery</article-title>
          ,
          <year>2025</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>31</lpage>
          . URL: https://doi.org/10.1145/ 3734947.3735670. doi:
          <volume>10</volume>
          .1145/3734947.3735670.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data</article-title>
          ,
          <source>in: FIRE 2024 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          , p.
          <fpage>124</fpage>
          -
          <lpage>128</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>4054</volume>
          /
          <fpage>T2</fpage>
          -1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2411.04752</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2411.04752.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>DCU@FIRE-2014: Fuzzy Queries with Rule-Based Normalization for Mixed Script Information Retrieval</article-title>
          ,
          <source>in: Proceedings of FIRE 2014</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval</article-title>
          ,
          <source>in: Proceedings of FIRE 2015 Workshop on Mixed Script Information Retrieval</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Labeling of Query Words Using Conditional Random Field</article-title>
          ,
          <source>arXiv preprint arXiv:1607.08883</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Q.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>CoIR: A Comprehensive Benchmark for Code Information Retrieval Models</article-title>
          ,
          <source>in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Association for Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>12345</fpage>
          -
          <lpage>12358</lpage>
          . URL: https://aclanthology.org/2025.acl-long.123/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration</article-title>
          ,
          <source>arXiv preprint arXiv:2405.16546</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2405.16546.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>PyTerrier: Declarative Experimentation in Python</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>2117</fpage>
          -
          <lpage>2120</lpage>
          . doi:10.1145/3397271.3401075.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buettcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms Condorcet and individual rank learning methods</article-title>
          ,
          <source>in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          . URL: https://doi.org/10.1145/1571941.1572114. doi:10.1145/1571941.1572114.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the CMIR Track at FIRE 2025: Code-Mixed Information Retrieval from Social Media Data</article-title>
          ,
          <source>in: FIRE '25: Proceedings of the 17th Annual Meeting of the Forum for Information Retrieval Evaluation, December 17-20, Varanasi, India</source>
          , Association for Computing Machinery (ACM), New York, NY, USA,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Findings of the Code-Mixed Information Retrieval from Social Media Data (CMIR) Shared Task at FIRE 2025</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          (Eds.),
          <source>Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2025), December 17-20, Varanasi, India</source>
          , CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>