<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LexiSemIR: A Two-Stage Re-ranking Framework with BM25 and Zero-Shot Bi-Encoder</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Swati Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanusree Nath</string-name>
          <email>tanusreenath32@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vedika Gupta</string-name>
          <email>vedika.nit@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manjari Gupta</string-name>
          <email>manjari@bhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Banaras Hindu University</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jindal Global Business School, O.P. Jindal Global University</institution>
          ,
          <addr-line>Sonipat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Most standard Information Retrieval (IR) models primarily rely on keyword matching, which can be inadequate when a deeper contextual understanding is required. In such cases, it becomes essential to capture both lexical and semantic relationships between query-document pairs. To address this limitation, our team CodeWeavers proposes LexiSemIR, a two-stage re-ranking-based model developed for the CMIR-2025 (Code-Mixed Information Retrieval) shared task on Bengali-English code-mixed text. In the first stage, the top k documents are retrieved using a lexical bag-of-words model (BM25). These are then re-ranked in the second stage using a zero-shot bi-encoder, which computes semantic similarity between query and document embeddings. The proposed approach balances simplicity and performance, while minimizing trainable parameters due to its zero-shot design. LexiSemIR secured 3rd place in the CMIR-2025 shared task, achieving MAP = 0.1546 and P@5 = 0.38, thereby outperforming the BM25 baseline in early precision. The results highlight the model's ability to effectively combine lexical and semantic retrieval strategies for robust performance in code-mixed IR settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-mixed language</kwd>
        <kwd>Information retrieval</kwd>
        <kwd>BM25</kwd>
        <kwd>Bi-Encoder</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) is broadly defined as "finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections (usually stored on
computers)" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In modern times, IR finds applications in a variety of tasks, including web
search, question answering systems, personal assistants, chatbots, and digital libraries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There has
also been a surge of user-generated content in code-mixed languages in recent years, which has further
complicated the task of information retrieval [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Code-mixing refers to the act of "mixing two or more
languages in a single discourse" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While code-mixing may help a model understand multilingual
similarities, it can also hurt retrieval effectiveness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Most traditional IR systems
depend on lexical matching-based algorithms (BM25, Hiemstra-LM, etc.), which may prove
ineffective in instances where the same word has different meanings or where different words
have the same meaning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In such cases, a deeper semantic understanding
of the queries and documents is needed. To address this drawback, semantic-matching-based algorithms
(vector space models, neural networks, etc.) have been developed that focus on the meanings of tokens. To
further improve the performance of a single IR algorithm, two-stage retrieval systems have been developed;
since the 1990s, such systems have undergone constant improvement with the advent of
new methods and technologies. One recent study used a two-stage retrieval system combining
an adapted BM25 with a neural ranking model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Instead of the traditional two-stage retrieval system, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced a serverless three-stage retrieval
system with BM25, monoBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and duoBERT. Another study by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also developed a two-stage
retrieval system using BM25 and monoBERT encoder. Following these works, our study also aims to
build a two-stage retrieval model. However, in place of monoBERT in the second stage, we employ
MPNet in a zero-shot setting. MPNet is a good candidate as a second-stage re-ranker because
documents are often encoded using sentence embeddings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In addition to that, using zero-shot
mode to transfer knowledge from English to other languages is a popular approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This not
only makes the model faster and scalable, but also allows for independent encoding of queries and
documents via the bi-encoder architecture instead of the traditional cross-encoder. A similar study
by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] also employed BM25 and a SentenceTransformer-based encoder to build a two-stage retrieval
system. However, this study used fine-tuning to fit the SentenceTransformer on their data. Our proposed
approach uses SentenceTransformer in a zero-shot setting, which makes it faster and parameter-efficient,
while maintaining comparable performance in the given task.
      </p>
      <p>The overall contributions of our work are summarized below:
• We propose LexiSemIR, a two-stage re-ranking-based retrieval framework, combining a lexical
BM25 ranker with a zero-shot semantic bi-encoder to capture both the lexical and semantic meaning of
query-document pairs. The model balances performance with simplicity and parameter
efficiency. It ranked 3rd in the CMIR-2025 shared task on Bengali-English
code-mixed text, with a MAP score of 0.1546, an NDCG score of 0.2767, P@5 of 0.38, and P@10 of 0.2833.
• We conduct a detailed error analysis of the proposed model and provide insights into its
underlying biases and sources of error. In addition, we highlight query-specific variations
that directly affect retrieval quality. This insight provides actionable guidance for
query-expansion techniques in future work.
• A sensitivity analysis is performed across multiple values of the first-stage retrieval depth k,
demonstrating how retrieval depth influences performance stability across metrics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        Code-mixed information retrieval research has undergone significant improvement in recent years.
One of the major breakthroughs in this area is the creation of new corpora in low-resource languages.
Early studies like [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] introduced such a corpus of Hindi-English code-mixed social media data. Another
study by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduced a track for mixed-script information retrieval in Hindi at FIRE-2016.
Bengali, being such a low-resource language, has also started gaining attention from the research community. A study
by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] developed a Bengali information retrieval system by introducing a Bengali text corpus, using
advanced preprocessing techniques and applying TF-IDF with cosine similarity to retrieve relevant
answers to queries. Another study [18] experimented with prompt engineering and mathematical
modeling with GPT-3.5. Some studies, like [19] and [20], combined multiple languages (Hindi, Bengali
and Marathi). In the study by [19], several indexing and retrieval strategies were evaluated using
techniques like Divergence from Randomness variants, Okapi BM25, TF-IDF, and statistical language
models.
      </p>
      <p>Some of the early works also focused on cross-lingual information retrieval. A study [21] developed
a framework to retrieve English documents using Hindi and Bengali queries using machine translation
and automatic query generation. A similar study by [22] assembled a system to retrieve English
documents using Bengali, Hindi and Telugu queries. To achieve this, they used a combination of
bilingual dictionaries, suffix-stripping stemmers, transliteration and TF-IDF ranking.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section presents the detailed methodology of the proposed approach. The goal of this work is to
develop a Code-Mixed Information Retrieval (CMIR) system, focusing on code-mixed Bengali-English
queries and documents. Standard baseline information retrieval models mainly rely on lexicon-based
keyword matching (TF-IDF, BM25, PL2, InL2, Hiemstra-LM), completely discounting the semantic
meaning of natural text. To overcome this drawback, the LexiSemIR (Lexical + Semantic Information
Retrieval) model is proposed, which effectively balances lexical and semantic matching of
query-document pairs. By employing a two-stage retrieval process and re-ranking technique, the LexiSemIR
model ensures that both surface-level keyword overlap and deeper contextual understanding between
query-document pairs are captured. An overview of the model is given in Figure 1. The detailed
workflow of the model is described in the succeeding sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Setup</title>
        <p>The dataset consists of a set of queries Q and documents D. Both the queries and the documents are
provided in code-mixed Bengali. For the training phase, the relevance judgements for the queries are
available in a separate file, where each query is mapped to one or more relevant documents, with a
binary relevance score. Table 1 summarizes the dataset statistics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Design</title>
        <p>The retrieval is performed in two stages. In the first stage of the retrieval, the BM25 ranker is used to
retrieve the top 100 matching documents [23], followed by a bi-encoder-based re-ranking to retrieve
the top 10 amongst those.</p>
        <p>BM25. BM25 is a classic probabilistic bag-of-words model, used as the first-stage retriever. Given a query
Q with tokens {q1, q2, . . . , qn}, and a set of documents D = {d1, d2, . . . , dN}, where N is the total
number of documents in the corpus, BM25 calculates the relevance score of document d with respect
to query Q as:</p>
        <p>BM25(Q, d) = Σ (i = 1 to n) IDF(qi) · [ f(qi, d) · (k1 + 1) ] / [ f(qi, d) + k1 · (1 − b + b · |d| / avgdl) ]</p>
        <p>Here, k1 and b are two hyperparameters: k1 controls the term-frequency saturation, while b controls
document length normalization. f(qi, d) is the frequency of term qi in document d, avgdl is the
average document length in the collection, and IDF stands for the Inverse Document Frequency of a
term qi. It is calculated as:</p>
        <p>IDF(qi) = ln( (N − n(qi) + 0.5) / (n(qi) + 0.5) + 1 )</p>
        <p>In the above equation, n(qi) refers to the number of documents containing the term qi.</p>
        <p>Bi-encoder Unit: The documents retrieved in the first stage, along with the query, are passed to a
pre-trained sentence-transformer-based bi-encoder unit. The bi-encoder unit consists of the
all-mpnet-base-v2 Sentence-Transformer (based on MPNet [24]), which processes the query and the
documents in parallel. Given the initial query Q, let D′ = {d1, d2, . . . , d100} be the top 100 documents retrieved
in the first stage. If E : text → R^768 denotes the bi-encoder function that maps a piece of text to a
dense vector, then the query embedding emb_Q and the document embedding emb_d are calculated as
emb_Q = E(Q) ∈ R^768 and emb_d = E(d) ∈ R^768, respectively. The bi-encoder unit
encodes each document and the query to produce dense vectors that capture their semantic meaning.
The final relevance score s(Q, d) of each document with respect to the query is calculated by computing
the cosine similarity between their respective embedding vectors:</p>
        <p>s(Q, d) = cos(emb_Q, emb_d) = (emb_Q)ᵀ emb_d / ( ‖emb_Q‖₂ ‖emb_d‖₂ )</p>
        <p>This score is used to retrieve and align the final top 10 relevant documents with respect to query Q.</p>
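        <p>The re-ranking step reduces to a cosine-similarity sort over embeddings. A minimal sketch follows, using toy 3-dimensional vectors in place of the 768-dimensional vectors a sentence encoder such as all-mpnet-base-v2 would produce.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(query_emb, doc_embs, top_n=10):
    """Order first-stage candidates by cosine similarity to the query."""
    ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]),
                    reverse=True)
    return ranked[:top_n]

# Toy embeddings standing in for encoder outputs
q_emb = [1.0, 0.2, 0.0]
cand_embs = {"d1": [0.9, 0.1, 0.1],
             "d2": [0.0, 1.0, 0.0],
             "d3": [-1.0, 0.0, 0.3]}
top = rerank(q_emb, cand_embs, top_n=2)
```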
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Hyperparameters</title>
        <p>Hyperparameters are a set of configuration variables that control the model’s performance. By
configuring the hyperparameter settings, a model can be tuned for optimal performance. Table 2 summarizes
the different hyperparameters used in the LexiSemIR model along with their descriptions.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training and Testing</title>
        <p>For training, 20 queries are provided along with a relevance judgement file that maps each query to at least one document. The LexiSemIR model involves limited
tuning only in Stage 1 (BM25). Specifically, we adjusted BM25 hyperparameters using the training set
queries and relevance judgements. Stage 2 (the bi-encoder) was employed in a purely zero-shot setting
without any additional fine-tuning. Thus, while the overall framework is evaluated on both training
and test queries, only Stage 1 undergoes hyperparameter tuning. For testing, 30 queries are given. The
trained model is used to retrieve relevant documents for the test queries, the results of which are further
evaluated by the organizers to publish the final evaluation metrics.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our team CodeWeavers submitted two runs in the CMIR-2025 shared task. Run 1 corresponds to an
enhanced version of Hiemstra-LM with advanced query and document preprocessing. Run 2 corresponds
to the LexiSemIR model, which secured 3rd rank in the overall competition. The test metrics were
released by the organizers, as we did not have direct access to the relevance judgments. The reported evaluation
metrics on the test queries for both runs are listed in Table 3: Run 1 achieved MAP = 0.15627 and
NDCG = 0.341038, while Run 2 (LexiSemIR) achieved MAP = 0.154625 and NDCG = 0.276684.</p>
      <p>Since the relevance judgements for the test queries were not available, further analysis of the model
based on errors and sensitivity is conducted on the training set. Table 4 shows the performance metrics
of the model on the training queries: Run 1 achieved MAP = 0.4162 and NDCG = 0.4711, while
Run 2 (LexiSemIR) achieved MAP = 0.644196 and NDCG = 0.411855.</p>
      <sec id="sec-4-1">
        <title>4.1. Error Analysis</title>
        <p>This section presents the error analysis of the proposed model in the training phase. Figure 2 shows
the confusion matrix corresponding to the proposed LexiSemIR model. From the figure, it is evident
that a total of 67 out of 378 relevant documents were retrieved by the model. It can also be observed
that the false-negative (FN) rate is higher than the false-positive (FP) rate. The FN rate shows that a
considerable portion of relevant documents were not ranked in the top ten, indicating significant room for
improvement. The FP rate is also quite high, signalling that the model confidently
places irrelevant documents in the top ten.</p>
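        <p>The confusion-matrix counts can be derived directly from the retrieved top-10 lists and the relevance judgements; a minimal sketch on toy data:</p>

```python
def confusion_counts(retrieved_by_query, relevant_by_query):
    """Aggregate TP/FP/FN over all queries from top-k retrieval results."""
    tp = fp = fn = 0
    for q, retrieved in retrieved_by_query.items():
        relevant = relevant_by_query.get(q, set())
        hits = set(retrieved) & relevant
        tp += len(hits)                    # relevant docs that were retrieved
        fp += len(retrieved) - len(hits)   # irrelevant docs in the top-k
        fn += len(relevant) - len(hits)    # relevant docs that were missed
    return tp, fp, fn

# Toy retrieval results and judgements
retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
relevant = {"q1": {"d2", "d7"}, "q2": {"d9"}}
tp, fp, fn = confusion_counts(retrieved, relevant)
```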
        <p>Figure 3 shows the rank distribution of the 67 retrieved documents. A clear trend can be observed,
where the number of retrieved relevant documents decreases as the rank increases, showing that
the model successfully places the most relevant documents in the top ranks.</p>
        <p>For a more detailed error analysis, an evaluation of the retrieved documents based on each query is
presented in Table 5. It is observed from the table that queries 19 and 1 retrieved the highest number of
relevant documents, followed by queries 13 and 14. The worst performing queries are 2 and 12, with 0
retrieved documents. Queries 21 and 25 have the highest number of false negatives, while queries 2 and
12 have the highest number of false positives.</p>
        <p>A closer inspection reveals that the poor performance for queries 2 and 12 may be attributed to
their predominant use of Bengali tokens with minimal English code-mixing. Since all-mpnet-base-v2
is primarily trained on English text, it lacks robust multilingual alignment capabilities. Consequently,
Bengali-dominant inputs fall outside its pretraining distribution, leading to weaker semantic
representations. The absence of sufficient English context limits the model’s ability to generate meaningful
embeddings, thereby reducing retrieval effectiveness for such queries.</p>
        <p>Overall, this analysis shows that the model’s performance variation is highly query-specific.
Addressing this issue by improving the handling of queries with limited code-mixing could significantly
enhance the model’s ability to mitigate false negatives and improve generalization in future iterations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The proposed model secured third place in the overall CMIR-2025 competition, which indicates
its effectiveness. Although the model performed well on the dataset, it is
important to note that the query sets for the train and test phases were relatively small (20 and
30 queries, respectively), leaving room to validate the approach on larger datasets.
While the model showed comparable performance with the hyperparameter
settings mentioned in Section 3.3, an additional analysis based on the first-stage retrieval depth k has been
performed to assess the model’s sensitivity towards it. The other two hyperparameters (k1 and b)
are kept at their default values. Like the error analysis, the sensitivity analysis is
performed on the training queries due to the unavailability of the relevance judgements for test queries.</p>
      <sec id="sec-5-1">
        <title>5.1. Sensitivity Analysis</title>
        <p>This section examines the sensitivity of the model to the first-stage retrieval depth k. After initially
submitting the model tuned with k = 100, it was later evaluated with
seven other values of k to analyse its sensitivity, and the corresponding performance metrics on
the training data were observed. The performance metrics observed for the
different values of k are plotted in Figure 4.</p>
        <p>From the figure, it can be observed that the overall best value of k is 150, with the highest
values of both NDCG@10 and P@10. k = 25 had the highest MAP score, while k = 50 had the
highest value for P@5.</p>
        <p>It can also be observed that the MAP score is the least sensitive to changes in k. On the
other hand, P@5 shows drastic sensitivity when k changes from 25 to 50, where it rises
from the minimum to the maximum. After that, it remains fairly stable before dipping again at
k = 200. For both NDCG@10 and P@10, a gradual but uneven upward trend is observed as
k increases from 25 to 150. However, at k = 175, the values of these two metrics drop drastically.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study proposed a two-stage re-ranking-based framework as a part of the CMIR-2025 shared task in
code-mixed information retrieval [25], [26]. The model uses BM25 as the first-stage ranker to retrieve
the top 100 documents. These documents are then passed to the second stage, where a zero-shot
bi-encoder computes the query and document embeddings. A cosine similarity score between them
provides the relevance measure, which in turn is used to retrieve the top 10 final documents.</p>
      <p>The proposed LexiSemIR model, which represents the second run submitted for the task, secured
the third rank in the competition, demonstrating competitive performance. Although Run 1 achieved
slightly higher NDCG and MAP scores, Run 2 (LexiSemIR) showed improvements in P@5 and P@10,
indicating enhanced early precision at a modest cost in MAP and NDCG.</p>
      <p>This study also provides a detailed analysis of the model on the training data. A comprehensive error
analysis was conducted on the training queries, offering valuable insights into the model’s performance,
biases, and potential areas for improvement. In addition, a sensitivity analysis on the training queries
highlighted the performance trends with varying values of the first-stage retrieval depth k.</p>
      <p>Future work could focus on more advanced preprocessing of code-mixed data and the incorporation
of language-specific encoders to achieve a more accurate semantic understanding.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[18] A. Deroy, S. Maity, Retrievegpt: Merging prompts and mathematical models for enhanced
code-mixed information retrieval, arXiv preprint arXiv:2411.04752 (2024).
[19] J. Savoy, L. Dolamic, M. Akasereh, Information retrieval with hindi, bengali, and marathi
languages: Evaluation and analysis, in: Multilingual Information Access in South Asian Languages:
Second International Workshop, FIRE 2010, Gandhinagar, India, February 19-21, 2010 and Third
International Workshop, FIRE 2011, Bombay, India, December 2-4, 2011, Revised Selected Papers,
Springer, 2013, pp. 334–352.
[20] J. Leveling, G. J. Jones, Sub-word indexing and blind relevance feedback for english, bengali, hindi,
and marathi ir, ACM Transactions on Asian Language Information Processing (TALIP) 9 (2010)
1–30.
[21] D. Mandal, S. Dandapat, M. Gupta, P. Banerjee, S. Sarkar, Bengali and hindi to english
cross-language text retrieval under limited resources, in: CLEF (Working Notes), 2007.
[22] S. Bandyopadhyay, T. Mondal, S. K. Naskar, A. Ekbal, R. Haque, S. R. Godhavarthy, Bengali, hindi
and telugu to english ad-hoc bilingual task at clef 2007, in: Workshop of the Cross-Language
Evaluation Forum for European Languages, Springer, 2007, pp. 88–94.
[23] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., Okapi at TREC-3,
British Library Research and Development Department, 1995.
[24] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, Mpnet: Masked and permuted pre-training for language
understanding, Advances in Neural Information Processing Systems 33 (2020) 16857–16867.
[25] S. Chanda, K. Tewari, S. Pal, Overview of the cmir track at fire 2025: Code-mixed information
retrieval from social media data, in: FIRE ’25: Proceedings of the 17th Annual Meeting of the
Forum for Information Retrieval Evaluation, December 17-20, Varanasi, India, Association for
Computing Machinery (ACM), New York, NY, USA, 2025.
[26] S. Chanda, K. Tewari, S. Pal, Overview of the cmir track at fire 2025: Code-mixed information
retrieval from social media data, in: K. Ghosh, T. Mandl, S. Pal, S. Majumdar, A. Chakraborty
(Eds.), Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2025), December 17-20,
Varanasi, India, CEUR-WS.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , Introduction to information retrieval, volume
          <volume>39</volume>
          , Cambridge University Press Cambridge,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Hambarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <article-title>Information retrieval: recent advances and beyond</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>76581</fpage>
          -
          <lpage>76604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>The efect of stopword removal on information retrieval for code-mixed data obtained via social media</article-title>
          ,
          <source>SN Comput. Sci. 4</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data</article-title>
          ,
          <source>in: FIRE 2024 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          , p.
          <fpage>124</fpage>
          -
          <lpage>128</lpage>
          . URL: https://ceur-ws.org/Vol-4054/T2-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on code-mixed information retrieval from social media data, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, Association for Computing Machinery</article-title>
          ,
          <year>2025</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , S.-w. Hwang,
          <article-title>Contrastivemix: overcoming code-mixing dilemma in cross-lingual transfer for information retrieval</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          (Volume
          <volume>2</volume>
          : Short Papers)
          ,
          <year>2024</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          , et al.,
          <article-title>Search engines: Information retrieval in practice</article-title>
          , volume
          <volume>520</volume>
          ,
          Addison-Wesley Reading
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaushish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vijayvargiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Optimized spoken query cross-lingual document retrieval using bm25 and neural re-ranking with adamw</article-title>
          ,
          <source>in: 2025 4th OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 5</source>
          .0, IEEE,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1910.14424</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1901.04085</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Serverless BM25 search and BERT reranking</article-title>
          ,
          <source>in: DESIRES</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sannigrahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van Genabith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>España-Bonet</surname>
          </string-name>
          ,
          <article-title>Are the best multilingual document embeddings simply based on sentence embeddings?</article-title>
          ,
          <source>arXiv preprint arXiv:2304.14796</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Litschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <article-title>Boosting zero-shot cross-lingual retrieval by training on artificially code-switched data</article-title>
          ,
          <source>arXiv preprint arXiv:2305.05295</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W. A. G.</given-names>
            <surname>Kodri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fitriadi</surname>
          </string-name>
          ,
          <article-title>Fine-hybrid: Integration of BM25 and fine-tuned SBERT to enhance search relevance</article-title>
          ,
          <source>Teknika</source>
          <volume>14</volume>
          (
          <year>2025</year>
          )
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets</article-title>
          ,
          <source>Computación y Sistemas</source>
          <volume>20</volume>
          (
          <year>2016</year>
          )
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>Overview of the mixed script information retrieval (MSIR) at FIRE-2016</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kowsher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Bengali information retrieval system (BIRS)</article-title>
          ,
          <source>International Journal on Natural Language Computing (IJNLC)</source>
          <volume>8</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>