<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spoken-Query Cross-Lingual Information Retrieval for the Indic Languages (SqCLIR) using BM25 and Indic-BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pranesh TT</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thamizhmathi KK</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bharathi B</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Voice-based search is becoming more common, but spoken queries are challenging for retrieval systems because they often contain transcription errors and, in many cases, need to work across languages. The SqCLIR task at FIRE 2025 addresses this challenge with two subtasks: a monolingual setup and a cross-lingual setup. For our participation, we focused on the Hindi monolingual track, where the goal was to retrieve relevant passages from a Hindi text collection using queries spoken in Hindi. To explore variation in speech input, we generated queries using both male and female voices. Our experiments combined traditional BM25 retrieval with neural embedding approaches using Indic-BERT and FAISS indexing. The results were encouraging, showing that embedding-based retrieval can improve recall and ranking compared to baselines. We also discuss how transcription errors and speaker characteristics influence retrieval effectiveness and how multilingual embeddings can provide benefits even within a monolingual setup.</p>
      </abstract>
      <kwd-group>
        <kwd>Spoken Query Retrieval</kwd>
        <kwd>Cross-Language Information Retrieval</kwd>
        <kwd>SqCLIR</kwd>
        <kwd>Hindi Monolingual</kwd>
        <kwd>Speech Recognition</kwd>
        <kwd>Indic-BERT</kwd>
        <kwd>FAISS</kwd>
        <kwd>BM25</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Background and Motivation</title>
        <p>
          Voice-based technologies such as virtual assistants, search engines, and conversational systems are
now a part of everyday life. As more users interact with machines through speech, spoken query
retrieval has become an important research challenge. Compared to typed queries, spoken queries
are often noisy—they may include disfluencies, mispronunciations, or background interference that
make retrieval harder [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In a multilingual country like India, this challenge is even greater, since
users often ask queries in their native language. This makes the development of robust Spoken Query
Cross-Language Information Retrieval (SqCLIR) systems particularly relevant [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. The SqCLIR Task at FIRE 2025</title>
        <p>
          The Forum for Information Retrieval Evaluation (FIRE 2025) introduced the second edition of the
SqCLIR shared task to promote research on speech-driven information retrieval systems [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ]. The task
provides both monolingual and cross-lingual tracks, where spoken queries must be matched against
large collections of text documents. The official task website (https://sites.google.com/view/sqclir-2025)
offers detailed guidelines, dataset specifications, and evaluation protocols. Queries are released as audio
recordings spoken by both male and female speakers, and are accompanied by transcriptions generated
using Automatic Speech Recognition (ASR). In our work, we focus on the Hindi monolingual track,
where both queries and documents are in Hindi.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Related Work</title>
        <p>Prior research in information retrieval has explored both lexical ranking models and neural
embedding-based approaches. While these methods have proven effective for text and cross-lingual retrieval,
their adaptation to spoken queries—especially for Indic languages—remains limited. We provide a
detailed survey of these approaches in Section 2.</p>
      </sec>
      <sec id="sec-1-4">
        <title>1.4. Our Contributions</title>
        <p>To address these issues, we study the Hindi monolingual spoken query retrieval task at FIRE 2025. Our
work makes the following contributions:</p>
        <p>• We evaluate both traditional lexical retrieval (BM25) and embedding-based approaches using
IndicBERT combined with FAISS indexing.</p>
        <p>• We investigate the impact of speaker variation by comparing retrieval performance across queries
spoken by male and female speakers.</p>
        <p>• We highlight the trade-offs between lexical and embedding-based retrieval in handling noisy
spoken queries, offering insights for developing more robust SqCLIR systems in low-resource
settings.</p>
        <p>The remainder of this paper is organized as follows. Section 2 presents a literature survey of related
work in spoken query retrieval, lexical methods, and embedding-based approaches. Section 3 describes
the methodology, including data preparation, retrieval workflow, and evaluation setup. Section 4 reports
the experimental results of BM25 and Indic-BERT on the Hindi SqCLIR task, followed by a comparative
discussion. Finally, Section 5 concludes the paper and outlines possible directions for future research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Survey</title>
      <p>
        In recent years, research in the field of spoken query retrieval has gained momentum due to the growing
use of speech-enabled technologies such as virtual assistants, search engines, and conversational systems.
Spoken queries differ from typed queries because they often contain disfluencies, mispronunciations,
or background noise, which makes retrieval more challenging [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In multilingual settings like India,
cross-lingual information retrieval is further complicated by the diversity of languages and dialects.
Several recent works have addressed these challenges, focusing on either lexical or embedding-based
approaches for robust retrieval. The SqCLIR task introduced by FIRE further emphasizes the importance
of developing effective speech-driven retrieval methods for Indian languages [
        <xref ref-type="bibr" rid="ref2">2, 5, 6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Lexical Retrieval Methods: BM25</title>
        <p>
          Robertson et al. [7] introduced BM25, a probabilistic relevance framework widely used in traditional
information retrieval tasks. BM25 ranks documents based on term frequency and inverse document
frequency, providing strong baselines for text retrieval. In the context of spoken queries, BM25 has been
applied to transcriptions obtained from automatic speech recognition (ASR) systems. Previous studies in
FIRE SqCLIR tasks have shown that BM25 can achieve competitive performance on monolingual Hindi
queries when ASR errors are low [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, lexical methods struggle with semantic mismatches
caused by paraphrasing or vocabulary variations in spoken language.
        </p>
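        <p>The term-frequency and inverse-document-frequency weighting described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function name bm25_scores, the parameter values k1 = 1.2 and b = 0.75, the smoothed IDF variant, and the three toy Hindi documents are all assumptions made for the example.</p>

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF (assumed variant) that never becomes negative.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy collection (hypothetical), tokenized on whitespace.
docs = [
    "भारत की राजधानी नई दिल्ली है".split(),
    "क्रिकेट भारत में लोकप्रिय खेल है".split(),
    "मौसम आज साफ है".split(),
]
query = "भारत की राजधानी".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

        <p>Because the score only accumulates over terms shared by the query and the document, an ASR error that alters a query term removes that term's contribution entirely, which is the lexical-mismatch weakness noted above.</p>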
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Embedding-Based Retrieval: IndicBERT and Multilingual Models</title>
        <p>To handle semantic variations and cross-lingual queries, embedding-based methods have been widely
explored. Multilingual sentence embeddings, such as LaBSE [8] and IndicBERT [9], map queries and
documents into a shared semantic space, enabling similarity-based retrieval. IndicBERT has been
fine-tuned on Indian language corpora and shown to improve retrieval performance over lexical methods by
capturing semantic context and mitigating ASR errors. Embedding-based approaches are particularly
effective for low-resource languages and for queries that exhibit code-mixing or regional vocabulary
differences.</p>
        <p>Beyond Indic-specific models, dense retrieval methods developed in the international IR community
have significantly advanced the state of the art. DPR (Dense Passage Retrieval) [10] introduced a
dual-encoder framework for efficient passage retrieval, while ColBERT (Contextualized Late
Interaction over BERT) [11] proposed a late interaction mechanism that balances efficiency with fine-grained
token-level matching. Further, ANCE (Approximate Nearest Neighbor Negative Contrastive
Estimation) [12] improved training through dynamic negative sampling, and TAS (Topic Aware Sampling) [13]
demonstrated the effectiveness of knowledge distillation and topic-aware sampling for lightweight
dense retrievers. These models highlight the broader progress in dense retrieval for English and other
high-resource languages, offering insights that can inform spoken query retrieval in Indic settings.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Spoken Query Retrieval Challenges in FIRE</title>
        <p>
          Previous FIRE SqCLIR tracks have highlighted challenges specific to spoken query retrieval in Indian
languages [
          <xref ref-type="bibr" rid="ref2">5, 6, 2</xref>
          ]. ASR errors, speaker variations (male vs. female utterances), and domain-specific
vocabulary significantly affect retrieval performance. Some approaches combine lexical and
embedding-based methods to balance precision and semantic recall. These studies provide a foundation for further
improvements by analyzing both transcription quality and embedding effectiveness.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Current Work</title>
        <p>Our study extends prior FIRE SqCLIR efforts by focusing on the Hindi monolingual track. We evaluate
both lexical (BM25) and embedding-based (IndicBERT) retrieval, with a specific emphasis on speaker
variability (male vs. female queries). This positions our work as a step toward more robust spoken
query retrieval in low-resource settings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Objective and Research Questions</title>
        <p>The primary goal of this research is to improve document retrieval for spoken Hindi queries over a large
collection of Hindi documents. Specifically, we compare the effectiveness of a lexical method (BM25 [7])
with an embedding-based method (IndicBERT [14]) when applied to Automatic Speech Recognition
(ASR) transcripts of spoken queries. Our research is guided by two main questions. First, how does
BM25 compare to IndicBERT in retrieving relevant documents for noisy spoken Hindi queries? Second,
how does speaker variability, particularly the difference between male and female utterances, influence
retrieval performance? Answering these questions allows us to analyze the trade-offs between lexical
and embedding-based retrieval approaches in the context of monolingual Hindi SqCLIR.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preparation</title>
        <p>
          The dataset used in this study is based on the Hindi resources released as part of the FIRE 2025 SqCLIR
shared task [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ], combined with our own additional recorded queries. Spoken queries were provided
in two evaluation sets (DL19 and DL20), with balanced representation from male and female speakers:
43 male and 43 female queries in DL19, and 54 male and 54 female queries in DL20. In addition, FIRE
released official text queries for both DL19 and DL20 (43 and 54 queries respectively). To introduce
additional speaker variability, we recorded 96 new spoken queries from male speakers only. In total, the
dataset consists of 291 queries spanning spoken, text, and recorded sources. The document collection
against which retrieval was performed consists of 8,841,823 Hindi news articles released as part of FIRE
2025. Table 1 summarizes the distribution of queries.
        </p>
        <p>All spoken queries were transcribed into text using the Whisper ASR model [15]. Whisper was selected
over alternatives such as Wav2Vec2 and Google Speech-to-Text due to its robustness to background
noise and disfluencies, its multilingual coverage, and its open-source availability, making it particularly
suitable for Indic language research.</p>
        <p>The document collection and transcribed queries were normalized by tokenizing text, converting all
words to lowercase, and removing Hindi stopwords. For embedding-based retrieval, we used IndicBERT,
a transformer-based model pretrained on 12 major Indic languages (including Hindi) and English [14].
Its multilingual pretraining on large-scale web and Wikipedia corpora makes it particularly well-suited
for Hindi retrieval tasks, offering a stronger representation baseline than generic multilingual models
such as mBERT.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Retrieval Workflow</title>
        <p>The retrieval workflow is illustrated in Figure 1. Spoken queries were first transcribed by Whisper ASR
to generate textual inputs. Both the queries and the documents then underwent preprocessing. Retrieval
was carried out using two complementary approaches. The first approach, BM25, is a probabilistic
lexical method based on term frequency and inverse document frequency [7], which provides a strong
baseline. The second approach, IndicBERT, is a transformer-based embedding model pretrained on a
large corpus of 12 Indic languages [14]. IndicBERT produces dense vector representations of queries
and documents, and retrieval is performed by computing cosine similarity using FAISS indexing. Finally,
the ranked outputs from both methods were formatted into TREC-style run files to enable standardized
evaluation and comparison.</p>
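        <p>The dense retrieval and run-file steps can be illustrated with a short sketch. It is not the pipeline used in this work: exhaustive NumPy search stands in for a FAISS index, toy three-dimensional vectors stand in for IndicBERT embeddings, and the function names, document ids, and run tag are hypothetical. The point it shows is that after L2 normalization the inner product equals cosine similarity, so an inner-product index can serve cosine ranking.</p>

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length so that inner product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k_cosine(query_vecs, doc_vecs, k):
    """Exhaustive cosine search; returns (scores, indices) with the best match first."""
    sims = l2_normalize(query_vecs) @ l2_normalize(doc_vecs).T
    order = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, order, axis=1), order

def trec_run_lines(query_ids, doc_ids, scores, indices, tag):
    """Format results as 'qid Q0 docid rank score tag', the usual TREC run layout."""
    lines = []
    for q, qid in enumerate(query_ids):
        ranked = zip(indices[q], scores[q])
        for rank, (idx, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {doc_ids[idx]} {rank} {score:.4f} {tag}")
    return lines

# Toy vectors standing in for real sentence embeddings (assumption).
doc_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
query_vecs = np.array([[0.0, 3.0, 0.1]])
scores, indices = top_k_cosine(query_vecs, doc_vecs, k=2)
run = trec_run_lines(["q1"], ["d0", "d1", "d2"], scores, indices, tag="dense-sketch")
print("\n".join(run))
```

        <p>For a collection of several million documents, exhaustive matrix multiplication in memory becomes costly, which is where an index structure such as those provided by FAISS is typically used instead.</p>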
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation Setup</title>
        <p>The evaluation considered three types of queries: FIRE male queries, FIRE female queries, and our
own recorded male queries. Retrieval effectiveness was assessed using Mean Reciprocal Rank (MRR),
Recall@K, and nDCG [16]. This setup provided a systematic comparison between BM25 and IndicBERT
under varying speaker conditions, while also allowing us to analyze the effect of ASR transcription
quality on retrieval performance.</p>
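        <p>For reference, the three metrics can be computed per query as in the sketch below; MRR is then the mean of the reciprocal ranks over all queries. The function names, the toy ranking, and the relevance judgments are illustrative assumptions, and the nDCG variant shown (linear gain with a log2 rank discount) is one common formulation that may differ in detail from the official evaluation tooling.</p>

```python
import math

def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc id to graded relevance; linear gain, log2 discount."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking and judgments for a single query.
ranking = ["d3", "d1", "d7", "d2"]
gains = {"d1": 2, "d2": 1, "d9": 1}
relevant = {doc_id for doc_id, gain in gains.items() if gain > 0}

rr = reciprocal_rank(ranking, relevant)
rec = recall_at_k(ranking, relevant, k=3)
ndcg = ndcg_at_k(ranking, gains, k=4)
print(rr, round(rec, 4), round(ndcg, 4))
```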
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Summary</title>
        <p>In summary, our methodology follows a structured pipeline that begins with transcription of spoken
queries using Whisper ASR, followed by preprocessing of both queries and documents. Retrieval is
then performed using BM25 as a lexical baseline and IndicBERT as an embedding-based method. The
results are evaluated using standard IR metrics to compare the two approaches under different speaker
conditions. BM25 provides an interpretable baseline, while IndicBERT leverages pretrained multilingual
embeddings specifically tailored for Indic languages. Together, these methods enable a structured
comparison between lexical and semantic retrieval strategies for Hindi SqCLIR.</p>
        <p>The source code and TREC run files are available at https://github.com/Pranesh4950/fire2025-sqclir.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Comparative Analysis</title>
      <p>The following tables present the evaluation of two retrieval models, BM25 and IndicBERT, on the
SqCLIR task using spoken and text queries. The metrics reported include nDCG@10, Reciprocal Rank
(RR), and Recall at different cutoffs (10, 100, and 1000). The results highlight differences in retrieval
effectiveness across query types and between the two models.</p>
      <sec>
        <title>4.1. Discussion of Results</title>
        <sec>
          <title>4.1.1. BM25</title>
          <p>BM25 demonstrates consistently stronger retrieval effectiveness than IndicBERT across all
query types. As shown in Table 2, for the given text queries BM25 achieves the highest nDCG@10 (0.2024)
and RR (0.4497), indicating that relevant documents are ranked highly. Recall also improves significantly
at higher cutoffs (R@1000 = 0.3358), suggesting that BM25 is able to capture a broader set of relevant
documents.</p>
          <p>For spoken queries, both male and female, BM25 achieves moderate results, with female queries
slightly outperforming male queries (nDCG@10: 0.0951 vs. 0.0751, RR: 0.2462 vs. 0.1872). This suggests
that the female spoken queries are transcribed in a way that aligns more closely with the document
collection. Recorded spoken queries (male) yield results similar to the given spoken queries, with small
improvements at larger recall cutoffs.</p>
        </sec>
        <sec id="sec-4-1">
          <title>4.1.2. IndicBERT</title>
          <p>IndicBERT performs worse than BM25 across all metrics. From Table 3, we observe that the given text
queries achieve the best performance among all IndicBERT results (nDCG@10 = 0.1638, RR = 0.3618).
However, these values remain lower than those of BM25. Similarly, recall at 1000 documents (0.2608) falls short
of BM25’s coverage.</p>
          <p>For spoken queries, IndicBERT shows a noticeable performance gap. Male spoken queries achieve
the lowest nDCG@10 (0.0586), while female spoken queries perform slightly better (0.0715) but still
trail BM25. Recorded spoken queries yield a marginal improvement (nDCG@10 = 0.0744) but remain
below BM25.</p>
        </sec>
        <sec id="sec-4-2">
          <title>4.1.3. Comparative Analysis</title>
          <p>Across all experiments, BM25 consistently outperforms IndicBERT. One clear trend is that both models
perform best on the given text queries, reflecting the absence of ASR transcription errors. However, the
number of relevant tokens differs significantly between query types (e.g., 2296 for text vs. 1066 for
spoken). This discrepancy arises because ASR-transcribed queries often lose or alter terms, reducing
the overlap with relevant documents. As a result, spoken queries inherently lead to lower retrieval
effectiveness.</p>
          <p>The weaker performance of IndicBERT can be attributed to several factors. First, the model was not
fine-tuned on the SqCLIR dataset, which limits its ability to capture domain-specific query–document
relevance. Second, IndicBERT is sensitive to ASR noise, where transcription errors distort embeddings
and reduce similarity with relevant documents. Third, while IndicBERT is pretrained on multiple Indic
languages, its representations may not fully capture Hindi-specific nuances, especially when compared
against BM25’s direct term-matching mechanism.</p>
        </sec>
        <sec>
          <title>4.1.4. Overall</title>
          <p>In summary, BM25 demonstrates robustness to transcription noise and provides higher ranking
effectiveness across all query types, making it the stronger retrieval model in our SqCLIR setting. IndicBERT
remains a valuable semantic baseline but highlights the challenges of embedding-based methods for
noisy, real-world Hindi speech queries without task-specific fine-tuning.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we evaluated BM25 and IndicBERT for the Hindi monolingual SqCLIR task at FIRE 2025.
Our results show that BM25 consistently outperforms IndicBERT, with both models performing best on
text queries and degrading on spoken queries due to ASR errors. Female spoken queries yield
slightly higher effectiveness than male queries, but overall performance remains limited.</p>
      <p>These findings underscore two key limitations: (i) ASR noise significantly reduces retrieval quality,
and (ii) IndicBERT embeddings underperform without task-specific fine-tuning. Addressing these
challenges motivates future directions such as hybrid retrieval that combines lexical precision with
semantic embeddings, fine-tuning IndicBERT on Hindi speech-text pairs, and incorporating re-ranking
strategies to mitigate noise.</p>
      <p>Beyond benchmarking, advances in SqCLIR systems hold practical value for speech-enabled
information access in Indian languages, supporting inclusive search tools in domains such as digital libraries,
e-governance, and education.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>The authors declare that GPT-5 was used for grammar and spelling corrections.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to acknowledge the support and resources provided by the Department of
Computer Science and Engineering at Sri Sivasubramaniya Nadar College of Engineering, Chennai,
Tamil Nadu, India.</p>
      <p>[5] B. Dave, P. Majumder, D. Ganguly, E. Kanoulas, Overview of the fire 2024 sqclir track: Spoken
query cross-lingual information retrieval for the indic languages (2024).</p>
      <p>[6] B. Dave, P. Majumder, D. Ganguly, E. Kanoulas, Findings of shared task on spoken query
cross-lingual information retrieval for the indic languages at fire 2024, in: Proceedings of the 16th
Annual Meeting of the Forum for Information Retrieval Evaluation, 2024.</p>
      <p>[7] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: Bm25 and beyond,
Foundations and Trends® in Information Retrieval 3 (2009) 333–389.</p>
      <p>[8] C. Feng, et al., Language-agnostic bert sentence embedding, in: ACL, 2020.</p>
      <p>[9] A. Aravapalli, M. Marreddy, S. R. Oota, R. Mamidi, M. Gupta, Indicsenteval: How effectively do
multilingual transformer models encode linguistic properties for indic languages?, arXiv preprint
arXiv:2410.02611 (2024).</p>
      <p>[10] V. Karpukhin, B. Oguz, S. Min, P. S. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage
retrieval for open-domain question answering, in: EMNLP (1), 2020, pp. 6769–6781.</p>
      <p>[11] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via contextualized late
interaction over bert, in: Proceedings of the 43rd International ACM SIGIR conference on research
and development in Information Retrieval, 2020, pp. 39–48.</p>
      <p>[12] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate nearest
neighbor negative contrastive learning for dense text retrieval, arXiv preprint arXiv:2007.00808
(2020).</p>
      <p>[13] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense
retriever with balanced topic aware sampling, in: Proceedings of the 44th international ACM
SIGIR conference on research and development in information retrieval, 2021, pp. 113–122.</p>
      <p>[14] D. Kakwani, et al., Indicbert: A multilingual language model for indian languages, in: LREC, 2020.</p>
      <p>[15] C. Graham, N. Roll, Evaluating openai’s whisper asr: Performance analysis across diverse accents
and speaker traits, JASA Express Letters 4 (2024).</p>
      <p>[16] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of ir techniques, in: ACM Transactions
on Information Systems (TOIS), volume 20, ACM, 2002, pp. 422–446.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Spoken query processing for interactive information retrieval</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>41</volume>
          (
          <year>2002</year>
          )
          <fpage>105</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Sqcliril: Spoken query cross-lingual information retrieval in indian languages</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          , E. Kanoulas,
          <article-title>Overview of the second shared task on spoken query cross-lingual information retrieval for indic languages sqclir at fire 2025</article-title>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          , E. Kanoulas,
          <article-title>Findings from the second shared task on spoken query cross-lingual information retrieval for indic languages sqclir at fire 2025</article-title>
          ,
          <source>in: Proceedings of the 17th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>