<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIRaCLE: Multilingual Information Retrieval with Cross-Lingual Embeddings for Mathematical Expressions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riti Tripathi</string-name>
          <email>rititripathi09@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bennett University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kalinga Institute of Industrial Technology</institution>
          ,
          <addr-line>Bhubaneswar</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Cross-Lingual Mathematical Information Retrieval (CLMIR) addresses the challenge of retrieving documents containing mathematical expressions and text across languages, an area where most existing systems remain monolingual. As part of the FIRE 2025 CLMIR shared task, which targets English-Hindi retrieval using a curated corpus of 39,862 Hindi instances from Math StackExchange (ARQMath-1) and 50 English queries, we developed and evaluated two hybrid retrieval models. Run 1 combines the multi-qa-MiniLM-L6-cos-v1 model with regex-based text cleaning, while Run 2 leverages the all-mpnet-base-v2 model, spaCy-based preprocessing, and an enhanced numerical similarity function for mathematical expressions. Both systems adopt a hybrid scoring strategy integrating semantic embeddings and symbolic math similarity, with FAISS employed for efficient large-scale indexing. Performance was assessed using Precision@10 (P@10), Mean Average Precision (MAP), and normalized Discounted Cumulative Gain (nDCG). Run 2 achieved competitive results (P@10: 0.122, MAP: 0.165, nDCG: 0.3063), ranking among the top-performing teams in CLMIR 2025. The results underscore the effectiveness of robust multilingual embeddings and refined math similarity computation, while suggesting future improvements through adaptive weighting and multi-expression handling.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-Lingual Information Retrieval</kwd>
        <kwd>Mathematical Information Retrieval</kwd>
        <kwd>Sentence Transformers</kwd>
        <kwd>FAISS</kwd>
        <kwd>SymPy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The FIRE 2025 CLMIR shared task corpus, derived from Math Stack Exchange, contains 39,862 Hindi instances, enriched with both
formulae and explanatory text, and provides 50 English queries for system evaluation.</p>
      <p>In this work, we present our participation in the FIRE 2025 CLMIR shared task. We propose a hybrid
architecture that integrates multilingual sentence-transformer embeddings with SymPy-based formula
similarity measures, and employs FAISS for efficient large-scale indexing. To enhance robustness, we
adopt preprocessing strategies for handling noisy mathematical notation and employ hybrid scoring
functions that balance semantic and symbolic similarity. Evaluation on the official benchmark
demonstrates competitive performance, underscoring the effectiveness of combining multilingual neural
representations with symbolic reasoning for CLMIR.</p>
      <p>The rest of this paper is organized as follows: Section 2 discusses related work; Section 3 describes
the dataset; Section 4 presents our proposed methodology; Section 5 reports results and analysis; and
Section 6 concludes with key findings and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Research in Mathematical Information Retrieval (MIR) has addressed the problem of representing
mathematical content in ways that allow both symbolic and textual aspects to be effectively indexed and
retrieved. A major focus has been on developing structural encodings of formulae that preserve their
semantics while enabling retrieval operations similar to text search. Canonicalization and operator tree
methods provided one of the earliest means to support structural similarity, showing improvements
in identifying related expressions beyond surface string matching [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        The question of how to index mathematical expressions efficiently has been widely studied. Early
indexing systems designed specifically for formulae applied tree-based and string-based encodings,
demonstrating scalable retrieval across large digital libraries [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Further refinements incorporated
text metadata and ranking strategies, where structural signals were combined with contextual
information to improve relevance estimation [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        Benchmark-driven research provided a strong impetus for MIR development. The NTCIR-11
Math2 Task introduced a formula retrieval evaluation using a Wikipedia corpus of 100 test queries, with
metrics such as Precision@5 and Mean Reciprocal Rank (MRR) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Participating systems
demonstrated that hybrid approaches combining formula encodings with semantic text improved retrieval.
For instance, the IFISB system extracted features from formula sequences and integrated them with
contextual representations, highlighting the importance of text–math fusion.
      </p>
      <p>
        The NTCIR-12 MathIR Task extended this setup to Wikipedia and arXiv corpora, incorporating both
keyword and formula queries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The ICST system introduced a hybrid indexing model that leveraged
semantic operator trees and RankBoost for ranking, achieving a Precision@5 of 0.4733 on Wikipedia
queries [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These findings illustrated the advantage of structural weighting combined with
learning-to-rank methods for large-scale MIR.
      </p>
      <p>
        Community-driven benchmarks have further shaped the field. The ARQMath 2020 Lab introduced
mathematical question answering and formula retrieval using Math Stack Exchange data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Task 1
(Answer Retrieval) covered 77 topics, while Task 2 (Formula Retrieval) focused on 45 topics, evaluated
with  and MAP. Systems such as DPRL3 and zbMATH integrated textual features with formula
similarity, but reported peak  values of only 0.042, reflecting the difficulty of retrieval in noisy,
user-generated CQA content [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Algorithmic innovations have also shaped MIR. The Tangent search engine introduced Maximum
Subtree Similarity (MSS), where formulae were represented via symbol pair indexing and compact
structural encodings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Evaluated on the NTCIR-11 benchmark, Tangent achieved a P@5 of
approximately 92%, demonstrating that subtree-based retrieval could balance efficiency with high retrieval
quality. Techniques such as MSS subsequently influenced hybrid retrieval architectures by showing
that compact structural representations scale well in practice.
      </p>
      <p>
        The application of neural architectures has introduced new perspectives into MIR. Structure-aware
deep models have been developed to capture the syntactic properties of formulae, while semantic
embeddings allowed for generalization across notational variants. Recursive and graph-based embeddings
demonstrated the potential of bridging symbolic structure with semantic equivalence [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. At the
same time, contextual embeddings from transformers enabled dense representations of surrounding
text, aligning formulae with their semantic usage [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
      </p>
      <p>Parallel work in cross-lingual information retrieval (CLIR) provides relevant foundations for
extending MIR to multilingual settings. Dictionary-based alignment methods and statistical translation
models established early baselines, while vector space and positional language models refined ranking
strategies [17, 18]. More recently, multilingual transformers such as mBERT and XLM-R demonstrated
strong cross-lingual transfer without requiring parallel corpora [19, 20]. Region-specific pretraining
has further improved retrieval for Indian languages, with IndicBERT and MuRIL handling code-mixing,
transliteration, and other linguistic phenomena common in South Asian contexts [21, 22].</p>
      <p>Our work builds upon these strands of research by combining multilingual semantic embeddings
with symbolic reasoning for mathematical expressions. Specifically, we introduce a hybrid retrieval
model that leverages sentence-transformer representations for cross-lingual alignment, SymPy-based
similarity for formula comparison, and FAISS indexing for scalable retrieval. By integrating symbolic
and neural methods within a unified framework, our approach aims to address the limitations of purely
monolingual or symbolic systems and establish a more robust baseline for cross-lingual mathematical
information retrieval.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this study is released as part of the FIRE-2025 CLMIR shared task. It is derived from
the Math StackExchange corpus used in ARQMath-1 and has been adapted to support cross-lingual
retrieval between English and Hindi. The training corpus, provided in Train_set.csv, consists of
39,862 mathematical questions and answers written primarily in Hindi, enriched with LaTeX-formatted
mathematical expressions. Each entry is organized into four fields: a unique identifier, a Hindi title
describing the problem, a body containing explanatory text and embedded formulas, and a set of topical
tags covering domains such as probability, algebra, group theory, graph theory, etc. The titles and tags
suggest that many questions are translations or adaptations of English mathematical queries, offering
a localized benchmark for cross-lingual evaluation.</p>
      <p>The body of posts typically contains descriptive text in Hindi, often mixed with English terminology,
along with LaTeX-formatted mathematical content. Examples range from probability calculations and
group-theoretic questions to systems of congruences and functional equations, making the dataset
suitable for retrieval tasks that integrate both textual and symbolic reasoning. Tags further categorize
posts into mathematical subfields, facilitating filtering and thematic analysis. Table 1 summarizes the
structure of the training dataset.</p>
      <p>The official test set, Test_Data.csv, comprises 50 English queries designed to retrieve relevant
Hindi posts from the training collection. Each query consists of a unique identifier, a LaTeX-formatted
mathematical expression, and a short English description indicating its mathematical domain. Queries
span a diverse range of topics, including geometry (e.g., the Pythagorean theorem), calculus (e.g.,
definite integrals and derivatives), linear algebra (e.g., eigenvalues and dot products), statistics (e.g.,
variance), and partial differential equations (e.g., Laplace and heat equations), as summarized in Table 2.</p>
      <p>The combination of Hindi documents and English queries highlights the challenges of cross-lingual
and math-heavy retrieval, motivating approaches that effectively integrate multilingual semantic
representations with symbolic formula understanding.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
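      <p>We address the problem of CLMIR, where an English query must retrieve Hindi mathematical
documents containing both natural language and LaTeX-based formulae. Formally, each query is
represented as <inline-formula><tex-math>q = (q_t, q_m)</tex-math></inline-formula> and each document as <inline-formula><tex-math>d = (d_t, d_m)</tex-math></inline-formula>,
where <inline-formula><tex-math>d_t \in \Sigma_{\mathrm{hi}}</tex-math></inline-formula> corresponds to the textual part in Devanagari script and <inline-formula><tex-math>d_m</tex-math></inline-formula> is a set or sequence of LaTeX
expressions. The retrieval objective is to compute a hybrid similarity score
<disp-formula><tex-math>S(q, d) = \alpha \, S_{\text{text}}(q_t, \tau(d_t)) + (1 - \alpha) \, S_{\text{math}}(q_m, d_m),</tex-math></disp-formula>
where <inline-formula><tex-math>\tau</tex-math></inline-formula> is a transliteration function mapping Devanagari text into Latin script, <inline-formula><tex-math>S_{\text{text}}</tex-math></inline-formula> measures semantic
similarity between query and document text, <inline-formula><tex-math>S_{\text{math}}</tex-math></inline-formula> measures symbolic similarity between the
corresponding mathematical expressions, and <inline-formula><tex-math>\alpha \in (0, 1)</tex-math></inline-formula> balances the two components. For each query, the
system first retrieves a candidate set of documents based on text similarity, and then re-ranks them
using the hybrid score to produce the final top-<inline-formula><tex-math>k</tex-math></inline-formula> results.</p>
      <p>Table 1: Structure of the training dataset, illustrated with a sample entry. ID: 10000. Title: रैखिक बधाई की एक प्रणाली को.
Body: presents a system of linear congruences,
<inline-formula><tex-math>a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \equiv b_1 \pmod{m}, \; \ldots, \; a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n \equiv b_n \pmod{m}</tex-math></inline-formula>,
and questions the validity of row operations. Tags: संख्या-सिद्धांत (number theory), मॉड्यूलर-एरिथमेटिक (modular arithmetic),
रैखिक-बीजगणित (linear algebra).</p>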
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>The preprocessing stage begins by separating textual and mathematical components from both queries
and documents. LaTeX expressions are extracted using common inline and display delimiters. These
extracted spans form the mathematical parts <inline-formula><tex-math>(q_m, d_m)</tex-math></inline-formula>, while the surrounding content constitutes the
textual parts <inline-formula><tex-math>(q_t, d_t)</tex-math></inline-formula>. Any malformed or unparsable fragments are discarded.</p>
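        <p>The separation step can be sketched as follows. This is a minimal illustration, not the submitted code: the delimiter set ($...$, $$...$$, \(...\), \[...\]) and the function name split_text_and_math are assumptions, since the text only states that common inline and display delimiters are used.</p>
        <preformat><![CDATA[
```python
import re

# Delimiters assumed for illustration: $$...$$, \[...\], \(...\), and $...$.
# Display forms come first so "$$" is not consumed as two inline "$".
_MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"      # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"     # display math: \[ ... \]
    r"|\\\((.+?)\\\)"     # inline math:  \( ... \)
    r"|\$(.+?)\$",        # inline math:  $ ... $
    re.DOTALL,
)


def split_text_and_math(post: str) -> tuple[str, list[str]]:
    """Return (text_part, math_parts) for one query or document."""
    math_parts = []
    for match in _MATH_PATTERN.finditer(post):
        # Exactly one of the four groups is set for each match.
        body = next(g for g in match.groups() if g is not None).strip()
        if body:  # drop empty fragments such as "$ $"
            math_parts.append(body)
    # Replace every math span with a space, then collapse whitespace.
    text_part = " ".join(_MATH_PATTERN.sub(" ", post).split())
    return text_part, math_parts
```
]]></preformat>
        <p>Unbalanced delimiters simply fail to match and stay in the text part, which is one possible reading of how malformed fragments are handled.</p>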
        <p>Textual normalization differs slightly for English and Hindi. English text is lowercased, stripped
of punctuation, and cleaned for consistent whitespace. Hindi text is first transliterated into Latin
script using the ITRANS scheme (via the indic-transliteration library). This ensures
compatibility with English-trained embedding models while preserving semantic meaning (e.g., “समाकलन” →
“samakalan”). Further cleaning removes diacritics and harmonizes character variants. The text is then
tokenized with the model-specific tokenizer, yielding encoder-ready input.</p>
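        <p>The two cleaning steps that do not depend on an external library can be sketched with the Python standard library. The function names are hypothetical, and the transliteration call itself (done with indic-transliteration in the paper) is deliberately left out; strip_diacritics is meant to run on text that is already in Latin script.</p>
        <preformat><![CDATA[
```python
import string
import unicodedata


def normalize_english(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def strip_diacritics(latin_text: str) -> str:
    """Remove combining marks from transliterated (Latin-script) text."""
    decomposed = unicodedata.normalize("NFKD", latin_text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```
]]></preformat>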
        <p>To maintain input quality, documents with fewer than ten tokens or those consisting entirely of
mathematical expressions are excluded. Additionally, input length is capped at 512 tokens to comply
with transformer encoder limits.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Embedding Generation and Candidate Retrieval</title>
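        <p>Once preprocessed, the textual components of queries and documents are mapped into dense vector
embeddings using Sentence Transformer models. Each encoder outputs a normalized representation,
and relevance is measured through cosine similarity:
<disp-formula><tex-math>S_{\text{text}}(q, d) = \cos(\mathbf{h}_q, \mathbf{h}_d) = \frac{\mathbf{h}_q \cdot \mathbf{h}_d}{\lVert \mathbf{h}_q \rVert \, \lVert \mathbf{h}_d \rVert},</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbf{h}_q</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{h}_d</tex-math></inline-formula> denote the query and document embeddings. Candidate retrieval is
performed with a FAISS index over the document embeddings. This
approach balances speed and accuracy, producing the top <inline-formula><tex-math>K = 500</tex-math></inline-formula> candidate documents for each
query. These candidates are then subject to hybrid re-ranking.</p>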
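        <p>For reference, the exact ranking that the FAISS index approximates can be written in a few lines. The sketch below scores every document exhaustively with cosine similarity; the names cosine and top_k_candidates and the dictionary layout of doc_vecs are illustrative assumptions, and a real run would use encoder outputs rather than hand-written vectors.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_candidates(query_vec, doc_vecs, k=500):
    """Score every document and keep the k most similar.

    doc_vecs maps a document id to its embedding. An approximate index
    returns a close substitute for this list without a full scan.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```
]]></preformat>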
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Symbolic (Math) Similarity</title>
        <p>In addition to textual matching, mathematical content is treated as a first-class signal of relevance.</p>
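        <p>Run 1 employs a lightweight strategy: expressions are tokenized into operators, variables, and
symbols, and similarity is measured by the overlap between the token sets <inline-formula><tex-math>T(q_m)</tex-math></inline-formula> and <inline-formula><tex-math>T(d_m)</tex-math></inline-formula>,
where <inline-formula><tex-math>T(\cdot)</tex-math></inline-formula> denotes the extracted set of math tokens. This method prioritizes efficiency and rewards
exact or near-exact matches.</p>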
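        <p>One way to realize token overlap is the Jaccard coefficient over the two token sets. Both the tokenizer regex and the Jaccard normalization below are assumptions made for illustration; the text specifies only that similarity is based on the overlap of extracted math tokens.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical tokenizer: LaTeX commands (\frac), identifiers, numbers,
# and single-character operators or brackets.
_TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]+|\d+(?:\.\d+)?|[^\sA-Za-z\d]")


def math_tokens(latex: str) -> set[str]:
    """Set of math tokens found in one LaTeX expression."""
    return set(_TOKEN.findall(latex))


def token_overlap(query_math: str, doc_math: str) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| (assumed normalization)."""
    a, b = math_tokens(query_math), math_tokens(doc_math)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```
]]></preformat>
        <p>Because the comparison is purely set-based, two expressions that use the same symbols in a different arrangement receive the same score as an exact match, which is the kind of shallow equivalence the tree-based method is meant to improve on.</p>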
        <p>Run 2 applies a more expressive, structure-aware approach. Here, mathematical expressions are
parsed into operator-operand trees. Similarity is then computed by comparing overlapping nodes in
these trees, allowing the system to capture correspondences between expressions that differ
syntactically but remain structurally or semantically related.</p>
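        <p>The node-overlap idea can be illustrated with a stand-in parser. The sketch below uses Python's built-in ast module on expressions written in Python syntax (for example x**2 + 1) instead of parsing LaTeX with SymPy, and it scores overlap as a multiset Jaccard ratio; both choices are assumptions made to keep the example self-contained.</p>
        <preformat><![CDATA[
```python
import ast
from collections import Counter


def tree_nodes(expr: str) -> Counter:
    """Multiset of operator and operand labels in the expression tree."""
    labels = Counter()
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if isinstance(node, (ast.BinOp, ast.UnaryOp)):
            labels[type(node.op).__name__] += 1   # e.g. "Add", "Pow"
        elif isinstance(node, ast.Name):
            labels["var:" + node.id] += 1         # variables and function names
        elif isinstance(node, ast.Constant):
            labels["const:" + repr(node.value)] += 1
    return labels


def tree_similarity(expr_a: str, expr_b: str) -> float:
    """Shared tree nodes divided by all tree nodes (multiset Jaccard)."""
    a, b = tree_nodes(expr_a), tree_nodes(expr_b)
    shared = sum((a & b).values())   # multiset intersection
    total = sum((a | b).values())    # multiset union
    return shared / total if total else 0.0
```
]]></preformat>
        <p>Since node labels are compared without regard to position, x + y and y + x score 1.0 here even though their strings differ.</p>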
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Run Configurations</title>
        <p>Although the overall retrieval pipeline is shared across both runs, they diverge significantly in three
aspects: the choice of embedding models, the configuration of FAISS indexing, and the balance between
textual and mathematical similarity when computing the final hybrid score. These differences reflect
distinct design priorities, with Run 1 focusing on efficiency and scalability, while Run 2 aims for higher
retrieval accuracy and stronger cross-lingual robustness.</p>
        <p>Run 1 (MiniLM-based): This configuration adopts multi-qa-MiniLM-L6-cos-v1, a compact
Sentence Transformer model that produces 384-dimensional embeddings. The main advantage of MiniLM
lies in its efficiency. The smaller embedding size reduces memory usage, allowing the system to
index a large number of documents without excessive storage overhead. Moreover, its lightweight
architecture accelerates both the encoding and retrieval stages, making it suitable for scenarios where
computational resources are limited or where rapid response times are critical.</p>
        <p>Candidate retrieval in Run 1 is implemented using FAISS with a fixed configuration, where the
embedding space is partitioned into a predetermined number of clusters (nlist = 100) and a fixed number
of probes (nprobe = 10) are searched at query time. This static setting ensures consistent query latency
across the entire collection. While such an approach may occasionally sacrifice recall, since only a
limited subset of clusters is explored, it guarantees predictability and efficiency, both of which are
desirable for real-time or large-scale applications.</p>
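        <p>The recall trade-off of probing only a few clusters can be seen in a toy inverted-file search. This is a pure-Python illustration of the idea, not FAISS code: the centroids are supplied by hand (FAISS learns them with k-means during training), vectors are assumed to be unit-normalized so that the inner product equals cosine similarity, and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_ivf(doc_vecs, centroids):
    """Assign each document to the inverted list of its closest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, vec in doc_vecs.items():
        best = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        lists[best].append(doc_id)
    return lists


def ivf_search(query, doc_vecs, centroids, lists, nprobe, k):
    """Score only documents in the nprobe clusters closest to the query."""
    order = sorted(
        range(len(centroids)),
        key=lambda i: dot(query, centroids[i]),
        reverse=True,
    )
    probed = [doc_id for i in order[:nprobe] for doc_id in lists[i]]
    probed.sort(key=lambda doc_id: dot(query, doc_vecs[doc_id]), reverse=True)
    return probed[:k]
```
]]></preformat>
        <p>With two clusters and nprobe = 1, a document that is the second-best match overall but sits in the unprobed cluster is never scored; raising nprobe to 2 recovers it at the cost of scanning more vectors.</p>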
        <p>On the mathematical side, Run 1 employs the token-overlap method already described in the
previous subsection. Since this strategy does not capture deeper equivalences between expressions, the
hybrid scoring function deliberately places greater emphasis on the textual channel. The final
similarity score is computed with a 0.7 weight on text and 0.3 weight on math:
<disp-formula><tex-math>S(q, d) = 0.7 \cdot S_{\text{text}}(q, d) + 0.3 \cdot S_{\text{math}}(q, d).</tex-math></disp-formula></p>
        <p>This choice reflects the design philosophy of Run 1: textual similarity is treated as the primary signal
of relevance, while mathematical similarity plays a secondary, supportive role. In practice, this setup
enables rapid large-scale retrieval while still leveraging mathematical information for disambiguation
in cases where textual evidence alone is insufficient.</p>
        <p>Run 2 (MPNet-based): The second configuration takes a different stance, prioritizing robustness
and retrieval quality over raw speed. For the textual encoder, we employ all-mpnet-base-v2, which
generates 768-dimensional embeddings. MPNet offers stronger representational capacity compared to
MiniLM, making it more effective at capturing fine-grained semantic relationships. More importantly,
it has demonstrated better performance in multilingual and cross-lingual settings, which is
particularly important here, since English queries must retrieve documents written in Hindi. The trade-off,
however, is higher computational cost and larger memory requirements due to the increased
dimensionality.</p>
        <p>Candidate generation in Run 2 also differs from Run 1 in its use of an adaptive FAISS indexing
strategy. Rather than fixing the number of clusters and probes in advance, the index parameters are
scaled with the size of the document collection. Specifically, the number of clusters increases with
corpus size, ensuring finer partitioning of larger collections, while the number of probes is also adjusted
dynamically to improve recall. This adaptive mechanism enhances robustness: when applied to small
datasets, it avoids unnecessary overhead, but on larger datasets it improves retrieval coverage, reducing
the risk of missing relevant documents that might fall outside the probed clusters.</p>
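        <p>The exact scaling rule is not given in the text. As one plausible sketch, the number of clusters can grow with the square root of the collection size and a fixed fraction of clusters can be probed; the constants 4 and one tenth, and the function name adaptive_ivf_params, are assumptions and not the authors' settings.</p>
        <preformat><![CDATA[
```python
import math


def adaptive_ivf_params(n_docs: int) -> tuple[int, int]:
    """Return (nlist, nprobe) scaled to the collection size.

    Assumed heuristic, not the authors' rule: nlist grows as 4 * sqrt(N),
    and about one tenth of the clusters are probed (at least one).
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    nlist = max(1, min(n_docs, round(4 * math.sqrt(n_docs))))
    nprobe = max(1, nlist // 10)
    return nlist, nprobe
```
]]></preformat>
        <p>Under these assumed constants, the 39,862-document training collection would be split into 799 clusters with 79 probed per query, while a 100-document collection would use 40 clusters and 4 probes.</p>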
        <p>For the mathematical similarity component, Run 2 integrates the more expressive tree-based
structural approach introduced earlier. Since this strategy is capable of capturing structural and semantic
correspondences between expressions, the hybrid scoring function gives greater relative weight to the
mathematical channel compared to Run 1. The final score is defined as:
<disp-formula><tex-math>S(q, d) = 0.6 \cdot S_{\text{text}}(q, d) + 0.4 \cdot S_{\text{math}}(q, d).</tex-math></disp-formula></p>
        <p>Here, textual similarity remains slightly dominant, but mathematical similarity is treated as a more
substantial factor, reflecting the configuration’s emphasis on the dual importance of both language
and symbolic reasoning in cross-lingual mathematical retrieval.</p>
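        <p>To make the effect of the two weightings concrete, consider a hypothetical candidate with a text similarity of 0.50 and a math similarity of 0.80 (illustrative values, not taken from the runs). Under the Run 1 weights the score is
<disp-formula><tex-math>0.7 \cdot 0.50 + 0.3 \cdot 0.80 = 0.35 + 0.24 = 0.59,</tex-math></disp-formula>
whereas under the Run 2 weights it is
<disp-formula><tex-math>0.6 \cdot 0.50 + 0.4 \cdot 0.80 = 0.30 + 0.32 = 0.62.</tex-math></disp-formula>
A document whose formula matches well but whose text matches only moderately therefore gains 0.03 under Run 2, which can be enough to change its position among closely scored candidates.</p>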
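      </sec>
      <sec id="sec-4-5">
        <title>4.5. Hybrid Re-ranking</title>
        <p>For both runs, the system first retrieves 500 candidates using text similarity alone. These candidates are
then re-ranked using the hybrid scoring function <inline-formula><tex-math>S(q, d)</tex-math></inline-formula>. Finally, the top <inline-formula><tex-math>k = 50</tex-math></inline-formula> ranked documents are
returned. In the event of tied scores, preference is given to the document with higher text similarity.</p>
        <p>Algorithm 1: Hybrid Retrieval Model based on MiniLM for CLMIR (Run 1)</p>
        <list list-type="order">
          <list-item><p>Input: English query <inline-formula><tex-math>q = (q_t, q_m)</tex-math></inline-formula>, Hindi document collection <inline-formula><tex-math>D = \{d_1, d_2, \ldots, d_N\}</tex-math></inline-formula></p></list-item>
          <list-item><p>Textual Preprocessing: Normalize text, remove stopwords, perform stemming</p></list-item>
          <list-item><p>Mathematical Preprocessing: Extract LaTeX formulae, tokenize into operators, variables, and structures</p></list-item>
          <list-item><p>Encode textual part of query and documents using MiniLM: <inline-formula><tex-math>\mathbf{h}_q = f_{\text{MiniLM}}(q_t), \; \mathbf{h}_d = f_{\text{MiniLM}}(d_t)</tex-math></inline-formula></p></list-item>
          <list-item><p>Compute textual similarity: <inline-formula><tex-math>S_{\text{text}}(q, d) = \cos(\mathbf{h}_q, \mathbf{h}_d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Encode mathematical part with regex-based token matching: <inline-formula><tex-math>S_{\text{math}}(q, d)</tex-math></inline-formula> is the overlap between <inline-formula><tex-math>T(q_m)</tex-math></inline-formula> and <inline-formula><tex-math>T(d_m)</tex-math></inline-formula>, where <inline-formula><tex-math>T(\cdot)</tex-math></inline-formula> denotes extracted math tokens</p></list-item>
          <list-item><p>Combine similarities with linear interpolation: <inline-formula><tex-math>S(q, d) = 0.7 \cdot S_{\text{text}}(q, d) + 0.3 \cdot S_{\text{math}}(q, d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Rank documents <inline-formula><tex-math>d</tex-math></inline-formula> in descending order of <inline-formula><tex-math>S(q, d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Output: Ranked list of documents <inline-formula><tex-math>\{d_{(1)}, d_{(2)}, \ldots, d_{(k)}\}</tex-math></inline-formula></p></list-item>
        </list>
        <p>Algorithm 2: Hybrid Retrieval Model based on MPNet for CLMIR (Run 2)</p>
        <list list-type="order">
          <list-item><p>Input: English query <inline-formula><tex-math>q = (q_t, q_m)</tex-math></inline-formula>, Hindi document collection <inline-formula><tex-math>D = \{d_1, d_2, \ldots, d_N\}</tex-math></inline-formula></p></list-item>
          <list-item><p>Textual Preprocessing: Normalize text, remove stopwords, perform stemming</p></list-item>
          <list-item><p>Mathematical Preprocessing: Extract LaTeX formulae, represent expressions as operator trees</p></list-item>
          <list-item><p>Encode textual part of query and documents using MPNet: <inline-formula><tex-math>\mathbf{h}_q = f_{\text{MPNet}}(q_t), \; \mathbf{h}_d = f_{\text{MPNet}}(d_t)</tex-math></inline-formula></p></list-item>
          <list-item><p>Compute textual similarity: <inline-formula><tex-math>S_{\text{text}}(q, d) = \cos(\mathbf{h}_q, \mathbf{h}_d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Encode mathematical part via tree-based structural similarity: <inline-formula><tex-math>S_{\text{math}}(q, d)</tex-math></inline-formula> compares overlapping nodes of the operator trees of <inline-formula><tex-math>q_m</tex-math></inline-formula> and <inline-formula><tex-math>d_m</tex-math></inline-formula></p></list-item>
          <list-item><p>Combine similarities with linear interpolation: <inline-formula><tex-math>S(q, d) = 0.6 \cdot S_{\text{text}}(q, d) + 0.4 \cdot S_{\text{math}}(q, d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Rank documents <inline-formula><tex-math>d</tex-math></inline-formula> in descending order of <inline-formula><tex-math>S(q, d)</tex-math></inline-formula></p></list-item>
          <list-item><p>Output: Ranked list of documents <inline-formula><tex-math>\{d_{(1)}, d_{(2)}, \ldots, d_{(k)}\}</tex-math></inline-formula></p></list-item>
        </list>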
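        <p>The re-ranking step of this section, including the tie-breaking rule, can be sketched as below. The function name hybrid_rerank and the input layout (one triple of document id, text similarity, and math similarity per candidate) are assumptions for illustration; alpha is the text weight, 0.7 for Run 1 and 0.6 for Run 2.</p>
        <preformat><![CDATA[
```python
def hybrid_rerank(candidates, alpha, k=50):
    """Re-rank (doc_id, s_text, s_math) triples by the hybrid score.

    Score = alpha * s_text + (1 - alpha) * s_math. Ties on the hybrid
    score are broken in favour of the higher text similarity.
    """
    scored = [
        (doc_id, alpha * s_text + (1.0 - alpha) * s_math, s_text)
        for doc_id, s_text, s_math in candidates
    ]
    # Sort by hybrid score, then by text similarity, both descending.
    scored.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return [(doc_id, score) for doc_id, score, _ in scored[:k]]
```
]]></preformat>
        <p>Sorting on the pair (hybrid score, text similarity) keeps the tie-break inside a single sort call instead of a second pass over equal-scored groups.</p>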
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>IReL’s Run 2 shows a clear and substantial improvement over Run 1 across all reported metrics.
Run 1 recorded P@10 = 0.000, MAP = 0.0064 and nDCG = 0.0352, whereas Run 2 achieved P@10 =
0.122, MAP = 0.165 and nDCG = 0.3063. The P@10 jump from 0.000 to 0.122 is particularly important
for user-facing retrieval, as it reflects many more relevant documents appearing in the top-10 returned
items.</p>
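      <p>For readers less familiar with the three reported measures, a minimal per-query sketch is given below. It assumes binary relevance for P@k and average precision and graded gains with a log2 discount for nDCG; the official evaluation script may differ in cutoffs and gain handling, and the function names are illustrative. MAP is the mean of the average precision over all queries. A P@10 of 0.122 corresponds to 1.22 relevant documents in the top 10 on average.</p>
      <preformat><![CDATA[
```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, gains, k=None):
    """nDCG with a log2 discount; gains maps doc id to graded relevance."""
    k = len(ranked) if k is None else k
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```
]]></preformat>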
      <p>When compared against other teams, IReL’s Run 2 is competitive with the leading submissions. Its
P@10 of 0.122 surpasses Archisha Dhyani’s runs (0.028-0.050), all DUCS_CLMIR runs (0.006-0.010),
and Tends (0.080); it also outperforms most NLP Fusion runs (0.046-0.068) and is comparable to NLP
Fusion’s strongest run (0.118) and Retriever’s Run 2 (0.114). In MAP, Run 2 (0.165) ranks above Archisha
Dhyani (0.0712-0.1034) and DUCS_CLMIR (0.0145-0.0159), and is competitive with the stronger NLP
Fusion and Retriever runs (whose MAPs reach up to 0.2143). For nDCG, Run 2 (0.3063) is among the
top performers: it substantially outperforms most teams and is only slightly below Retriever’s best
submission (Run 3, nDCG = 0.3264). Taken together, these comparisons indicate that Run 2 attained
high ranking quality relative to the shared task field, especially in nDCG which measures the graded
quality of top-ranked results.</p>
      <p>A closer inspection of Run 1 shows several failure modes that explain its very low scores. Similarity
values produced by Run 1 span a wide but misleading range: some queries (e.g., q_1, q_2, q_5, q_46,
q_48, q_50) yielded relatively high similarity scores (reported values between approximately 0.2275
and 0.8500), yet these high scores did not translate into correct, diverse top-ranked documents. Instead,
Run 1 frequently returned identical document sets for multiple distinct queries (for example, the exact
same top candidates for q_1, q_2, q_5, q_46, q_48 and q_50), which strongly suggests insufficient
query-specific discrimination. One striking artifact is document 15243, which attains a very high
similarity (0.85000014) across many queries; this behaviour is consistent with an indexing or retrieval
bias. These issues explain why Run 1’s precision and ranking measures are near the bottom of the
leaderboard despite sporadically high pairwise similarity scores.</p>
      <p>By contrast, Run 2 shows both empirically and qualitatively improved behaviour. The range and
distribution of similarity scores for Run 2 are more informative and query-sensitive (for instance, reported
similarity ranges include 0.165-0.6017 for q_1 and 0.4587-0.5921 for q_2), indicating better
differentiation among candidate documents. Document-overlap analysis between the two runs confirms this:
in many cases Run 2 returns different and more relevant documents than Run 1 (e.g., q_1 retrieves
4446, 33248, … under Run 2 instead of 15243, 39015, … under Run 1). The improved P@10 (0.122)
and nDCG (0.3063) quantify these gains at the top ranks, while the MAP increase demonstrates
better overall ranking across the full list. That said, Run 2 is not uniformly strong for every query: an
interesting anomaly is q_3, for which Run 2 reports relatively low similarity values (approximately
0.2432-0.3309). This suggests that q_3 contains particularly challenging content (for example,
unusually complex formulae or mixed-language phenomena) that evades the model’s current alignment, and
it highlights remaining per-query weaknesses even in the stronger run.</p>
      <p>We also categorized query-level performance by topic domains such as calculus, algebra, and
geometry. Preliminary inspection revealed stronger performance for algebraic and linear equation queries,
where symbolic structures align closely, compared to geometric or descriptive text-heavy queries,
which rely more on semantic cues. This suggests that symbolic similarity contributes
disproportionately in formula-centric categories. Conversely, performance degradation for mixed or linguistically
rich queries highlights the need for adaptive weighting or task-specific fine-tuning.</p>
      <p>In summary, IReL’s submissions demonstrate a dramatic within-team improvement from Run 1 to
Run 2. Run 2 attains competitive performance among the shared task participants, particularly in
nDCG, and achieves substantial absolute gains in MAP and P@10 relative to Run 1. However, the
detailed per-query analysis reveals both systematic issues (indexing/retrieval bias in Run 1) and
remaining edge-case failures (e.g., q_3) that motivate further refinement. These observations underscore
that gains in cross-lingual mathematical retrieval arise from (i) improving the fidelity of math-to-math
matching, (ii) ensuring indexing and candidate retrieval are not dominated by a small set of artifacts,
and (iii) addressing difficult, query-specific phenomena through targeted modelling and evaluation
strategies.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>The CLMIR 2025 Shared Task highlighted the challenges of retrieving Hindi mathematical documents
from English queries that combine text with symbolic expressions. The IReL system explored two
approaches: the first run, based on a lightweight multilingual encoder and binary math similarity,
performed poorly with MAP = 0.0064, P@10 = 0.000, and nDCG = 0.0352, underscoring the limitations
of such models in cross-lingual math retrieval. In contrast, the second run, which incorporated the
all-mpnet-base-v2 model, dynamic FAISS indexing, and a hybrid scoring scheme combining text
and formula similarity, achieved substantially stronger results with MAP = 0.165, P@10 = 0.122, and
nDCG = 0.3063. These findings emphasize the importance of balancing semantic and symbolic signals
for effective retrieval. Remaining challenges include handling complex mathematical structures,
mitigating transliteration noise, and optimizing the relative contribution of text and math components.
While evaluated primarily on the FIRE CLMIR dataset, the proposed hybrid framework can be
readily extended to other multilingual math corpora (e.g., English–Tamil or English–Bengali), given that
the underlying components (transliteration, semantic encoding, and symbolic reasoning) are
language-agnostic. This positions the system as a flexible foundation for broader multilingual scientific
information retrieval. Future research directions involve fine-tuning transformer embeddings on math-rich
multilingual corpora, developing richer formula similarity measures that integrate structural and
numerical evaluations, and employing learning-to-rank strategies to adaptively weight hybrid scores.
Exploring query expansion, ensemble retrieval, and external knowledge integration can further improve
robustness and scalability in multilingual mathematical information retrieval.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
      <p>[17] G. Salton, A. Wong, C.-S. Yang, A vector space model for automatic indexing, Communications
of the ACM 18 (1975) 613–620.</p>
      <p>[18] Y. Lv, C. Zhai, Positional language models for information retrieval, in: Proceedings of the 34th
International ACM SIGIR Conference, 2011, pp. 299–306.</p>
      <p>[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of NAACL-HLT, 2019.</p>
      <p>[20] A. Conneau, et al., Unsupervised cross-lingual representation learning at scale, in: Proceedings
of ACL, 2020.</p>
      <p>[21] D. Kakwani, et al., IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained
multilingual language models for Indian languages, in: Proceedings of EMNLP, 2020.</p>
      <p>[22] S. Khanuja, et al., MuRIL: Multilingual representations for Indian languages, in: Findings of
EMNLP, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <article-title>Roles of math search in mathematics</article-title>
          ,
          <source>in: Proceedings of the Symposium on Computer Algebra and Scientific Computing</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blostein</surname>
          </string-name>
          ,
          <article-title>Recognition and retrieval of mathematical expressions</article-title>
          ,
          <source>in: International Conference on Document Analysis and Recognition</source>
          , IEEE,
          <year>2012</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mihalinec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <article-title>Mathsearch: A search engine for mathematical content</article-title>
          ,
          <source>in: Proceedings of the International Conference on MathML and Technologies for Math on the Web</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liska</surname>
          </string-name>
          ,
          <article-title>Indexing and searching mathematics in digital libraries</article-title>
          ,
          <source>Mathematics in Computer Science</source>
          <volume>5</volume>
          (
          <year>2011</year>
          )
          <fpage>227</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <article-title>Methods of relevance ranking and hit-content generation in math search</article-title>
          ,
          <source>in: Proceedings of CICM</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.-D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>An approach to searching mathematical content in Vietnamese</article-title>
          ,
          <source>in: Asian Conference on Intelligent Information and Database Systems</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          , et al.,
          <article-title>NTCIR-11 Math-2 task overview</article-title>
          ,
          <source>in: Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies</source>
          , National Institute of Informatics,
          <year>2014</year>
          , pp.
          <fpage>88</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          , et al.,
          <article-title>NTCIR-12 MathIR task overview</article-title>
          ,
          <source>in: Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>The math retrieval system of ICST for NTCIR-12 MathIR task</article-title>
          ,
          <source>in: Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies</source>
          , National Institute of Informatics,
          <year>2016</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <article-title>Overview of the ARQMath 2020 competition on answer retrieval for questions on math</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2020)</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>193</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-58219-7_12</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Youssef</surname>
          </string-name>
          ,
          <article-title>Overview of ARQMath task 1 and 2: Answer retrieval and formula retrieval</article-title>
          ,
          <source>in: Working Notes of CLEF 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          , et al.,
          <article-title>Tangent-CFT: An improved search engine for mathematical formulae</article-title>
          ,
          <source>in: Proceedings of the 39th International ACM SIGIR Conference</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schellenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Tangent-L: An embedding model for mathematical formulas</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1579</fpage>
          -
          <lpage>1583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Mathematical formula retrieval with structure-aware deep neural networks</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2549</fpage>
          -
          <lpage>2560</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of EMNLP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          , et al.,
          <article-title>Cross-lingual language model pretraining</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>