<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ARQMath Track: Applying Substructure Search and BM25 on Operator Tree Path Tokens</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Zhong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinyu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ji Xin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Zanibbi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>David R. Cheriton School of Computer Science, University of Waterloo</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Rochester Institute of Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>This paper reports on the substructure-aware math search system Approach Zero as applied in our submission to the ARQMath lab at CLEF 2021. We participated in both Task 1 (CQA) and Task 2 (formula retrieval) this year. In addition to substructure retrieval, we added a traditional full-text search pass based on the Anserini toolkit [1]. We use the same path features extracted from the Operator Tree (OPT) to index and retrieve math formulas in Anserini, and we interpolate Anserini results with structural results from Approach Zero. Automatic and table-based keyword expansion methods for math formulas have also been explored. Additionally, we report preliminary results from using previous years' labels and applying learning to rank to our first-stage search results. In this lab, we obtain the most effective search results in Task 2 (formula retrieval) among submissions from 7 participants including the baseline system. Our experiments also show a great improvement over the baseline result we produced in the previous year.</p>
      </abstract>
      <kwd-group>
        <kwd>Math Information Retrieval</kwd>
        <kwd>Math-aware search</kwd>
        <kwd>Math formula search</kwd>
        <kwd>Community Question Answering (CQA)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ARQMath lab is built on a collection of Math Stack Exchange (MSE) posts containing
over 17 million math formulas or notations. The data collection covers MSE threads from 2010 to
2018, and task topics are selected from MSE questions of 2019 (for ARQMath-2020 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and 2020
(for ARQMath-2021 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). A main task (CQA Task, or Task 1) and a secondary formula retrieval
task (Task 2) are included in this lab. Participants can leverage math notations together
with their (text) context to retrieve relevant answer posts. For Task 1, complete answers are
available for applying full-text retrieval, but participants are also allowed to utilize structured
formulas in the documents. On the other hand, formula retrieval in Task 2 is about identifying
formulas in the collection that are similar to a formula in the topic question. The formula retrieval task
specifies a query formula with its question post, and optionally, participants may use contextual
information around the topic formula in the question post. Both tasks ask participants to
return up to five runs (one primary run and four alternative runs) that contain relevant answer
posts for the given question topic. Relevance judgments are collected for primary runs
and for selected results of alternative runs from the submission pool. Official evaluation metrics
include NDCG’ [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], MAP’, and P’@10, where MAP’ and P’@10 use H+M binarization (hits with
relevance score ≥ 2 are considered relevant, and relevance levels are collapsed into binary).
NDCG’, MAP’, and P’@K are identical to their corresponding standard measures except
that unjudged hits are removed before metric computation. Relevance is scored on a graded
scale, from 0 (irrelevant) to 3 (highly relevant).
      </p>
      </p>
      <p>
        We submitted 5 runs for both tasks. Our system for this ARQMath lab is based on the
structure-aware search system Approach Zero [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and the full-text retrieval system Anserini [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We
adopted a two-pass search architecture for most of our submitted runs. In the Approach Zero
pass, a substructure matching approach is taken to assess formula similarity: the largest
common subexpression between formulas is obtained, and we use this maximum matched subtree to
compute their structure similarity. Symbol similarity is further calculated with awareness of
symbol substitutions in math formulas. The similarity metric used by Approach Zero is easily
interpretable, and it may better serve the needs of identifying highly structured mathematical
formulas.
      </p>
      <p>
        As illustrated by Mansouri et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in an example query result (see Table 1), substructure
matching and variable name substitution are desired for identifying highly relevant math
formulas. Usually, this can be more easily achieved using tree-based substructure search than
using full-text search. However, searching math formulas also requires more "fuzzy" matching
or high-level semantics. In this case, embedding formulas or matching bag-of-words tokens
using traditional text retrieval methods (but with careful feature selection) has been shown to be
effective as well [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. For example, the Tangent-CFT system is able to find a document formula
(not shown in Table 1) for the example query by considering semantic
information, but this formula is hard to identify with a substructure search engine because it
shares few common structure features with the query Operator Tree (OPT).
      </p>
      <p>In our submission this year, we try to compensate for strict substructure matching by introducing
a separate pass that performs simple token matching on Operator Tree path tokens. Specifically,
we include the full-text search engine Anserini to boost the results of Approach Zero. In the
Anserini pass, we use feature tokens extracted from a formula as terms and directly apply
full-text retrieval by treating those tokens as normal text terms. The difference between our
Anserini pass and other existing bag-of-words math retrieval systems is that we use leaf-root
path prefixes from the formula's Operator Tree representation (see Figure 1). This is
the same representation we use to carry structural information for formulas in Approach Zero,
but the latter additionally performs substructure matching and variable name substitution in
math formulas.</p>
      <p>
        We further try to improve our system's recall by applying query expansion on both text and
math query keywords. We investigate RM3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (for both text and math keywords) and a method
using a lookup table to extract math keywords from formulas. In addition, we report the results
from using previous years’ labels and applying learning-to-rank methods.
      </p>
      <p>Our main objectives of experiments for this lab are as follows.</p>
      <p>• Evaluate the effectiveness of treating OPT leaf-root path features as query/index terms.
• Try different ways to combine results from structure search and the traditional bag-of-words
matching paradigm, and evaluate the effectiveness of query expansion involving math
formulas.
• Apply learning-to-rank methods to the ARQMath dataset using previous years’ labels and
post metadata, and determine their usefulness.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formula Retrieval</title>
      <sec id="sec-2-1">
        <title>2.1. Representation</title>
        <p>In this lab, we adopt an enhanced version of the Operator Tree (OPT) representation, trying to
improve math formula retrieval recall. As illustrated by an example formula topic in Figure 1,
this representation contains more nodes than a typical OPT in the following ways:
• Always placing an add node on top of a term; this allows matching a path from a single
term to another path from a multi-term expression.
• Having an additional sign node (i.e., + and -) on top of each term. Both signs are tokenized into
the same token such that a math term can match another even with a different sign. The sign still
changes the path fingerprint (see Section 2.2), so that we have the information to penalize
those paths of different signs.
• For any variable, placing a subsup node (optionally with an additional base node) on top
of the variable node, even if it comes without a subscript/superscript. This helps to increase
recall for cases where subscripted and non-subscripted variables are both commonly used
to denote the same math entity. Notice that this rule is not applied
to constants, as they are not usually subscripted in math notation.</p>
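        <p>The node-insertion rules above can be sketched with a toy Operator Tree (the tuple representation and helper names here are illustrative, not the actual Approach Zero data structures):</p>
        <preformat>
```python
# Toy Operator Tree node: (label, children). These helpers sketch the
# enhancement rules described above; names are illustrative only.

def enhance_term(term):
    """Wrap a term with sign and add nodes: x becomes add(+(x))."""
    return ("add", [("+", [term])])

def enhance_variable(var_name):
    """Give every variable a subsup (and base) parent, even without scripts."""
    return ("subsup", [("base", [(var_name, [])])])

# Example: the single variable n becomes add, +, subsup, base, n from the
# root down, so its leaf-root path can match the same variable inside n + 1.
tree = enhance_term(enhance_variable("n"))

def leaf_root_paths(node, suffix=()):
    """Collect leaf-to-root label paths from an OPT."""
    label, children = node
    if not children:
        return [(label,) + suffix]
    paths = []
    for child in children:
        paths.extend(leaf_root_paths(child, (label,) + suffix))
    return paths

print(leaf_root_paths(tree))  # [('n', 'base', 'subsup', '+', 'add')]
```
        </preformat>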
        <p>(Figure 1: the enhanced Operator Tree for an example formula topic, with node labels such as equal, add, sign, subsup, and base.)</p>
        <p>When being indexed, a formula OPT is broken down into linear leaf-root paths, so
that they can be treated as normal “terms” in an inverted index. Different
leaf-root paths may end up as the same path token after tokenization, e.g., the paths
U/base/subsup/+/add/equal and n/base/subsup/+/add/equal will both result in the
token path VAR/BASE/SUBSUP/SIGN/ADD/EQUAL, since both U and n are tokenized to the variable
token VAR (a capitalized name indicates it is tokenized). The purpose of tokenizing every node
in the path is to improve recall, such that we can find identical equations with different symbol
sets, as is frequently the case in math notation.</p>
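        <p>The tokenization described above can be sketched as follows (the exact rules for classifying node labels are simplified assumptions; only the example from the text is guaranteed):</p>
        <preformat>
```python
# Sketch of path tokenization: variables collapse to VAR, signs (+, -) to
# SIGN, and other node labels are uppercased. Rules are simplified/assumed.

def tokenize_node(label):
    if label in ("+", "-"):
        return "SIGN"
    if len(label) == 1 and label.isalpha():  # single letters act as variables
        return "VAR"
    return label.upper()

def tokenize_path(path):
    """Turn a leaf-root path string into its token path."""
    return "/".join(tokenize_node(n) for n in path.split("/"))

# Both U and n tokenize to VAR, so the two paths collide into one index term:
assert tokenize_path("U/base/subsup/+/add/equal") == "VAR/BASE/SUBSUP/SIGN/ADD/EQUAL"
assert tokenize_path("n/base/subsup/+/add/equal") == "VAR/BASE/SUBSUP/SIGN/ADD/EQUAL"
```
        </preformat>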
        <p>In addition to leaf-root path tokens, we also index the prefixes of those path tokens; this
is necessary to identify a subexpression within a formula in the document. For example, to
find a formula by querying only one of its subexpressions, all the possible
leaf-root path prefixes must also be indexed. To alleviate the cost, one may optionally prune
prefix paths which always occur together. For example, the */base path will always co-occur with
a */base/subsup path (the asterisk denotes any prefix), thus we can remove the former path to
reduce index size.</p>
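        <p>Prefix generation and the pruning rule above can be sketched in a few lines (an illustrative sketch; the real index builds prefixes from the OPT directly):</p>
        <preformat>
```python
# Indexing all leaf-root path prefixes lets a query for a subexpression hit
# paths of an enclosing formula. Illustrative sketch only.

def path_prefixes(token_path):
    """All prefixes of a leaf-root token path, from the leaf upward."""
    nodes = token_path.split("/")
    return ["/".join(nodes[:i + 1]) for i in range(len(nodes))]

prefixes = path_prefixes("VAR/BASE/SUBSUP/SIGN/ADD/EQUAL")
# The pruning rule from the text: VAR/BASE always co-occurs with
# VAR/BASE/SUBSUP, so the shorter one can be dropped to shrink the index.
pruned = [p for p in prefixes if p != "VAR/BASE"]
print(len(prefixes), len(pruned))  # 6 5
```
        </preformat>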
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Path Symbol Weight</title>
        <p>
          Tokenization on paths boosts recall for formulas; however, we still need the original path information
to break ties when tokenized paths match, e.g., to distinguish a term that is &lt; 0 from one that is ≤ 0.
To address this issue, in this task, we apply a 3-level similarity weight for path-wise matching. More specifically, we
use the original operator symbols along the token path to generate a hash value for each path
by computing the Fowler-Noll-Vo hash [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] from the leaf node up to 4 nodes above, and we call
this hash value the fingerprint of a path. The fingerprint captures the local symbolic appearance
of operators on a path; it can be used to differentiate formulas of the same structure but with
different math operator(s).
        </p>
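        <p>A sketch of the fingerprint computation (we use the 32-bit FNV-1a variant here; the exact variant and path serialization used by Approach Zero are assumptions):</p>
        <preformat>
```python
# Path fingerprinting with the Fowler-Noll-Vo hash (FNV-1a, 32-bit here).
# Only the leaf plus up to 4 nodes above contribute, as described in the text.

FNV_PRIME, FNV_OFFSET = 16777619, 2166136261

def fnv1a(data):
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) % 2**32
    return h

def fingerprint(path_symbols, depth=5):
    """Hash the original symbols from the leaf node up to 4 nodes above."""
    return fnv1a("/".join(path_symbols[:depth]).encode())

# Same token path but different operators yield different fingerprints, so
# the match can be down-weighted (e.g., "less than" vs. "leq"):
print(fingerprint(["x", "lt", "rel"]) != fingerprint(["x", "leq", "rel"]))  # True
```
        </preformat>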
        <p>Upon scoring the match between a pair of paths, we compare their leaf node symbols
as well as their fingerprint values. We assign the highest path-match weight if both values agree
between the two paths, a medium weight if the leaf symbols match but not the fingerprints, and a
lower path-match score otherwise. A weighted sum of matched paths represents the symbol
similarity in our model.</p>
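        <p>The 3-level weighting can be sketched as below (the weight values 1.0/0.5/0.2 are placeholders; the paper does not state the actual constants):</p>
        <preformat>
```python
# The 3-level path match weight described above; weights are placeholders.

def path_match_weight(q_path, d_path):
    """q_path/d_path: (leaf_symbol, fingerprint) pairs for a matched token path."""
    leaf_match = q_path[0] == d_path[0]
    fp_match = q_path[1] == d_path[1]
    if leaf_match and fp_match:
        return 1.0   # exact symbols along the path
    if leaf_match:
        return 0.5   # same leaf operand, different operators nearby
    return 0.2       # only the tokenized path matched

# Symbol similarity is then a weighted sum over all matched paths:
def symbol_similarity(matched_pairs):
    return sum(path_match_weight(q, d) for q, d in matched_pairs)
```
        </preformat>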
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Structure-Based Scoring</title>
        <p>
          The Approach Zero formula search system takes a tree-based matching approach, and specialized
query optimization is applied to match formula substructures during the very first stage of
retrieval [
          <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
          ]. The benefit of substructure matching is that the formula similarity score is
well-defined and can be interpreted easily. In our case, the structure similarity is previously
defined as the number of paths in the maximum matched common subtree [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In this task,
we acknowledge the different contributions from paths and apply an IDF weight to a matched
formula, defined as the sum of the individual path IDFs:
        </p>
        <p>IDF(T̂) = ∑_{p ∈ T̂} log(N / n_p) (1)</p>
        <p>where p is a path in the largest common subtree T̂ of the query and document formulas, n_p
is the document frequency of path p, and N is the total number of paths in the collection.</p>
        <p>We also incorporate a symbol similarity score α(q, d) to further differentiate formulas
with identical structure but different symbols. This score is only computed in the second stage,
when the structure similarity score has been computed and the hit can possibly make it into the top-K results.
Specifically, we penalize the symbol similarity by the length of the document formula d:</p>
        <p>SF(q, d) = (1 / (1 + (1 − α(q, d))²)) · ((1 − η) + η / log(1 + |d|)) (2)</p>
        <p>where the length penalty is determined by the parameter η.</p>
        <p>Given structure similarity and symbol similarity, we adopt the following formula to compute
the overall similarity for a math formula match:</p>
        <p>
          Similarity(q, d) = SF(q, d) · IDF(T̂) (3)
whereas for normal text terms in the query, we compute their scores using the BM25+ scoring schema [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
The final score for this pass is then accumulated over math and text keywords.
        </p>
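        <p>A sketch of this scoring pipeline follows. The form of the length penalty and the placement of η are our reading of the (partially garbled) equations, so treat the constants as assumptions rather than the exact Approach Zero formulas:</p>
        <preformat>
```python
import math

# Sketch of structure + symbol scoring for a matched formula. The eta
# parameter and the penalty form are assumptions based on the section text.

def subtree_idf(matched_paths, df, total_paths):
    """Sum of per-path IDFs over the largest common subtree."""
    return sum(math.log(total_paths / df[p]) for p in matched_paths)

def symbol_factor(alpha, doc_formula_len, eta=0.5):
    """Symbol similarity alpha, penalized by document formula length."""
    penalty = (1 - eta) + eta / math.log(1 + doc_formula_len)
    return penalty / (1 + (1 - alpha) ** 2)

def formula_similarity(matched_paths, df, total_paths, alpha, doc_len):
    """Overall similarity of a formula match: symbol factor times subtree IDF."""
    return symbol_factor(alpha, doc_len) * subtree_idf(matched_paths, df, total_paths)

# Toy example: two matched paths with different document frequencies.
df = {"VAR/BASE/SUBSUP": 10, "VAR/BASE/SUBSUP/SIGN/ADD": 2}
score = formula_similarity(df.keys(), df, total_paths=1000, alpha=1.0, doc_len=5)
```
        </preformat>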
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Text-Based Scoring</title>
        <p>On the other hand, we also add a separate pass in parallel to score document formulas by a "bag
of paths" model (without applying substructure matching). The path set for text-based scoring includes
two copies of each path, one with the original leaf symbol and another with a tokenized leaf symbol;
however, both types of paths apply tokenization to operator nodes (see Figure 2 for an example).
Including the original leaf symbol rewards exact operand matches, and the fully tokenized leaf
paths are included to boost recall and enable us to match expressions with different operand
symbols.</p>
        <p>(Figure 2: tokenized prefix path terms generated for an example formula, e.g., _VAR_BASE_SUBSUP_TIMES_SIGN_ADD and _normal__n___BASE_SUBSUP_TIMES.)</p>
        <p>In the Anserini pass, we use BM25 (with lossless document length) for scoring both text and formula paths, specifically</p>
        <p>score(q, d) = ∑_{t ∈ q} log(1 + (N − n_t + 0.5) / (n_t + 0.5)) · tf_{t,d} · (k1 + 1) / (tf_{t,d} + k1 · (1 − b + b · |d| / avgdl)) (4)</p>
        <p>where k1 and b are parameters, and N, n_t, tf_{t,d}, |d|, and avgdl refer to the total number of documents,
the document frequency of the term t, the term frequency of term t in the document d, the
length of document d, and the average document length, respectively.</p>
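        <p>The BM25 scoring in this pass can be sketched over a toy in-memory index (this is not Anserini's implementation, just the ranking function applied to path tokens):</p>
        <preformat>
```python
import math

# Minimal BM25 over path/text tokens, mirroring the scoring formula above.
# Toy index: each document is a list of tokens; not Anserini itself.

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    score = 0.0
    for t in query_terms:
        if t not in doc:
            continue
        tf = doc.count(t)
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

docs = [["VAR/BASE", "VAR/BASE/SUBSUP"], ["VAR/BASE"], ["SIGN/ADD"]]
print(bm25(["VAR/BASE/SUBSUP"], docs[0], docs))  # positive score
```
        </preformat>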
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Math-Aware Retrieval</title>
      <sec id="sec-3-1">
        <title>3.1. Query Expansion</title>
        <p>In the CQA Task, we need to return full-text answer posts as search results. In addition to
simply merging results from formula and text retrieval independently, we have identified a few
techniques to help map information from one type to another:
• To make use of information in formulas, we map tokens in LaTeX to text terms so that
formula-centered document posts can also be found by querying text keywords.
• To utilize the context information in answer posts, we explore query expansion (covering
both math and text) based on pseudo relevance feedback, adding potentially relevant
keywords from both math and text context.</p>
        <p>In the following sections, we explore two query expansion methods.</p>
        <p>3.1.1. Math Keyword Expansion</p>
        <p>
          For the purpose of mapping tokens in LaTeX to text, we designed manual rules to convert
a set of LaTeX math-mode commands to text terms. For example, we will expand the text term “sine”
into the query if a \sin command occurs in the formula's LaTeX markup. Furthermore, Greek-letter
commands in LaTeX are also translated into plain text, e.g., \alpha will be mapped to the term
“alpha”. A specialized LaTeX lexer from our PyA0 package [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is used to scan and extract tokens
from the markup. The list of math keyword expansion mappings used in this task is
enumerated in Appendix D.
        </p>
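        <p>A sketch of table-based expansion using a simple regex scan (the real system uses the PyA0 LaTeX lexer; the table entries here are illustrative, the full mapping being in Appendix D):</p>
        <preformat>
```python
import re

# Table-based math keyword expansion: scan LaTeX for commands and map them
# to text terms. Entries are illustrative; the full table is in Appendix D.

EXPANSION_TABLE = {
    r"\sin": "sine",
    r"\cos": "cosine",
    r"\alpha": "alpha",
    r"\int": "integral",
}

def expand_math_keywords(latex):
    """Return text terms to add to the query for a piece of LaTeX markup."""
    commands = re.findall(r"\\[A-Za-z]+", latex)
    return sorted({EXPANSION_TABLE[c] for c in commands if c in EXPANSION_TABLE})

print(expand_math_keywords(r"\int_0^1 \sin(\alpha x)\,dx"))
# ['alpha', 'integral', 'sine']
```
        </preformat>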
        <p>In order to find more formulas by querying math tokens, we not only expand keywords in the
query, but also apply math keyword expansion to all document formulas for the CQA Task.</p>
        <p>3.1.2. RM3 Query Expansion for Mixed Types of Keywords</p>
        <p>
          In addition to math keyword expansion, we apply RM3 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to expand a larger range of possibly
relevant terms or formulas from the initially retrieved documents. Based on the relevance model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ],
RM3 optimizes the expansion by closing the difference between the document language model
P(w|d) and the query relevance model P(w|R), where the random variable w represents a generated
word, and d and R are the document and the relevant document respectively.
        </p>
        <p>The objective is then reflected as the negative KL divergence</p>
        <p>−KL(R ‖ d) ∝ ∑_w P(w|R) log P(w|d) (5)</p>
        <p>RM3 with the pseudo relevance assumption utilizes the top retrieved results C to approximate the above
objective. P(w|R) is estimated by the normalized expanded query probability, i.e., P(w, q_1, ..., q_k)/Z,
where q_i are the existing query keywords and Z is a normalizing constant. It can be further associated
with the query likelihood ∏_i P(q_i|d) as shown below:</p>
        <p>P(w|R) ≈ ∑_{d ∈ C} P(d) P(w, q_1, ..., q_k | d) / Z = ∑_{d ∈ C} P(d) P(w|d) ∏_i P(q_i|d) / Z (6)</p>
        <p>The query likelihood can be approximated by another appropriate scoring function: in this lab,
we use BM25 for scoring in the Anserini pass, and the scoring functions stated in Section 2.3 in the
Approach Zero pass. To apply RM3 to math formulas, we treat math markup the same as
text keywords.</p>
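        <p>The RM3 estimate with interpolation can be sketched as follows (a simplified sketch assuming score-weighted term frequencies approximate P(w|d)·P(d); the real system plugs in BM25 or Approach Zero scores):</p>
        <preformat>
```python
from collections import Counter

# RM3 sketch with the pseudo-relevance assumption: estimate P(w|R) from the
# top retrieved documents weighted by retrieval score, then interpolate with
# the original query model (lambda = 0.5 as in the text).

def rm3_expand(query_terms, top_docs, top_scores, n_terms=5, lam=0.5):
    # P(w|R) approximated by score-weighted term frequencies in the feedback set
    rel_model = Counter()
    for doc, score in zip(top_docs, top_scores):
        for w, tf in Counter(doc).items():
            rel_model[w] += score * tf / len(doc)
    total = sum(rel_model.values())
    rel_model = {w: v / total for w, v in rel_model.items()}
    # interpolate with the maximum likelihood query model
    q_model = {w: 1 / len(query_terms) for w in query_terms}
    vocab = set(q_model) | set(rel_model)
    mixed = {w: lam * q_model.get(w, 0.0) + (1 - lam) * rel_model.get(w, 0.0)
             for w in vocab}
    return sorted(mixed, key=mixed.get, reverse=True)[:n_terms]
```
        </preformat>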
        <p>After estimating the query relevance model from Eq. 6, we further perform an interpolation
(using an even ratio λ = 0.5) with the maximum likelihood estimate of the existing query keywords in
order to improve model stability. We use the top query keywords from our estimate of P(w|R) to
query them again in a cascaded way. Our parameters for RM3 include the number of top-retrieved
results used for estimation, and the number of top query keywords selected for
querying in the second round.</p>
        <p>3.2. Learning to Rank ARQMath Answers</p>
        <p>We make the assumption that most answer posts are relevant to their linked question post, thus
we pair all answer posts with their question posts in the index. To eliminate the consequences
of retrieving low-quality answers (i.e., answers irrelevant to their linked question), we apply
learning-to-rank techniques using features such as the number of upvotes for an answer.</p>
        <p>Two learning-to-rank methods have been explored, i.e., linear regression and LambdaMART [15].
LambdaMART works by minimizing the cross entropy between the pair-wise odds ratios of perfect
and actual results; it is efficient, and it can be regarded as a list-wise learning-to-rank method
even though it only requires sampling adjacent pairs. Furthermore, it can accumulate the “λ” for
each document before updating parameters; λ serves as a nice symmetric connection for the gradient
of the cross entropy w.r.t. model parameters. By default, LambdaMART is commonly set up to optimize
NDCG measures by multiplying the measurement gain directly into the pair-wise λ [16], where</p>
        <p>λ_ij = −σ · |ΔNDCG| / (1 + e^{σ(s_i − s_j)}) (7)</p>
        <p>Here σ is a parameter that determines the shape of the sigmoid for the probability P_ij that document i
is ranked higher than document j, and s_i, s_j are the current model scores of the two documents.</p>
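        <p>The pair-wise λ can be computed directly (our reconstruction of the standard LambdaRank gradient; the i-th document is assumed more relevant than the j-th):</p>
        <preformat>
```python
import math

# Pair-wise lambda in its standard LambdaRank form (our reconstruction).
# s_i, s_j are current model scores; delta_ndcg is the NDCG change from
# swapping the pair; sigma shapes the sigmoid.

def lambda_ij(s_i, s_j, delta_ndcg, sigma=1.0):
    return -sigma * abs(delta_ndcg) / (1 + math.exp(sigma * (s_i - s_j)))

# When the model already ranks i far above j, the gradient vanishes;
# when it badly misranks the pair, the gradient approaches -sigma*|delta|.
print(lambda_ij(10.0, 0.0, 0.3))  # close to 0
print(lambda_ij(0.0, 10.0, 0.3))  # close to -0.3
```
        </preformat>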
        <p>The following factors are considered to rerank answer posts:
• Votes (the number of upvotes): presumably a direct indicator of an answer post's
relevance to its question.
• Similarity: the first-phase score for each ranked result (which may be an interpolated
result from two separate passes, see the discussion in Sections 2.3 and 2.4).
• Tags: the number of tags matched between the topic question and the linked question of the
document. In the ARQMath lab, each question may have several “tags” attached to
indicate the question's scope in math terms. Tags are manually labeled by MSE users above
a reputation bar, and they can be a good abstraction of a Q&amp;A thread.</p>
        <p>These features are similar to the features proposed by Ng et al. [17];
however, assessed data was not available to them at that time, and they had to mock relevance
assessments using indirect indicators (e.g., a thread being marked as duplicate by users). Our
experiments are based on direct relevance judgments, which are more reliable, accurate,
and less complicated. Furthermore, we also explore another learning-to-rank method using
LambdaMART.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>The official results of our system compared to the systems with the best results are summarized in
Tables 9 and 10. Systems noted “L" in the tables apply learning to rank using
existing labels (for ARQMath-1, they are trained separately from the 45 test topics in Task 2).
Systems noted “M" use keywords manually extracted from the topic post for querying (the complete
set of our manual queries can be found at https://github.com/approach0/pya0/tree/arqmath2/
topics-and-qrels; for 2021 topics, we use the “refined” version as indicated by our file names), while the
model execution is still automatic. The same subscripted M letter indicates the same set of
topics. Notice that the TU_DBS team's system uses only text information (noted as “T”) for retrieval
in Task 1. In addition, although Task 2 asks to submit at most 5 results for each visually unique
formula, our index with a limited number of visually unique formulas was not available in time,
thus our official runs for Task 2 may contain extra results per unique formula (those are marked
“F"), and this may affect the comparison with other systems (although the official evaluation will
remove those extra results, they still hold places in the returned search results).</p>
      <p>In our runs, a base letter (such as “P”, “A” etc.) indicates the set of parameters we have applied
in the Approach Zero system. Table 8 in the Appendix shows the detailed parameters for different base
settings. The math path weight is the weight associated with a path in the matched common subtree; it is
used to adjust the importance of its contribution relative to text keyword matches. The parameter η
shown in Eq. 2 is the penalty applied to over-length formulas, and BM25+ is the scoring used
for normal text search in the Approach Zero pass.</p>
      <p>Additionally, we append a number to the base letter in our run names to indicate the way it
combines results with the text-based system Anserini; details can be found in Sections 4.5 and 4.6.</p>
      <sec id="sec-4-1">
        <title>4.1. Task-1 Submission</title>
        <p>In the CQA Task, we adopt the Lancaster stemmer, an aggressive stemmer that is able to canonicalize
more math keywords; e.g., summation will be stemmed to sum, whereas other stemmers such as
Porter and Snowball will only convert it to summat.</p>
        <p>Our Task-1 results are not quite competitive; moreover, we observe that the text-only retrieval
system from the TU_DBS team achieves better results than ours in Task 1. This implies that
text retrieval alone plays a crucial role in Task 1 effectiveness, and a potential
gain is anticipated if our text retrieval, and the way it combines with math search, can be further
improved.</p>
        <p>In our post experiments, we generate reranked runs for Task 1 by applying linear regression
and LambdaMART (trees = 7, depth = 5) directly on the Approach Zero pass (see Section 4.7),
trained on all Task-1 judgments from the previous year. After applying learning to rank, our
post-experiment result is on par with the most effective systems in terms of P@10.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task-2 Submission</title>
        <p>In Task 2, two runs, C30 and B30, using different base parameters achieve numerically identical
scores; they are collapsed into one row in the table. We have also generated a similar set of 5
runs for Task 2 (marked with an asterisk on their run names) but with up to 5 results for each
visually unique formula. We have further corrected an oversight in our Anserini pass which
affects tree path token generation. It turns out our results can be further improved.</p>
        <p>Without using any existing labels for training in our official submission, we obtain the best
effectiveness across all metrics in the Formula Retrieval Task on this year's data (ARQMath-2), and
according to the P’@10 metric, we achieve the highest precision at the very top results
in ARQMath-2 by returning results from Approach Zero alone (see run B* ). We attribute this
advantage to our structure-aware matching applied at the very first stage of the retrieval process.
Top-precision systems such as Tangent-S [18] introduce an alignment phase to find matched
substructures, and Tangent-CFTED performs tree edit distance calculation in the reranking stage.
These structure matching methods are too expensive to be applied in the first stage of retrieval.</p>
        <p>Apart from the above results, we have conducted a variety of experiments to achieve the
objectives listed in Section 1. Although we select only some of the best-performing runs for submission, we
have made the following attempts in this paper to address those objectives:
• Explore the traditional IR architecture and bag-of-words tokens using Anserini without
applying substructure matching, and evaluate the two systems using the same set of
features extracted from the OPT.
• Combine the text-based approach with the tree-based approach using score interpolation as well
as search result concatenation, and try a deeper integration that translates one type
of token to another using RM3 and math keyword expansion.
• Apply learning to rank with ground truth labels from the previous year using linear
regression and LambdaMART, and evaluate their effectiveness.</p>
        <p>All of our experiments in the following sections use previous-year topics for evaluation,
since the judgment data of this year was not available at the time of writing.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Employed Machine Configuration</title>
        <p>Our experiments run on a server machine with the following hardware configuration: Intel(R)
Xeon(R) CPU E5-2699 @ 2.20GHz (88 cores), 1 TB of DIMM Synchronous 2400 MHz memory, and
an HDD partitioned with ZFS.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Runtimes and Eficiency</title>
        <p>For measuring runtime statistics, both systems query tokens extracted from the same
representation in a similar way (i.e., by extracting prefix leaf-root paths from the Operator Tree
representation). Our index contains over 23 million formulas, over 17 million of which are
structured formulas (not single-letter math notations). The statistics of our path tokens per
topic are (143.5, 102.5, 110, 400) in (avg, std, med, max).</p>
        <p>
          Table 2 reports the query execution times of the two passes separately. Compared to our
previously published results [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the system we use in this task also compresses the on-disk math index,
a technical improvement that increases system efficiency. However, our
query execution times are unable to match Anserini, which only performs matching at the token
level without aligning substructures.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Text and Math Interpolation</title>
        <p>In our substructure search, a uniform weight is associated with each path for scoring; we first
investigate how this weight affects overall retrieval effectiveness. We fix the BM25+ parameters
(b, k1) in the Anserini pass to (0.75, 1.2), (0.75, 1.5) and (0.75, 2), and change the math path weight from
1 to 3.5 with a step of 0.5. The evaluation is conducted on the CQA Task since this task requires a
trade-off between text terms and math formulas.</p>
        <p>
          As seen in Figure 3, measures under different BM25 parameters follow a similar trend with
respect to the math path weight. As the path weight grows, NDCG’ degrades consistently; this
aligns with the MathDowsers runs [17], as they observe the best performance when the “formula weight”
is almost minimal (≈ 0.1). However, the other measures reach higher points when math paths
are weighted more than text terms, but they tend to be unstable. We believe this is because MAP’
and BPref changes are very minor in this evaluation, so they have a greater chance to fluctuate.
Also, NDCG’ has been shown to be generally more reliable [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] than other measures used for incomplete
assessment data.
        </p>
        <p>
          Then we investigated combining bag-of-words tokens from Anserini,
fixing the BM25 parameters to (0.4, 0.9) in the Anserini pass. We first adopt a λ = 0.5 linear interpolation
ratio after normalizing scores to [0, 1], and then merge results from Approach Zero and Anserini
in the second stage. The interpolation is expressed by</p>
        <p>final score = λ · S_a0 + (1 − λ) · S_ans (8)</p>
        <p>where S_a0 and S_ans are the scores generated by Approach Zero (fixing the math path weight to 1.5)
and Anserini respectively.
        </p>
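        <p>The normalization and interpolation step can be sketched over run files represented as {docid: score} maps (a simplified sketch; run reading and tie handling are omitted):</p>
        <preformat>
```python
# Score interpolation after min-max normalizing each pass to the unit range.

def normalize(run):
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in run.items()}

def interpolate(run_a0, run_anserini, lam=0.5):
    a, b = normalize(run_a0), normalize(run_anserini)
    docs = set(a) | set(b)
    fused = {d: lam * a.get(d, 0.0) + (1 - lam) * b.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = interpolate({"d1": 3.0, "d2": 1.0}, {"d2": 9.0, "d1": 5.0, "d3": 1.0})
print([d for d, s in ranked])  # ['d1', 'd2', 'd3']
```
        </preformat>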
        <p>Three ways to combine with Anserini are examined: using text terms only, using
math paths only, and using both text terms and math paths (different types of tokens are treated
the same in the Anserini pass; all use BM25 scoring without substructure matching ability). For
comparison, we also list the results from each individual system.</p>
        <p>As shown in Figure 3, Anserini generally improves results when combined with Approach Zero.
A boost from text-only Anserini is expected, as Anserini alone achieves better results in
text search, and given that most of the keywords in queries and results are text terms, combining
Anserini can be beneficial. We notice that the path-only Anserini run also boosts scores, and
we believe this is because the path tokens used in Anserini add recall, whereas Approach Zero
using substructure matching is good at adding precision, so the two are complementary to each
other. However, text-only retrieval from Anserini contributes the most to structure-aware search
in Approach Zero.</p>
        <p>We are also interested in the combination effect in Task 2. Under our assumption that
structure-aware search tends to produce good precision at the top, while path tokens help
recall, we designed two ways to merge the math retrieval results from
Approach Zero and Anserini: (1) keep the top-K results from Approach Zero using structure search,
and fill the rest of the top-1000 by concatenating results from the Anserini pass; (2) uniformly apply
the score interpolation of Eq. 8, but with different ratios this time. Our official submissions are
named after these two conditions, i.e., a base run letter followed by the method used to merge
results. For example, A55 interpolates base run A with Anserini results using a ratio of 0.55,
and P300 uses the top-300 results from base run P and concatenates results from Anserini at the
lower ranks.</p>
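        <p>
          Merge method (1) can be sketched as follows (not the authors' code; the run and document names are hypothetical): keep the top-K structure-search results, then fill the remaining slots of the top-1000 with Anserini results not already present.
        </p>

```python
def concat_merge(approach0_ranked, anserini_ranked, k, depth=1000):
    """Keep top-k from Approach Zero, fill up to `depth` from Anserini."""
    merged = list(approach0_ranked[:k])
    seen = set(merged)
    for doc in anserini_ranked:
        if len(merged) >= depth:
            break
        if doc not in seen:
            merged.append(doc)
            seen.add(doc)
    return merged

# E.g. a P300-style merge would call concat_merge(p_run, anserini_run, k=300).
out = concat_merge(["f1", "f2", "f3"], ["f2", "f4", "f5"], k=2, depth=4)
```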
        <p>Figure 4 shows the evaluation summary for combining results from Approach Zero and
Anserini using the two different methods. Approach Zero uses the same base configuration “P";
we vary the interpolation ratio λ and the cut-off K to see how effectiveness is affected.</p>
        <p>We observe that the weighted merge of results with a ratio around 0.3 or 0.6 generally achieves higher
NDCG’ and MAP’. Concatenation of search results is slightly more effective in
this case; an almost even concatenation achieves the optimal NDCG’ and MAP’
scores. This indicates that the contribution from either system is essential for good
results in Task 2, and that they are most complementary when they contribute evenly.
On the other hand, the concatenation results justify our assumption that the top results
(i.e., the top 400 in this case) from Approach Zero are comparatively very effective.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Text and Math Expansion</title>
        <p>We use our best base runs (A, B and P) to test the effectiveness of math keyword expansion
(as described in Section 3.1.1). The experiment is conducted for Task 1 only, and we only
expand query keywords in the Approach Zero pass. Math keyword expansion is applied both at
index time and at query time to boost formula recall.</p>
        <p>Since only a small portion of math keywords can be expanded by our manual rules, and content
containing formulas is only part of the collection, we do not observe a large gain in
effectiveness. However, the gain in NDCG’ is consistent, and because the NDCG’ measure is
shown to be more stable than the other measures here (see Table 4), we still consider math
expansion beneficial in Task 1. Nevertheless, the rules used in math keyword expansion
have to be designed manually, and they may miss alternative synonyms and equivalent
terms for math tokens.</p>
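        <p>
          The manual rules amount to a token-to-keyword lookup. Below is a sketch under our own naming: the rule table abbreviates the Appendix mapping, and the LaTeX token spellings are assumptions.
        </p>

```python
# Abbreviated version of the Appendix mapping table; token spellings assumed.
EXPANSION_RULES = {
    "\\geq": ["inequality"],
    "\\leq": ["inequality"],
    "\\int": ["integral"],
    "\\sum": ["summation"],
    "\\frac": ["fraction"],
    "\\sqrt": ["root"],
    "\\partial": ["partial", "derivative"],
    "\\infty": ["infinity"],
    "!": ["factorial"],
}

def expand(query_terms, math_markup):
    """Add mapped text keywords (uniform weight) for tokens in the markup."""
    expanded = list(query_terms)
    for token, words in EXPANSION_RULES.items():
        if token in math_markup:
            expanded += [w for w in words if w not in expanded]
    return expanded

q = expand(["solve"], r"\int_0^\infty \frac{\sin x}{x} dx")
# adds "integral", "fraction", "infinity" to the text keywords
```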
        <p>
          We have noticed that the naive math keyword expansion applies uniform weights to keywords
after expansion. This has an important downside compared to query expansion methods such as
RM3 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which adjust the boost weight of new query terms according to the relevance model. For
example, many formula topics contain the “greater than or equal to" sign ≥; however, they are not
always relevant to, e.g., inequality, so expanding such terms with uniform weights is going to
hurt effectiveness. RM3, on the other hand, would assign a smaller weight in this case, because the
term “inequality” is unlikely to co-occur with ≥.
        </p>
        <p>In the following experiment, we also explore using RM3 to expand query keywords.
We simply treat math markup as terms so that it integrates into RM3 naively. RM3 has two
parameters in our implementation: the number of keywords k in the query after expansion,
and the number of top documents d sampled for the relevance model. We use the pair (k, d) to
uniquely determine an RM3 run.</p>
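        <p>
          A toy sketch of this RM3-style expansion (our own simplification, not the actual Anserini implementation): estimate a relevance model from the top-d feedback documents, keep the k heaviest terms, and mix the original query back in with a fixed weight.
        </p>

```python
from collections import Counter

def rm3_expand(query_terms, top_docs, k=20, d=10, orig_weight=0.5):
    """top_docs: ranked list of documents, each a list of terms."""
    model = Counter()
    for doc in top_docs[:d]:
        for term, tf in Counter(doc).items():
            model[term] += tf / len(doc)  # crude relevance-model estimate
    total = sum(model.values()) or 1.0
    expanded = {t: (1 - orig_weight) * w / total
                for t, w in model.most_common(k)}
    for t in query_terms:  # interpolate the original query, RM3-style
        expanded[t] = expanded.get(t, 0.0) + orig_weight / len(query_terms)
    return expanded

weights = rm3_expand(["prove"],
                     [["inequality", "proof", "inequality"], ["proof", "sum"]])
```

        <p>Under this scheme, a term such as “inequality” receives weight only in proportion to its presence in the feedback documents, which is exactly the behavior contrasted above with uniform-weight math expansion.</p>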
        <p>Two base runs are used in this experiment, P and C. As shown in Table 5, RM3 yields a good
improvement on base run C, greater than the gain
from math keyword expansion. However, this improvement is not consistent: on run
P, RM3 is actually harmful unless combined with math keyword expansion. Overall, the benefit
of query expansion is not notable, and our experiments show that introducing RM3 can
even be harmful under some initial settings. But because math keyword expansion helps consistently
across both experiments, we apply it to all of our submissions for Task 1.</p>
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Learning to Rank</title>
        <p>Since the previous year's judgments are available for the first time in this lab, we
want to study the effectiveness of reranking using these data. However, we did not have
learning-to-rank results tuned correctly at submission time, so these are reported
as post-experiment runs.</p>
        <p>Our experiment investigates two methods: simple linear regression and LambdaMART. We
take base run B as the baseline and rerank its results with these two models. The experiment is
conducted on Task 1 (because our features are mostly indicators of document-level similarity)
with 39,124 relevance samples from the previous year's judgment pool; we split the data into 8 folds
and validate model effectiveness by averaging measures across the test folds. Our feature set is
the number of upvotes, the number of tag matches, and the ranking score produced by
Approach Zero.</p>
        <p>As shown in Table 6, simple linear regression achieves a performance gain similar to
the LambdaMART model, presumably because of the limited data available in
ARQMath-1. The averaged coefficients of our 8-fold linear regression model after training are
[0.002, 0.109, 0.007] for upvotes, tag matches, and ranking score, respectively. Compared to the feature selection of Yin Ki Ng et al. [17], we
do not include user-wise metadata such as user reputation and upvote history. However,
similar to their findings, our experiment confirms that tag matching is a very important feature for
this task.</p>
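        <p>
          The resulting linear model is simple enough to state as code. Below is a sketch (the feature names are ours) that reranks candidates with the averaged coefficients reported above, [0.002, 0.109, 0.007] for upvotes, tag matches, and the Approach Zero score; the candidate values are hypothetical.
        </p>

```python
COEF = {"upvotes": 0.002, "tag_matches": 0.109, "az_score": 0.007}

def rerank(candidates):
    """candidates: list of (doc_id, features dict); returns ids, best first."""
    def ltr_score(feats):
        return sum(COEF[name] * feats.get(name, 0.0) for name in COEF)
    return [doc for doc, feats in
            sorted(candidates, key=lambda c: ltr_score(c[1]), reverse=True)]

# Hypothetical candidates: the tag-match feature dominates, as noted above.
order = rerank([
    ("a1", {"upvotes": 3, "tag_matches": 0, "az_score": 12.0}),
    ("a2", {"upvotes": 1, "tag_matches": 2, "az_score": 9.0}),
])
```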
      </sec>
      <sec id="sec-4-8">
        <title>4.8. Query Analysis and Case Study</title>
        <p>To understand what causes effectiveness changes across different methods, and to compare our
math retrieval with a state-of-the-art system, we performed a query-by-query case
analysis.</p>
        <p>We plot the NDCG’ scores per query for both Task 1 and Task 2 in Appendix E and F. Figure 5
compares Task 1 scores from diferent methods: A base run configuration (P), base run with math
keyword expansion (P-mexp), base run with RM3 = (20, 10), the same base run results merged
with Anserini system using math tokens (P-50-ans0409desc), and the same base run results
merged with Anserini system using text tokens (P50-ans0409title). Both merged results have a
merge ratio  = 0.5 and they use BM25 parameters (0.4, 0.9). Figure 6 consists of per-query
results for the formula retrieval task, and here we also compare the results to Tangent-S system.
Run P30-ans7515, P50-ans7515, and P300-ans7515 are diferent ways to merge results with
Anserini, they all use BM25 parameters (0.75, 0.15).
4.8.1. The Efect of Diferent Methods
First, combination with Anserini almost uniformly improves efectiveness, either by using text
tokens or math tokens in a bag-of-word model. In a few cases, the improvement from combining
bag-of-word math tokens is profound, e.g., for topic A.19 4 − 1 , A.68  + 1 and A.93
( − ) = ( − ) , when formulas should be matched entirely. However, in
1 1 1
cases like A.40 11 + 22 + 33 + ... +  = or A.83 1, 1 + , 1 + + , ... , math
2 2 3
bag-of-word tokens tend to sufer because these formulas require evaluating partial matches
more structurally in order to assess similarities.</p>
        <p>Second, adding text tokens alone can greatly improve results, because many formula
keywords are hurt by malformed or irregular formula markup. For example, topic A.32
uses text without a surrounding \text command, and A.55 contains Unicode in
the markup that our parser could not handle. Other formula keywords do not retrieve similar
formulas in the search results and may need to rely on the more informative text keywords,
notably A.80 and A.90. Similarly, less informative math formula keywords generally
benefit from query expansion. For example, in topic A.99, the formula keyword f : R → R adds the
expansion term “rational number”, which captures the semantics of this expression
even though few such structures occur in the indexed documents.</p>
        <p>Math keyword expansion notably boosts a few queries, but it can also hurt results, as
in A.26, where it adds the keyword “fraction” to the query because the topic contains a fraction inside
the integral ∫_0^∞ (sin x / x) dx, which is obviously more about “integral” than “fraction”. This confirms
our assumption that weight assignment for expanded query keywords is essential to
keep math keyword expansion generally beneficial. On the other hand, RM3 mostly yields mild
increases or decreases over the baseline, and the overall improvement is minor.</p>
        <p>In Figure 6, we compare against one of the most effective systems in Task 2, Tangent-S [18].
However, we notice there are queries for which we could not generate any results, mostly because our
semantic parser is unable to handle some formulas. For example, topic B.11 contains the following
formula, in which the parentheses do not pair correctly:</p>
        <p>∬ f(x, y) dS = ∬ f(Φ(u, v) |∂Φ/∂u × ∂Φ/∂v| du dv</p>
        <p>Tangent-S, on the other hand, uses both the Symbol Layout Tree (SLT) and the Operator Tree to represent
formulas; if it fails to parse the OPT, it can fall back to the SLT, which
only captures the topology of the nodes in a formula, so parentheses may remain
unpaired. This exposes one of our crucial weaknesses in searching formulas: we rely
heavily on well-defined parser rules to handle user-created data, and a parsing failure
results in zero recall in our system.</p>
        <p>Nevertheless, we have successfully demonstrated some advantages; for example, Table 7
compares results from our system and Tangent-S. We are able to identify commutative
operands and rank a highly relevant hit at the top. However, our result at rank 3 is not relevant,
because the exponent in the query is a fraction, while our returned result does not have
a fractional power, even though the number of matched operands in that case is large. In this particular
query, our NDCG’ score is not competitive with Tangent-S, because beyond the top results, Tangent-S
is also able to return partially relevant (relevance level 1) formulas such as (1 + √3)/2 at
lower ranks (not shown in the table), while our results at similar positions may match more
operands at the tree leaves but be less relevant due to missing the key symbolic variable
in the query, e.g., matching forms like (1 + ·)^{1/2}.</p>
      </sec>
      <sec id="sec-4-9">
        <title>4.9. Strength and Weakness</title>
        <p>In terms of effectiveness, our system retrieves formulas with awareness of math structure.
It is very effective in formula search: our structure search can be applied at the
very first stage of retrieval and produces highly effective results without reranking.</p>
        <p>However, as indicated by the Task 1 results, our method of handling text and math tokens together
is not ideal. In Task 2, some of our results are skewed by failures to parse
certain math markup; our OPT parser is less robust to user-created math content than an
SLT parser, because the OPT requires a higher level of semantic construction (e.g., pairing
parentheses and vertical bars in a math expression).</p>
        <p>
          So far, our learning-to-rank methods do not capture a
fine-grained level of math semantics; we could incorporate more lower-level features to further
exploit these methods. Also, we have not yet applied embeddings to formula retrieval. As
demonstrated by other recent systems [
          <xref ref-type="bibr" rid="ref7">7, 19, 20</xref>
          ], embeddings apply less strict matching than
substructure search and can often greatly improve effectiveness.
        </p>
        <p>
          Finally, although our formula search results are effective and the structure search pass employs
a dedicated dynamic pruning optimization [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for querying math formulas, we have not yet
reached the efficiency of text-based retrieval systems.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we have investigated different ways to combine our previous system Approach
Zero, based on substructure retrieval, with the full-text retrieval system Anserini, using
bag-of-words tokens. We have evaluated and compared the effect of merging results by different
token types (i.e., text-only, math-only, and mixed) and by different methods (i.e., concatenation
and linear interpolation). We demonstrate the usefulness of combining linear tokens into
structure-aware math information retrieval using OPT prefix paths.</p>
      <p>We also try query expansion techniques to assist the CQA task, reporting preliminary
evaluation results for math-aware search with the RM3 model and a new math keyword
expansion idea. We have also investigated using a few CQA task features to train models and rerank
search results, utilizing a small amount of labeled data. Our submissions to this year's formula retrieval task
achieved the best effectiveness over all official metrics. In the future, we need
to add a more tolerant and efficient parser so that we can handle user-created data more robustly.
We are interested in introducing data-driven models that target math retrieval more specifically.
Additionally, more features can be explored to achieve a greater effectiveness boost by learning
from existing labels.</p>
      <p>[14, cont.] UMass at TREC 2004: Novelty and HARD, Computer Science Department Faculty Publication
Series (2004) 189.
[15] C. J. C. Burges, K. M. Svore, Q. Wu, J. Gao, Ranking, Boosting, and Model Adaptation,
Technical Report MSR-TR-2008-109, 2008. URL: https://www.microsoft.com/en-us/research/publication/ranking-boosting-and-model-adaptation/.
[16] P. Donmez, K. M. Svore, C. J. Burges, On the local optimality of LambdaRank, in:
Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2009, pp. 460-467.
[17] Y. K. Ng, D. J. Fraser, B. Kassaie, G. Labahn, M. S. Marzouk, F. W. Tompa, K. Wang, Dowsing
for math answers with Tangent-L, in: International Conference of the Cross-Language
Evaluation Forum for European Languages (Working Notes), 2020.
[18] R. Zanibbi, K. Davila, A. Kane, F. W. Tompa, Multi-stage math formula search: Using
appearance-based similarity metrics at scale, in: Proceedings of the 39th International
ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 145-154.
[19] S. Peng, K. Yuan, L. Gao, Z. Tang, MathBERT: A pre-trained model for mathematical formula
understanding, arXiv preprint arXiv:2105.00377 (2021).
[20] Z. Wang, A. Lan, R. Baraniuk, Mathematical formula representation via tree embeddings,
online: https://people.umass.edu/~andrewlan/papers/preprint-forte.pdf (2021).</p>
    </sec>
    <sec id="sec-6">
      <title>A. Approach Zero Parameter Settings</title>
      <p>B. Official Results and Post Experiments (Task 1)
C. Official Results and Post Experiments (Task 2)</p>
    </sec>
    <sec id="sec-7">
      <title>Math Keyword Expansion Mappings</title>
      <p>Summary for the manual mapping rules from any markup containing a math token (on the left column)
to expansion text keywords (on the right column).</p>
      <p>Math Tokens → Mapped Term(s)
α, β, ... → alpha, beta, ...
R, N → rational number, natural number, ...
π → pi
0 → zero
∞ → infinity
=, ≠ → equality
&gt;, &lt;, ≤, ≥ → inequality
∫, ∮ → integral
∑ → summation
(fraction markup) → fraction
√ → root
∂ → partial, derivative
! → factorial
(mod ) → modular, mod
sin, cos, tan → sine, cosine, tangent
function/operator names → (corresponding names)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Anserini: Enabling the use of lucene for information retrieval research</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1253</fpage>
          -
          <lpage>1256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          , Overview of arqmath 2020:
          <article-title>Clef lab on answer retrieval for questions on math</article-title>
          ,
          <source>in: International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <source>Overview of arqmath-2</source>
          (
          <year>2021</year>
          ):
          <article-title>Second clef lab on answer retrieval for questions on math</article-title>
          ., in: International Conference of the CLEF Association (
          <article-title>CLEF</article-title>
          <year>2021</year>
          ),
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <article-title>On information retrieval metrics designed for evaluation with incomplete relevance assessments</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>11</volume>
          (
          <year>2008</year>
          )
          <fpage>447</fpage>
          -
          <lpage>470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Accelerating substructure similarity search for formula retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>714</fpage>
          -
          <lpage>727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Structural similarity search for formulas using leaf-root paths in operator subtrees</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Tangent-cft: An embedding model for mathematical formulas</article-title>
          ,
          <source>in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fraser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. W.</given-names>
            <surname>Tompa</surname>
          </string-name>
          ,
          <article-title>Choosing math features for bm25 ranking with tangent-l</article-title>
          ,
          <source>in: Proceedings of the ACM Symposium on Document Engineering</source>
          <year>2018</year>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Relevance-based language models</article-title>
          ,
          <source>in: ACM SIGIR Forum</source>
          , volume
          <volume>51</volume>
          , ACM New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Glenn Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Noll</surname>
          </string-name>
          , Fowler/Noll/Vo hash, www.isthe.com/chongo/tech/comp/fnv,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Lower-bounding term frequency normalization</article-title>
          ,
          <source>in: Proceedings of the 20th ACM international conference on Information and knowledge management</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kamphuis</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. P. de Vries</surname>
            , L. Boytsov,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Which bm25 do you mean? a largescale reproducibility study of scoring variants</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Lin,</surname>
          </string-name>
          <article-title>Pya0: A python toolkit for accessible math-aware search</article-title>
          ,
          <source>in: Proceedings of the 44th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdul-Jaleel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Larkey</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M. D.</given-names>
          </string-name>
          <string-name>
            <surname>Smucker</surname>
          </string-name>
          , C. Wade,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>