<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DPRL Systems in the CLEF 2021 ARQMath Lab: Sentence-BERT for Answer Retrieval, Learning-to-Rank for Formula Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Behrooz Mansouri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas W. Oard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Zanibbi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rochester Institute of Technology</institution>
          ,
          <addr-line>NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Maryland, College Park</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the Document and Pattern Recognition Lab from the Rochester Institute of Technology in the CLEF 2021 ARQMath lab. There are two tasks defined for ARQMath: (1) Question Answering, and (2) Formula Retrieval. Three systems were submitted for Task 1, all of which used two-stage retrieval models. First, a set of questions within the test collection that were similar to the query were found using a Sentence-BERT model that had first been trained on Quora Question Pairs and then fine tuned using duplicate question links found within the ARQMath test collection. Then in the second stage, answers given to those questions (identified using links within the collection) were ranked by one of three diferent similarity scores. For Task 2, five runs were submitted: one using only formula embedding, another using formula embedding followed by re-ranking based on tree-edit distance, the third run using Tangent-S, and the remaining two being alternative ways of reranking Tangent-S results using learning-to-rank techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Community Question Answering (CQA)</kwd>
        <kwd>Mathematical Information Retrieval (MIR)</kwd>
        <kwd>Math-aware search</kwd>
        <kwd>Math formula search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>0
a
^
1
2
+
0
b
^
=
(a)</p>
      <p>0
1 c
2
^
1
2
a
a
2
n
+
n
2
n
a
b
(b)
=
n
a
c
2
duplicate, referring the asker to the similar question(s) that had previously been posted to Math
Stack Exchange, where they can find relevant answers. Following a similar process, in all of
our runs we first find similar questions (using some fully automatic technique) and then we
rank only the answers given to those questions. Our systems difer in how the first and second
stages of that process are implemented.</p>
      <p>
        To find similar questions, we fine-tune a Sentence-BERT model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], using both related and
duplicate questions on Math Stack Exchange. For ranking the answers, we use three diferent
scoring functions: (1) the Math Stack Exchange answer scores that are available as metadata in
the test collection, (2) A computed score from a Sentence-BERT model trained on the
ARQMath1 relevance judgments that estimates the similarity between the Question and Answer pair
(QASim), and (3) a combination of scores from those two approaches.
      </p>
      <p>
        For Task 2, we modify our previous models Tangent-CFT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Tangent-CFTED [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
TangentCFT is an n-gram embedding model built on a linearized tree representation of formulae and
Tangent-CFTED re-ranks the results from Tangent-CFT using tree-edit distance. Our two new
runs for ARQMath, use learning-to-rank framework for mathematical formulae [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] trained
on ARQMath-1 topics. One run is trained only on the 29 training queries from ARQMath-1,
whereas the other was trained on all 77 ARQMath-1 formula queries. The Tangent-S [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] system
is our last (baseline) run. All our runs use only formula matching to find relevant formulae,
with no use of the text surounding those formulae in the question from which the query was
extracted or in the test collection post from which a potentially relevant formula instance was
extracted. Our models make use of both the Operator Tree (OPT) and Symbol Layout Tree (SLT)
representations of formulae, one encoding the syntax and the other appearance of a formula.
Figure 1 shows the OPT and SLT representations for the formula 2 + 2 = 2. In the OPT
representation, the edge labels for non-commutative operators indicate argument position.
In the SLT representation, the edge labels show the spatial relationship between the formula
elements. For instance, the edge label ‘a’ shows that the number ‘2’ is located above the variable
‘a’, while the edge label ‘n’ shows operator ‘+’ is located next after ‘a’.
      </p>
      <p>In this paper, we first describe our runs in Task 2, then discuss our proposed models for Task
1, and finish with conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 2: Formula Retrieval</title>
      <p>
        In Task 2, one formula inside each topic from Task 1 is selected and the participants are asked
to return a set of relevant formula instances from the questions and answers in the collection.
As described in the ARQMath 2021 overview [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to decide the relevance degree of a formula,
the context in which the topic and retrieved formula appear is important. In all our models, we
focus only on the structural matching of the formula, and text is ignored. All our models make
use of both OPT and SLT representations. Diferent approaches such as [
        <xref ref-type="bibr" rid="ref4 ref7 ref9">4, 7, 9</xref>
        ] have found this
beneficial for system efectiveness. Next, we describe our five runs.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Tangent-S</title>
        <p>
          Tangent-S3 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] was reported by the organizers as the Task-2 baseline system in both ARQMath
2020 and 2021, and we also use components or scores from Tangent-S in all of our runs. In
this system first a set of candidates are retrieved based on tuple similarity. Using depth-first
traversals, tuples with 4 elements are created as (parent, child, path, path-from-root [PFR]). The
parent and child are the node values in the form of Type!Value, where type can take values such
as Variable (V) or Number(N) and value shows the variable name or the numeric value. Path
shows the set edge labels visited connecting parent to child. Path-from-root shows edges labels
visited when moving from the root node to the parent node. Table 1 shows the tuples created
from SLT and OPT representations of the formula 2 + 2 = 2 from Figure 1. As the first node
is the root of the tree, the path-from-root is empty. The second tuple is showing that the node
with type Number and value ‘2’ has no children and we have reached end of baseline (eob). The
default edge label for the eob is ‘n’. After the tuples are generated, the harmonic mean of recall
and precision for matched tuples is used for ranking.
        </p>
        <p>To re-rank the candidates, three similarity scores are considered. The Maximum Subtree
Similarity (MSS) is computed from the largest connected match between query and candidate
formulae obtained using a greedy algorithm, evaluating pairwise alignments between trees
using unified node values. Two other similarity features used in this system are query node
matching after alignment, either with or without -Type unification. The SLT and OPT results</p>
        <sec id="sec-2-1-1">
          <title>3https://github.com/MattLangsenkamp/tangent-s</title>
          <p>Tangent-CFT
Encoder</p>
          <p>Tangent-CFT
Embedding Models</p>
          <p>Retrieval Result
[One Per Representation]
Preprocessing Step</p>
          <p>Formula vector
representations
are next combined via linear regression over alignment measures from each representation to
produce final similarity scores.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Tangent-CFT2</title>
        <p>
          The Tangent-CFT model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was the first embedding model introduced for mathematical formula
to use both SLT and OPT representations. In addition to the full representations (which include
both the type and the value of a node), this model also employed unification on the SLT to
produce a representation called SLT-Type. In the SLT-Type tree representation, only the type
of each node was represented; the corresponding values were ignored. For each of the three
representations, the model would then convert the tree representation to a vector, and then add
the 3 vectors to get the final representation of a formula. Our the process thus has the following
steps (refer to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for further details):
• Tuple Extraction: Presentation MathML and Content MathML representations (from
LaTeXML4) are used as a basis from which to generate internal SLT and OPT formula
representations. Built using depth-first traversals with the Tangent-S system, these
internal tree representations are strings consisting of a sequence of tuples, as described
above for the Tangent-S system. The only diference for Tangent-CFT is that the
pathfrom-root element produced by Tangent-S is ignored.
• Tuple Encoding: The tuples are then tokenized and enumerated. The tokenization is
based on type and value. For example the tuple (V!a, N!2, a), which represents 2, will be
tokenized to {‘V’, ’a’, ‘N’, ‘2’, ‘a’ }.
• Training Embedding Models with fastText: Mathematical formulae are diverse, with
relatively few training examples being available for any particular formula. For this
reason, Mansouri et al chose to apply [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] the multi-scale fastText [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] n-gram embedding
model get vector representations for each tuple. As the name suggests, fastText was
originally designed for text. To apply it to math, each token is treated as it it were a text
character, every whole tuple is treated as a text word, and the generated set of tuples is
        </p>
        <sec id="sec-2-2-1">
          <title>4https://dlmf.nist.gov/LaTeXML/</title>
          <p>OPT Retrieval Result</p>
          <p>RRF
treated as a text sentence. The final vector representation for a formula is thus obtained
by averaging the fastText representations for its individual tuples.</p>
          <p>
            While the overall pipeline for Tangent-CFT2 is the same as for Tangent-CFT, two modifications
were made to the architecture:
• Formula Representations. In Tangent-CFT2 we add another representation to the
previous model by considering the OPT-Type. This representation is similar to SLT-Type,
but for the operator tree.
• Combining the Results. After ARQMath 2020, we modified the result combination part
of the Tangent-CFT system. In the previous system, the 3 vectors from each representation
were added to get the final vector representation of a formula. In Tangent-CFT2, with
each of the four representations of SLT and OPT (full and -Type), first, the top-k results
are retrieved. This is done by computing the cosine similarity between the query vector
representation and the vector representation of formulae in the collection. Then the
four results are combined using modified Reciprocal Rank Fusion [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] with the following
formula:
 ( ∈  ) = ∑︁
∈
          </p>
          <p>
            ( )
 + ( )
(1)
where  is a set of formulae to be ranked,  is a set of models, and  and  are
the scores and the rank, respectively, of the retrieved formula by model . The
top1000 results from each representation are computed as the cosine similarity between the
vectors.
The Tangent-CFTED model was introduced in ARQMath 2020 [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. This model first retrieves a
set of candidate formulae using the Tangent-CFT model. It then re-ranks them using Tree-Edit
Distance (TED), similarly to Kamali and Tompa[
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Figure 3 shows an overview of this system.
In this section, we first review the Tangent-CFTED model and then describe the changes in the
new model.
          </p>
          <p>Tangent-CFTED uses tree-edit distance to compare the similarity between two formulae
using their tree representations. Tree-edit distance is the minimum cost of converting one tree
to the other using a set of edit operations. In this work, we consider three edit operations:
insertion, deletion, and substitution. Each operation has a unique weight. These weights are
similar in both versions, learned on NTCIR-12 [14] topics. To calculate tree-edit distance, only
the node values are used; the edge labels are ignored. The similarity measure is defined as
inverse tree-edit distance as follows:
(1, 2) =</p>
          <p>1
 (1, 2) + 1
.</p>
          <p>(2)</p>
          <p>
            In our model, we use both SLT and OPT representations and re-rank the candidates with
tree-edit distance. The results are then combined using RRF with equation 1. For more details
refer to [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. There are two modifications made in the new version:
• Selecting the candidates. The candidate formulae are selected using Tangent-CFT2.
• Combining the Results. To combine the re-ranked results from the SLT and OPT
representations, we again use RRF (equation 1). In the previous version, the results were
combined linearly with weights learned on the NTCIR-12 [14] dataset.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Learning-to-Rank</title>
        <p>Broadly speaking, three main approaches have been used for the formula retrieval task: full-tree
matching, sub-tree matching, and embedding models. In our learning-to-rank framework,
we make use of all these approaches, using results from instances of each approach to create
features for the SVM-rank [15] system for supervised learning to rank. We trained SVM-rank
two ways, once with the 29 ARQMath-1 training queries (LtR29), and the other time with all
77 queries from the ARQMath-1 collection (LtRall). The penalty for misclassification during
training (C) is 0.01, and the tolerance for termination criterion (epsilon) is 0.001. We computed
the following kinds of similarity measures as features:
• Tuple matching scores
• Maximum Sub-tree Similarity (MSS)
• Node Matching scores
• Unweighted tree edit distance scores
• Weighted tree edit distance scores
• Cosine similarity from Tangent-CFT model</p>
        <p>All features other than MSS were calculated on both OPT and SLT representations, with
and without unification. The MSS features were only calculated for the unified SLT-Type and
OPT-Type representations. The first three kinds of similarity features are from the Tangent-S
model, which uses sub-tree matching. The tree edit distance features were calculated with the
Tangent-CFTED system. For the weighted tree edit distance scores, we use the same weights
that Tangent-CFTED uses for retrieval.5</p>
        <p>5Note that for learning to rank we used the original Tangent-CFT and Tangent-CFTED models, not the
TangentCFT2 and Tangent-CFT2TED versions that we used for single-system submissions this year.
2.5. Results for Submitted Runs
Tables 2 and 3 show the DPRL runs results on Task 2, both for our oficial submitted runs (Table
2), and for runs obtained after correcting errors in how our systems were run (Table 3). In this
section we summarize results for our submitted runs, shown in Table 2.</p>
        <p>ARQMath-1 Results. For the ARQMath-1 topics, our learning-to-rank framework has better
efectiveness compared to the Tangent-S system, in particular for P ′@10 which is higher for all
of our runs. Interestingly, our learning-to-rank framework shows similar efectiveness when
trained on all training and test topics (LtRall) versus trained on only training topics (LtR29).
Using a one-way analysis of variance (ANOVA) test with a .05 significance level, the diferences
between our runs were not significant in terms of P ′@10 and MAP′. However, with nDCG′, the
diferences between Tangent-CFT2 and both our learning-to-rank models are significant (using
posthoc Tukey HSD Test,  &lt; 0.05).</p>
        <p>Tangent-CFT2, Tangent-CFT2TED, and Tangent-S take diferent approaches for formula
retrieval using embedding, full-tree, and sub-tree matching. Our learning-to-rank model uses
SVM-rank to linearly combine similarity measures from each of these models, in order to
overcome their individual limitations.</p>
        <p>In our next analysis, we compare three runs from the Tangent family on the ARQMath-1
topics, and show how learning-to-rank can help. Tangent-CFT2, being an n-gram embedding
model, focuses on retrieving formulae that share common n-grams with the input query. This
can be beneficial for formulae such as ∑︀=0 , which are small, and perhaps users are
sensitive to the variables or numbers they use in their formula. Both Tangent-S and
TangentCFT2TED return non-relevant formulae such as ∑︀
=0(− 1) in their top-10 results, which
have SLTs and OPTs that are similar to the query, but are not relevant. For this query, the P′@10
for Tangent-CFT2 was 0.9, 0.4 for Tangent-CFT2TED, and 0.6 for Tangent-S. Tangent-CFT2TED
uses tree-edit distance as a full-tree matching score.</p>
        <p>Looking at full trees provides better results for formulae such as:</p>
        <p>∅, {1}, {2}, {1, 2}, {3}, {1, 3}, {2, 3}, {1, 2, 3}, {4}, . . .
where partial matches are unlikely to provide useful information. The P′@10 values for this
query for Tangent [-CFT2TED,-CFT2,-S] are 0.6, 0.4 and 0.3, respectively. Finally, Tangent-S is a
system using sub-tree matching, and for complex formulae such as:
∫︁ ∫︁

 (, )  =
∫︁ ∫︁</p>
        <p>⃒⃒ Φ Φ ⃒⃒
 (Φ(, )⃒ ⃒
⃒⃒  ×  ⃒⃒
ifnding sub-matches can also be useful, returning highly relevant formulae such as:
∫︁ ∫︁
,
∫︁ ∫︁
,
 (, ) =</p>
        <p>( (, )) | (, )|,
that the other two models did not return in their top-1000 results. Tangent-S has P′@5 of 0.5
for this formula, whereas this value is 0.3 for both Tangent-CFT2 and Tangent-CFT2TED.</p>
        <p>From these examples, we can see that each of these models have their strengths and limitations.
With our learning-to-rank model, we re-rank Tangent-S results using similarity scores from
multiple retrieval models. For example, for the query lcm(1, 2) = gcd(112,2) Tangent-S ranks
non-relevant formulae such as  = lcm(1, 2, . . . , ), that share sub-trees with the query
in its top-10 results. Using our proposed learning-to-rank model, relevant formulae such as
· 
lcm(, ) = gcd(,) are pushed to the top-10 results. As can be seen, the first formula can
be converted to the second with a pair of substitutions ( for 1,  for 2) and removing the
multiplication dot in the second formula.</p>
        <p>The learning to rank models use only formula tree similarity features. However, there are
formulae that need more features and processing. For instance, for the query:
() ⇐⇒ ̸ ∃( ∈ )
there are relevant formulae such as ⊢ ∃∀( ∈/ ) that may not necessarily share SLT or OPT
structure with the query. Perhaps using canonicalization methods [16] can improve efectiveness
for these queries to convert them to one unified format. While we focused on structural similarity
features, there are still formulae for which the efectiveness is low. Textual features are another
missing part of our current model. There are queries such as  =  ( + 1), appearing in a
question related to diferential equations, for which returning a structurally similar formula
such as  =  () is considered non-relevant due to its appearance in a diferent context (a
post on another topic). We consider exploring these features in our future work.</p>
        <p>ARQMath-2 Results. The results for ARQMath-2 for our systems and the baseline are
substantially lower than for ARQMath-1, with the baseline outperforming all of our models in
nDCG′ and MAP′. This was due in part to changes in relevance assessment for Task 2 (see the
ARQMath-2 working notes overview paper for details [17]), and to errors made while computing
our runs, described in the next Section.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.6. Corrected Unoficial ARQMath-2 Post-Hoc Runs</title>
        <p>After ARQMath 2021, we determined that we had executed our embedding models incorrectly
for ARQMath-2 topics, which impacted all of our oficial Task 2 runs. Tangent-CFT2 uses
embeddings, Tangent-CFT2TED re-ranks Tangent-CFT2 results, and our learning to rank models
use cosine similarities over embedding vectors as features. We fixed the issue and recalculated
the efectiveness measures shown in Table 3.</p>
        <p>After correction, the learning-to-rank models improve the Tangent-S (baseline) results
similarly for both ARQMath-1 and ARQMath-2 topics (roughly 5-6% for nDCG′, 7-8% for MAP′,
and 9-12% for P′@10). For both topic sets, training our learning to rank models using all
ARQMath-1 topics versus just the 29 ARQMath-1 training topics produces similar results. We
choose to re-rank Tangent-S results for our learning to rank models, as Tangent-S had the
best performance for Task 2 systems at ARQMath-1. However, for ARQMath-2, this is not the
case. Re-ranking results from another system may have provided better re-ranked results. Our
strongest results were obtained by the Tangent-CFT2TED system, which had better efectiveness
than Tangent-CFT2. Tangent-CFT2TED reranks Tangent-CFT2 results using SLT and OPT tree
edit distances with RRF (see above).</p>
        <p>Even after correction, all of our runs have lower nDCG′ and MAP′ measures on ARQMath-2
topics than on ARQMath-1 topics. However, our highest  ′@10 is nearly identical, increasing
just slightly (by 0.3%) for Tangent-CFT2TED for ARQMath-2 topics vs. ARQMath-1 topics. Note
that our proposed models are all based on formula similarity and the context is ignored: there are
several queries where our models retrieve formulas with identical or nearly identical SLT/OPT
representations, but they are assessed as low relevance or not relevant due to diferences in
the posts where query and retrieved formulas appear (e.g., diferent data types and ranges
for variables). Some rather small changes can cause formulas to be deemed not-relevant, as
illustrated in Table 4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 1: Answer Retrieval</title>
      <p>In Task 1, the goal is to find relevant answers to a mathematical question. The topics are selected
among questions posted on Math Stack Exchange in 2020. The answers in the collection are
posted from 2010 to 2018.</p>
      <p>We had 3 runs for Task 1, all following a two-step retrieval model. First, similar questions
are retrieved. Then, all answers given to similar questions are ranked using 3 diferent scoring
functions. In all our runs, the mathematical formulae are represented using their original
LATEX strings. Finally, the answers are sorted in descending order by their vote scores. We
explain our two-step retrieval model next.</p>
      <sec id="sec-3-1">
        <title>3.1. STEP 1: Finding Similar Questions</title>
        <p>Math Stack Exchange provides links to related and duplicate questions. The related questions
have a similar topic, but they are not exactly the same question. The duplicate questions are
tagged by the Math Stack Exchange moderators of users with high reputation scores. A duplicate
question is a newly posted question that has been asked before on Math Stack Exchange.</p>
        <p>In our retrieval models, to first identify similar questions to a topic, we used the
SentenceBERT Cross-Encoders with the pre-trained model on the Quora question pairs dataset.6 The
model was trained on over 500,000 sentences with over 400,000 pairwise annotations indicating
whether two questions are a duplicate or not. Using this model, we did two-step fine-tuning.
First, we trained the model on both duplicate and related questions. Then, another fine-tuning
was done, using only the duplicate questions. For our training, we used the posts provided in
the ARQMath collection (from 2010 to 2018). In the first fine-tuning, 358,306 pairs, and in the
second, 57,670 pairs were used. In both cases, half of the pairs were positive samples and the
other half were the negative ones, chosen randomly from the collection.</p>
        <p>To train both models, we used multi-task learning, considering two loss functions: constrastive
[18] and multiple negatives ranking loss [19]. The constrastive loss function minimizes the
distance between positive pairs and maximizes it for negative ones, making it suitable for
classification tasks. The multiple negatives ranking loss function, however, considers only
positive pairs and minimizes the distance between positive pairs out of a large set of possible
candidates. We set the batch size to 64 and number of training epochs to 20. The maximum
sequence size was set to 128.</p>
        <p>Figure 4(a) shows the Cross-Encoder model trained for finding similar questions. In the first
ifne-tuning, a question title and body are concatenated. In the second fine-tuning, however, we
considered the same process for training, with three diferent inputs:
• Using the question title, with a maximum sequence length of 128 tokens.
• Using the first 128 tokens of question body.
• Using the last 128 tokens of question body.</p>
        <sec id="sec-3-1-1">
          <title>6https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs</title>
          <p>Classifier</p>
          <p>BERT
(a)</p>
          <p>QASim
Classifier</p>
          <p>BERT
(b)
Question 1</p>
          <p>Question 2</p>
          <p>Question</p>
          <p>Answer</p>
          <p>To find a similar question, we use the three models to separately retrieve the top-1000 most
similar questions. The retrieved results are then combined using RRF as shown in Eq. 1. We call
this similarity score Question-Question Similarity (QQSim).</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. STEP 2: Finding Related Answers</title>
        <p>After similar questions are found for a topic, all the answers given to them are compiled and
ranked based on the question and answer similarity. In each run, we used a diferent similarity
function as follows:
1. Math Stack Exchange score [MathSE]. Each post on Math Stack Exchange (MathSE)
has a score given by the users. The score is the diference between the positive and
negative votes given to that post. In our first run, we simply consider this score as an
indicator of answer relevance. We used MinMax normalization to map answer scores
to values between 0 to 1. Therefore, our final relevance score between a question and a
candidate answer is calculated as:
( , ) = ( , ) ·  ℎ()
(3)
where the  is the question topic and  is the candidate answer and  is the question
to which answer  was given.
2. Two-step Hierarchical Sentence-BERT [QASim]. ARQMath-1 results showed not
all the answers with a high score on Math Stack Exchange are relevant. An example is
shown in Table 5.</p>
        <p>Therefore, in our second run, we train Sentence-BERT Cross-encoder fine-tuned on
question and answer pairs from ARQMath-1 topics and their assessed hits. Our
pretrained model is Tiny-BERT with 6 layers trained on the “MS Marco Passage Reranking”
[20] task. The inputs are triplets of (Question, Answer, Relevance), where the relevance
is a number between 0 and 1. In ARQMath-1 evaluation [21], high and medium relevance
degrees were considered as relevant for precision-based measures. In our training, we</p>
        <p>Finding the last two digits of 999. . . 9 (nine 9s)
The last two digits of 999
At this point, it would seem to me the easiest thing to do is just do 99
mod 100 by hand. The computation should only take a few minutes.</p>
        <p>In particular, you can compute 93 and then cube that.
consider a relevance score of 1 for answers assessed with high or medium, 0.5 for low,
and 0 for non-relevant. As for the cross-encoder for Question-Question similarity, we
use the LATEX representation for formulae. The input question is the concatenation of
the question title and body. We use a batch size of 64, with 20 epochs and a maximum
sequence length of 128. We keep our loss functions similar to our Question-Question
model, using multi-task learning by constrastive and multiple negatives ranking loss
functions. After the training, the cross-encoder outputs the similarity of question and
answer, called QASim as shown in Figure 4(b). In this run, after similar questions are
found, the model predicts the similarity of question and answer pair, QASim. Our final
ranking score considers two similarity scores; between the question and answer, and also
the similarity between the topic question and the question to which the answer is given
(calculated in step 1). The ranking function is:
( , ) = ( , ) · ( , )
(4)
where  is the question to which answer  was given.
3. Combined model [RRF]. In our last run, we combine the similarity scores obtained
from the two previous runs using RRF as given in equation 1.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p>Table 6 shows our run results on ARQMath topics for Task 1. Using a one-way analysis of
variance (ANOVA) test with a .05 significance level, none of our runs on ARQMath-1 and
ARQMath-2 topics were significantly diferent in any of the efectiveness measures. The
top1000 results returned by all three runs difer only in their rank ordering. This is due to the efect
of question-question similarity in all our runs on the final similarity score. However, looking at
intersections of the top-10 annotated results for each run in Figure 5 (averaged over 77 topics),
our first and second runs, which use diferent scoring functions for the candidate answers, are
able to find a set of relevant answers the the other run cannot find. When combining the results,
these relevant answers are left out. For example, when considering all relevance degrees, on
average there are 2.1 relevant answers retrieved by our first run (that considers only the Math
Stack Exchange score), which are not included in the combined results (run RRF). Therefore, for
future work, we might consider a diferent strategy for combining the results.</p>
        <p>ARQMath provides diferent types of topics. Topics are categorized based on their dificulty
into hard, easy and medium. Another grouping is based on whether the topic is dependent on
the text, formula or both. The last category divides the questions based on their subject into
concept, computation and proof. We separate topics based on these categories and calculated
P′@10 for each group. The results are shown in Table 7. As shown in this table, re-ranking
results with the Cross-Encoder trained on ARQMath-1 topics can improve efectiveness for
text-dependent questions. The same efect can be seen for topics related to concepts. In this
category, 40% of topics are text-dependent and the other are dependent on both formula and
text. In contrast to concept-related questions, 50% of questions related to computation are
formula-dependent, causing low efectiveness for our models.</p>
        <p>For ARQMath-2 topics, we see a drop in all efectiveness measures, including for the baseline.
The increase in the ratio of topics that depend upon both text and formula may partly explain
this.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper describes the DPRL runs for the ARQMath lab at CLEF 2021. Five runs were submitted
for the Formula Retrieval task. For ARQMath-2, our initial formula retrieval runs were computed
incorrectly. In our corrected runs, the Tangent-CFT2TED model did better [17]. Our
learning-torank models consistently improved results, but re-ranking the baseline Tangent-S runs produced
substantially weaker results for ARQMath-2 than ARQMath-1 topics in terms of nDCG′ and
MAP′ metrics, perhaps because the initial Tangent-S results were weaker for ARQMath-2. Our
models retrieved formulae with similar or identical Symbol Layout Tree and Operator Tree
representations for some queries. However, when formulae appeared in a diferent context than
the formula query, they could be assessed as being of low relevance, or not relevant at all.</p>
      <p>For the Answer Retrieval task, three runs were submitted. In our runs, first, a set of similar
questions are found for a topic, and then answers given to them are ranked by criteria that
are specific to each run. To find similar questions we used Sentence-BERT Cross-encoder
ifned-tuned on the ARQMath Math Stack Exchange collection. For ranking candidate answers,
we used three approaches: (1) using Math Stack Exchange answer scores, (2) using similarity
of question and answer with Cross-encoder model trained on ARQMath-1 assessment, and (3)
using Reciprocal Rank Fusion to combine the two previous scores. The final relevance score is
calculated as the multiplication of question-question similarity and question-answer similarity
score. These runs were competitive (obtaining the second-highest nDCG′ for submitted runs
[17]), even though we treated formulas represented in LATEX as text.</p>
      <p>For future work, in the Formula Retrieval task, we aim to consider other similarity features
such as spatial features [22], to improve our learning-to-rank results. Also, text features are
ignored in our current formula retrieval models, and we aim to include them. For the Answer
Retrieval task, our work is in an early stage, and there are several possible directions. First, our
models are mostly trained with default parameters from the Sentence-BERT model and we plan
to do a grid search on our models’ parameters. Second, we represent formulae as LATEX strings,
and would like to use formula structure features within our model. One possible approach is
replacing formulae by their concept name where available, similar to identifying synonymns
in text search. For example, the formula 2 + 2 = 2 could be replaced with “Pythagorean
Theorem”.</p>
      <p>We have found ARQMath-2 to be an excellent opportunity to further develop our ideas, and
we look forward to ARQMath-3!</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank Matt Langsenkamp for providing the Tangent-S code on GitHub.
This material is based upon work supported by the Alfred P. Sloan Foundation under Grant No.
G-2017-9827 and the National Science Foundation (USA) under Grant No. IIS-1717997.
ings of the 36th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2013.
[14] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, K. Davila, Ntcir-12 mathir task
overview., in: NTCIR, 2016.
[15] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining, 2006.
[16] I. Normann, M. Kohlhase, Extended formula normalization for -retrieval and sharing of
mathematical knowledge, in: Towards Mechanized Mathematical Assistants, Springer,
2007.
[17] B. Mansouri, R. Zanibbi, D. W. Oard, A. Agarwal, Overview of arqmath-2 (2021):second
clef lab on answer retrieval for questionson math (working notes version), in: Proc.
International Conference of the Cross-Language Evaluation Forum for European Languages,
CEUR, 2021. Online.
[18] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with
application to face verification, in: 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), volume 1, IEEE, 2005.
[19] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos,
R. Kurzweil, Eficient natural language response suggestion for smart reply, arXiv preprint
arXiv:1705.00652 (2017).
[20] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO:</p>
      <p>A human generated machine reading comprehension dataset, in: CoCo@ NIPS, 2016.
[21] R. Zanibbi, D. W. Oard, A. Agarwal, B. Mansouri, Overview of ARQMath 2020: CLEF lab on
answer retrieval for questions on math, in: International Conference of the Cross-Language
Evaluation Forum for European Languages, Springer, 2020.
[22] R. Avenoso, Spatial vs. Graph-Based Formula Retrieval, Master’s thesis, Rochester Institute
of Technology, 2021. URL: https://scholarworks.rit.edu/theses/10784.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <source>Advancing Math-Aware Search: The ARQMath-2 lab at CLEF</source>
          <year>2021</year>
          , in: European Conference on Information Retrieval, Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Finding old answers to new math questions: the ARQMath lab at CLEF 2020</article-title>
          , in: European Conference on Information Retrieval, Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          , Sentence-BERT:
          <article-title>Sentence embeddings using Siamese BERTNetworks</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tangent-CFT</surname>
          </string-name>
          :
          <article-title>An embedding model for mathematical formulas</article-title>
          ,
          <source>in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>DPRL systems in the CLEF 2020 ARQMath lab</article-title>
          ,
          <source>in: Working Notes of CLEF 2020-Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <article-title>Learning to rank for mathematical formula</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Davila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <article-title>Layout and semantics: Combining representations for mathematical formula search</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , Overview of ARQMath-
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <article-title>: Second CLEF lab on answer retrieval for questions on math, in: International Conference of the Cross-Language Evaluation Forum for European Languages</article-title>
          ,
          <string-name>
            <surname>LNCS</surname>
          </string-name>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. Y.</given-names>
            <surname>Kristianto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Topic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <article-title>Mcat math retrieval system for ntcir-12 mathir task</article-title>
          .,
          <source>in: NTCIR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanibbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <article-title>Characterizing searches for mathematical concepts</article-title>
          ,
          <source>in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buettcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms condorcet and individual rank learning methods</article-title>
          ,
          <source>in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. W.</given-names>
            <surname>Tompa</surname>
          </string-name>
          ,
          <article-title>Retrieving documents with mathematical content</article-title>
          , in: Proceed-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>