=Paper=
{{Paper
|id=Vol-3180/paper-05
|storemode=property
|title=DPRL Systems in the CLEF 2022 ARQMath Lab: Introducing MathAMR for Math-Aware Search
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-05.pdf
|volume=Vol-3180
|authors=Behrooz Mansouri,Douglas W. Oard,Richard Zanibbi
|dblpUrl=https://dblp.org/rec/conf/clef/MansouriOZ22
}}
==DPRL Systems in the CLEF 2022 ARQMath Lab: Introducing MathAMR for Math-Aware Search==
DPRL Systems in the CLEF 2022 ARQMath Lab: Introducing MathAMR for Math-Aware Search

Behrooz Mansouri (1), Douglas W. Oard (2), Richard Zanibbi (1)
(1) Rochester Institute of Technology, NY, USA
(2) University of Maryland, College Park, USA

Abstract: There are two main tasks defined for ARQMath: (1) Question Answering, and (2) Formula Retrieval, along with a pilot task (3) Open Domain Question Answering. For Task 1, five systems were submitted using raw text with formulas in LaTeX and/or linearized MathAMR trees. MathAMR provides a unified hierarchical representation for text and formulas in sentences, based on the Abstract Meaning Representation (AMR) developed for Natural Language Processing. For Task 2, five runs were submitted: three of them using isolated formula retrieval techniques applying embeddings, tree edit distance, and learning to rank, and two using MathAMRs to perform contextual formula search, with BERT embeddings used for ranking. Our model with tree-edit distance ranking achieved the highest automatic effectiveness. Finally, for Task 3, four runs were submitted, which included the Top-1 results for two Task 1 runs (one using MathAMR, the other SVM-Rank with raw text and metadata features), each with one of two extractive summarizers.

Keywords: Community Question Answering (CQA), Mathematical Information Retrieval (MIR), Math-aware search, Math formula search

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 21–24, 2022, Bucharest, Romania. Contact: bm3302@rit.edu (B. Mansouri); oard@umd.edu (D. W. Oard); rxzvcs@rit.edu (R. Zanibbi). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The ARQMath-3 lab [1] at CLEF has three tasks. Answer retrieval (Task 1) and formula search (Task 2) are the tasks performed in ARQMath-1 [2] and -2 [3]. The ARQMath test collection contains Math Stack Exchange (MathSE, math.stackexchange.com) question and answer posts. In the answer retrieval task, the goal is to return a ranked list of relevant answers for new math questions. These questions are taken from posts made in 2021 on MathSE. The questions are not included in the collection (which has only posts from 2010 to 2018). In the formula search task, a formula is chosen from each question in Task 1 as the formula query. The goal in Task 2 is to find relevant formulas from both question and answer posts in the collection, with relevance defined by the likelihood of finding materials associated with a candidate formula that fully or partially answer the question that a formula query is taken from. The formula-specific context is used in making relevance determinations for candidate formulas (e.g., variable and constant types, and operation definitions), so that formula semantics are taken into account. This year a new pilot Open-Domain Question Answering task was also introduced, where for the same questions as in the Answer Retrieval task (Task 1), the participants were asked to extract or generate answers using data from any source.

The Document and Pattern Recognition Lab (DPRL) from the Rochester Institute of Technology (RIT, USA) participated in all three tasks. For Task 1, we have two categories of approaches. In the first approach, we search for relevant answers using Sentence-BERT [4] embeddings of raw text that includes the LaTeX representation for formulas given in the MathSE posts.
In the second approach, we use a unified tree representation of text and formulas for search. For this, we consider the Abstract Meaning Representation (AMR) [5] for text, representing formulas by identifiers as placeholders, and then integrating the Operator Tree (OPT) representation of formulas into our AMR trees, forming MathAMR. The MathAMR representations are then linearized as a sequence, and Sentence-BERT is used for retrieval. We trained Sentence-BERT on pairs of (query, candidate) formulas with known relevance, and ranked (query, candidate) pairs with unknown relevance to perform the search.

Our runs in Task 1 are motivated by a common user behavior on community question answering websites such as MathSE. When a new question is posted, moderators can mark the question as a duplicate if similar question(s) already exist. We would expect that good answers to a similar question are likely to be relevant to a newly posted question. Our goal is to use this strategy and make the process automatic. First, we aim to find similar questions for a given topic in Task 1, and then rank the answers given to those similar questions.

For Task 2, we submitted two types of approaches. For the first type, we consider only isolated formulas during search: the context in which formulas occur is ignored for both query and candidates, with similarity determined by comparing formulas directly. In the second type of approach, we use contextual formula search. In contextual formula search, not only is formula similarity important, but also the context in which formulas appear. As in Task 1, we make use of AMR for text and then integrate the OPT into the AMR. Finally, for Task 3, we select the first answers retrieved by two of our Task 1 runs, and then apply two extractive summarization models to each. These two summarizers select at most 3 sentences from each answer post returned by a Task 1 system.

In this paper, we first introduce the MathAMR representation, as it is used in our runs for all the tasks. We then explain our approaches for the formula search, answer retrieval, and open-domain question answering tasks.

2. MathAMR

Formula Representations: SLTs and OPTs. Previously, math-aware search systems primarily used two representation types for formulas: Symbol Layout Trees (SLTs) capture the appearance of the formula, while Operator Trees (OPTs) capture formula syntax [6]. In an SLT, nodes represent formula elements (including variables, operators, numbers, etc.), whereas the edge labels capture their spatial relationships. In an OPT, nodes are again the formula elements, but the edge labels indicate the order of the operands. For commutative operators such as '=', for which the order is not important, the edge labels are identical. Figure 1 shows the SLT and OPT representations for the formula $x^n + y^n + z^n$. In both representations, formula symbol types are given in the nodes (e.g., V! indicates that the type is a variable).

[Figure 1: SLT (a) and OPT (b) representations for $x^n + y^n + z^n$. The nodes in the SLT show the symbols and their types (with the exception of operators). The edge labels 'above' and 'next' show the spatial relationship between symbols. Nodes in the OPT show symbols and their types (U! for unordered (commutative) operators, O! for ordered operators, and V! for variable identifiers). OPT edge labels indicate the ordering of operands.]
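To make the tree structures concrete, the following sketch encodes the OPT of Figure 1(b) using a small, hypothetical Node class; this is our own illustration, not the Tangent-S or Tangent-CFT data structures.

```python
# Minimal illustration (not the authors' implementation) of the OPT in
# Figure 1(b) for x^n + y^n + z^n, using a hypothetical Node class.

class Node:
    def __init__(self, label, children=None):
        self.label = label                  # e.g. "U!plus", "O!SUP", "V!x"
        self.children = children or []      # ordered list of (edge_label, child)

# U!plus is an unordered (commutative) operator, so all of its edges share label "0";
# O!SUP is ordered: edge "0" is the base and edge "1" is the exponent.
opt = Node("U!plus", [
    ("0", Node("O!SUP", [("0", Node("V!x")), ("1", Node("V!n"))])),
    ("0", Node("O!SUP", [("0", Node("V!y")), ("1", Node("V!n"))])),
    ("0", Node("O!SUP", [("0", Node("V!z")), ("1", Node("V!n"))])),
])

def linearize(node):
    """Depth-first linearization of node labels (the same traversal order used later for MathAMR strings)."""
    return [node.label] + [tok for _, child in node.children for tok in linearize(child)]

print(" ".join(linearize(opt)))  # U!plus O!SUP V!x V!n O!SUP V!y V!n O!SUP V!z V!n
```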
In our SLT representation, there is no explicit node type for operators. SLT edge labels show the spatial relationship between symbols. For example, variable n is located above variable x, and the operator + is located next to variable x. As with an SLT, in an OPT the nodes represent the formula symbols. The difference is that in the OPT representation, operators have an explicit node type: unordered and ordered operators are shown with 'U!' and 'O!'. For further details, refer to Davila et al. [7] and Mansouri et al. [8].

Operator trees capture the operation syntax in a formula. The edge labels provide the argument order for operands. By looking at the operator tree, one can see which operator is being applied to which operands. This is very similar to the representation of text with Abstract Meaning Representations (AMR), which can roughly be understood as representing "who is doing what to whom" (https://github.com/amrisi/amr-guidelines/blob/master/amr.md#part-i-introduction).

Abstract Meaning Representation (AMR). AMRs are rooted Directed Acyclic Graphs (DAGs). AMR nodes represent two kinds of core concepts in a sentence: words (typically adjectives or stemmed nouns/adverbs), or frames extracted from PropBank [9] (http://propbank.github.io/). For example, in Figure 3, nodes such as 'you' and 'thing' are English words, while 'find-01' and 'solve-01' represent PropBank framesets. Labeled edges between a parent node and a child node indicate a semantic relationship between them. AMRs are commonly used in summarization [10, 11], question answering [12, 13], and information extraction [14, 15]. For example, Liu et al. [10] generated AMRs for sentences in a document, and then merged them by collapsing named and date entities. Next, a summary sub-graph was generated using integer linear programming, and finally summary text was generated from that sub-graph using JAMR [16]. Figure 2 shows an example summary of two sentences in their AMR representations [10]. There are two sentences: (a) "I saw Joe's dog, which was running in the garden." and (b) "The dog was chasing a cat." Figure 2(c) shows the summary AMR generated for sentences (a) and (b).

[Figure 2: AMR summarization example adapted from Liu et al. [10]. (a) AMR for the sentence 'I saw Joe's dog, which was running in the garden.' (b) AMR for a following sentence, 'The dog was chasing a cat.' (c) Summary AMR generated from the sentence AMRs shown in (a) and (b).]

[Figure 3: Generating MathAMR for the query "Find $x^n + y^n + z^n$ general solution" (ARQMath-2 topic A.289). (a) The AMR is generated with formulas replaced by single tokens holding ARQMath formula ids. (b) The OPT representation is generated for each formula. (c) The operator tree root node replaces the formula placeholder node. Note that in (c) the rest of the OPT is not shown due to space limitations.]
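To show what these graphs look like in machine-readable form, the sketch below writes the AMR of Figure 2(a) in PENMAN notation and reads it back with the penman Python library; the variable names in the graph string are our own choices, and the graph is a simplified rendering of the figure.

```python
# Sketch: the AMR of Figure 2(a) ("I saw Joe's dog, which was running in the garden.")
# written in PENMAN notation and inspected with the `penman` library.
import penman

amr_string = """
(s / see-01
   :ARG0 (i / i)
   :ARG1 (d / dog
            :poss (p / person
                     :name (n / name :op1 "Joe"))
            :ARG0-of (r / run-02
                        :location (g / garden))))
"""

graph = penman.decode(amr_string)
print(graph.top)                     # 's' -- the variable of the root concept (see-01)
for source, role, target in graph.triples:
    print(source, role, target)      # e.g. ('s', ':ARG0', 'i'), ('d', ':poss', 'p'), ...
```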
To generate AMRs from text, different parsers have been proposed. Graph-based parsers treat AMR parsing as a search for Maximum Spanning Connected Subgraphs (MSCGs) in an edge-labeled, directed graph of all possible relations; JAMR [16], the first AMR parser (developed in 2014), used that approach. Transition-based parsers such as CAMR [17], by contrast, first generate a dependency parse from a sentence and then transform it into an AMR graph using transition rules. Neural approaches instead view the problem as a sequence translation task, learning to directly convert raw text to linearized AMR representations. For example, the SPRING parser [18] uses a depth-first linearization of the AMR and treats AMR generation as a translation problem, translating raw text to linearized AMRs with a BART transformer model [19] whose tokenizer is modified to handle AMR tokens.

While AMRs can capture the meaning (semantics) of text, current AMR parsers fail to correctly parse math formulas. Therefore, in this work we introduce MathAMR. Considering the text "Find $x^n + y^n + z^n$ general solution" (the question title for topic A.289 in ARQMath-2 Task 1), Figure 3 shows the steps to generate the MathAMR for this query. First, each formula is replaced with a placeholder node that includes the identifier for that formula. In our example, we show this as "EQ:ID", where in practice ID would be the formula's individual identifier in the ARQMath collection. The modified text is then presented to an AMR parser. For our work we used the Python-based amrlib (https://github.com/bjascob/amrlib) with the model "model_parse_xfm_bart_large". Figure 3(a) shows the output AMR. Nodes are either words or concepts (such as 'find-01') from the PropBank framesets [9]. The edge labels show the relationships between the nodes. In this instance, 'arg0' indicates the subject and 'arg1' the object of the sentence. We introduce a new edge label 'math' to connect a formula's placeholder node to its parent. For further information on AMR notation, see [20]. Figure 3(b) is the OPT representation of the formula, which is integrated into the AMR by replacing the placeholder with the root of the OPT, thus generating what we call MathAMR. This is shown in Figure 3(c). To follow AMR conventions, we rename the edge labels from numbers to 'opX', where 'X' is the edge label originally used in the OPT. We use the 'op' edge label because, in AMR, edge labels capture both the relation and its ordering. In the next sections, we show how MathAMR is used for search.

This is an early attempt to introduce a unified representation of text and formulas using AMR, so we aim to keep our model simple and avoid other formula-related information that could have been used. For example, our current model only uses the OPT formula representation, whereas previous research has shown that the SLT representation can be helpful as well [8, 21, 22]. Also, we use an AMR parser trained on general text rather than text specific to math, which is a limitation of our work. For other domains, such as biomedical research, pre-trained AMR parsers exist [23].
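A minimal sketch of the MathAMR construction steps described above, assuming formulas are $-delimited in the post text; the helper for obtaining an OPT is hypothetical, and splicing the OPT into the parsed graph is only indicated in comments.

```python
# Sketch of MathAMR construction (illustrative, not the authors' exact code).
# Assumes: amrlib is installed with a parsing model, formulas are $-delimited,
# and an OPT (e.g., from a Tangent-S style parse) is available for each formula.
import re
import amrlib

def replace_formulas_with_placeholders(sentence):
    """Replace each $...$ region with an EQ:<id> placeholder token."""
    formulas = {}
    def _sub(match):
        fid = f"EQ:{len(formulas)}"
        formulas[fid] = match.group(1)
        return f" {fid} "
    return re.sub(r"\$(.+?)\$", _sub, sentence), formulas

stog = amrlib.load_stog_model()   # "sentence to graph" parsing model
text = "Find $x^n + y^n + z^n$ general solution"
masked, formulas = replace_formulas_with_placeholders(text)
amr_graph = stog.parse_sents([masked])[0]   # PENMAN string containing the EQ:0 placeholder
print(amr_graph)

# To form the MathAMR, the EQ:0 node would then be replaced by the OPT root,
# with OPT edge labels 0, 1, ... renamed to :op0, :op1, ..., and the whole tree
# linearized depth-first when producing the retrieval string.
```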
3. Task 2: Formula Retrieval

Because of our focus on formula representation in AMRs, we start by describing our Task 2 runs. In Section 4 we then draw on that background to describe our Task 1 runs.

In the formula retrieval task, participating teams were asked to return a set of relevant formulas for a given formula query. Starting in ARQMath-2 [24], the relevance criteria for Task 2 were defined in such a way that a formula's context has a role in defining its relevance. Therefore, in our ARQMath-3 Task 2 runs we use MathAMR to create a unified representation of formulas and text. In addition to the MathAMR model, we also report results from some of our previous isolated formula search models for this task, as they yielded promising results in ARQMath-1 and -2.

3.1. Isolated Formula Search Runs

For isolated formula search, we created three runs. These runs are almost identical to what we had in ARQMath-2, so we provide a brief summary of the systems, along with the differences compared to our previous year's systems. Please refer to Mansouri et al. [25] for more information.

TangentCFT-2: Tangent-CFT [8] is an embedding model for mathematical formulas that considers SLT and OPT representations of the formulas. In addition to these representations, two unified representations are considered in which only types are represented when present in SLT and OPT nodes, referred to as SLT TYPE and OPT TYPE. Tangent-CFT uses Tangent-S [21] to linearize the tree representations of the formulas. Tangent-S represents formulas using tuples comprised of symbol pairs along with the labeled sequence of edges between them. These tuples are generated separately for SLT and OPT trees. In Tangent-CFT we linearize the path tuples using depth-first traversals, tokenize the tuples, and then embed each tuple using an n-gram embedding model, implemented using fastText [26]. The final embedding of a formula SLT or OPT is the average of its constituent tuple embeddings. Our fastText models were trained on formulas in the ARQMath collection. Using the same pipeline as for training, each formula tree is linearized with Tangent-S, then tokenized with Tangent-CFT, and vector representations are extracted using the trained models. In our run, we use the MathFIRE (Math Formula Indexing and Retrieval with Elastic Search, https://gitlab.com/dprl/mathfire) system. In this system, formula vector representations are extracted with Tangent-CFT and then loaded into OpenSearch (https://opensearch.org/), where dense vector retrieval was performed by approximate k-NN search using nmslib and Faiss [27]. We used the default parameters. Note that in our previous Tangent-CFT implementations we had used exhaustive rather than approximate nearest neighbor search. As there are four retrieval results from four different representations, we combined the results using a modified Reciprocal Rank Fusion [28]:

$RRF_{score}(f \in F) = \sum_{m \in M} \frac{s_m(f)}{60 + r_m(f)}$    (1)

where $s_m(f)$ is the similarity score and $r_m(f)$ is the rank of candidate formula $f$ for representation $m$. As all the scores from retrieval with the different representations are cosine similarities with values in the interval [0, 1], we did not apply score normalization.
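A small sketch of this modified Reciprocal Rank Fusion (Equation (1)), assuming each representation's result list is given as (formula_id, cosine similarity) pairs sorted by decreasing score; the toy runs are placeholders.

```python
# Sketch of the modified Reciprocal Rank Fusion in Equation (1):
# each candidate's contribution is score / (60 + rank), summed over representations.
from collections import defaultdict

def modified_rrf(runs, k=60):
    """runs: list of ranked lists, each a list of (formula_id, similarity) sorted by similarity."""
    fused = defaultdict(float)
    for run in runs:
        for rank, (formula_id, score) in enumerate(run, start=1):
            fused[formula_id] += score / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Example with two toy runs (e.g., SLT-based and OPT-based retrieval results):
slt_run = [("f1", 0.95), ("f2", 0.90), ("f3", 0.40)]
opt_run = [("f2", 0.97), ("f1", 0.60)]
print(modified_rrf([slt_run, opt_run])[:3])
```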
TangentCFT2TED (Primary Run): Previous experiments have shown that Tangent-CFT can find partial matches better than it can find full-tree matches [29]. As it is an n-gram embedding model, it focuses on matching n-grams. Also, the vectors aim to capture features from n-grams that appear frequently next to each other, so the approach has less of a focus on structural matching of formulas. Therefore, in TangentCFT2TED, we rerank the top retrieval results from TangentCFT-2 using tree edit distance (TED). We considered three edit operations: deletion, insertion, and substitution. Note that in our work, we only consider the node values and ignore the edge labels. For each edit operation, we use weights learnt on the NTCIR-12 [22] collection (we have also trained weights on ARQMath-1, and those weights were similar to the ones trained on NTCIR-12). We use an inverse tree-edit distance score as the similarity score:

$sim(T_1, T_2) = \frac{1}{TED(T_1, T_2) + 1}$    (2)

The tree edit distance was used on both SLT and OPT representations, and the results were combined using modified Reciprocal Rank Fusion as in Equation (1). Because in ARQMath-2 this run had the highest nDCG′, this year we annotated it as our primary run, for which the hits are pooled to a greater depth.

Learning to Rank: Our third isolated formula search model is a learning to rank approach for formula search that we introduced in [29]. In this model, sub-tree, full-tree, and embedding similarity scores are used to train an SVM-rank model [30]. Our features are:
• Maximum Sub-tree Similarity (MSS) [7]
• Tuple and node matching scores [7]
• Weighted and unweighted tree edit distance scores [29]
• Cosine similarity from the Tangent-CFT model

All features, with the exception of MSS, were calculated using both OPT and SLT representations, both with and without unification of node values to types. The MSS features were calculated only on the unified OPT and SLT representations. MSS is computed from the largest connected match between the formula query and a candidate formula, obtained using a greedy algorithm that evaluates pairwise alignments between trees using unified node values. Tuple matching scores are calculated as the harmonic mean of the ratios of matching symbol pair tuples between a query and the candidates. The tuples are generated using the Tangent-S [21] system, which traverses the formula tree depth-first and generates tuples for pairs of symbols and their associated paths. For training, we used all topics from ARQMath-1 and -2, a total of about 32K pairs. Following our originally proposed approach in [29], we re-rank Tangent-S results using linearly weighted features, with weights defined using SVM-rank.
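To make Equation (2) concrete, the sketch below computes an inverse tree-edit-distance similarity between two small OPTs using the zss (Zhang-Shasha) package with unit edit costs; the actual system uses edit weights learned on NTCIR-12, which this sketch omits.

```python
# Sketch of Equation (2): inverse tree-edit-distance similarity between two OPTs,
# using the zss package (unit edit costs here; the real system uses learned weights).
from zss import Node, simple_distance

def opt(label, *children):
    node = Node(label)
    for child in children:
        node.addkid(child)
    return node

# OPTs for x^n + y^n (query) and x^n + z^n (candidate); edge labels ignored, as in the paper.
query = opt("U!plus",
            opt("O!SUP", opt("V!x"), opt("V!n")),
            opt("O!SUP", opt("V!y"), opt("V!n")))
candidate = opt("U!plus",
                opt("O!SUP", opt("V!x"), opt("V!n")),
                opt("O!SUP", opt("V!z"), opt("V!n")))

ted = simple_distance(query, candidate)   # 1: a single substitution V!y -> V!z
similarity = 1.0 / (ted + 1.0)            # Equation (2) gives 0.5
print(ted, similarity)
```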
3.2. Contextual Formula Search Runs

Some previous approaches to combined formula and text search combined separate search results from isolated formula search and text retrieval models. For example, Zhong et al. [31] combined retrieval results from an isolated formula search engine, Approach0 [32], with the Anserini toolkit for text. Similarly, Ng et al. [33] combined retrieval results from Tangent-L [34] and BM25+. In the work of Krstovski et al. [35], by contrast, equation embeddings produced unified representations by linearizing formulas as tuples and treating them as tokens in the text. These equation embeddings used a context window around formulas and a word embedding model [36] to construct vector representations for formulas.

While our first category of runs focused on isolated formula search, in our second category we made use of the formula's context. Our reasoning is based in part on the potential complementarity between different sources of evidence for meaning, and in part on the relevance definition for Task 2, where a candidate formula's interpretation in the context of its post matters. In particular, it is possible for a formula identical to the query to be considered not relevant. As an example from ARQMath-2, for the formula query $x^n + y^n + z^n$ (B.289), some exact matches were considered irrelevant because, for that formula query, x, y, and z could (according to the question text) be any real numbers. The assessors thus considered as not relevant all exact matches in pooled posts in which x, y, and z referred not to real numbers but specifically to integers.

Therefore, in our contextual formula search model we use MathAMR for search. For each candidate formula, we considered the sentence in which the formula appeared, along with the sentence before and after it (if available), as the context. We then generated a MathAMR for each candidate. To generate MathAMRs for query formulas, we used the same procedure with the same context window. For matching, we made use of Sentence-BERT Cross-Encoders [4]. To use Sentence-BERT, we traversed the MathAMR depth-first. For simplicity, we ignored the edge labels in the AMRs. For Figure 3(c) (including the full OPT from Figure 3(b)), the linearized MathAMR string is:

find-01 you thing general-02 solve-01 equal-01 plus SUP z n SUP y n SUP x n imperative

To train a Sentence-BERT cross-encoder, we used the pre-trained 'all-distilroberta-v1' model and trained for 10 epochs with a batch size of 16 and a maximum sequence length of 256. Note that our final model is the one obtained after 10 epochs on the training set; we did not use a separate validation set. For training, we made use of all the available pairs from ARQMath-1 and -2 topics. For labeling the data, high and medium relevance formula instances were labeled 1, low relevance instances were labeled 0.5, and non-relevant instances 0. For retrieval, we re-rank the candidates retrieved by the Tangent-CFT2TED system (top-1000 results). Note that candidates are ranked only by the similarity score from our Sentence-BERT model.

Our fifth run combined the search results from the MathAMR and Tangent-CFT2TED systems. This was motivated by the fact that sentences with multiple formulas yield the same MathAMR representation for each of those formulas, meaning that all formulas in such a sentence receive the same matching score. Because Tangent-CFT2TED performs isolated formula matching, combining these results helps avoid this issue. For combining the results, we normalize the scores to the range 0 to 1 with Min-Max normalization and use modified Reciprocal Rank Fusion as given in Equation (1).

Additional Unofficial Post Hoc Run. For the MathAMR model in our official submission, we used three sentences as the context window: the sentence in which the candidate formula appears, plus one sentence before and one after it. For this additional run, we changed the context window size and considered only the sentence in which the formula appears as the context. Also, in our previous approach with a context window of three sentences, we simply split the question post containing the query formula at periods (.) in the body text, and then chose the sentence with the candidate formula. However, a sentence can end with other punctuation such as '?'. Also, formulas are delimited within LaTeX by '$', and these formula regions commonly contain sentence punctuation. To address these two issues, in our additional run we first move any punctuation (. , ! ?) from the end of a formula region to after the final delimiter. Then, we use spaCy (https://spacy.io/) to split paragraphs into sentences and choose the sentence in which a formula appears.
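A rough sketch of that preprocessing step, assuming $-delimited formula regions; the regular expression and the use of spaCy's rule-based sentencizer are illustrative choices rather than the exact configuration used.

```python
# Sketch: move trailing punctuation out of $...$ formula regions, then split the
# paragraph into sentences with spaCy and keep the sentence containing a given formula.
import re
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")   # rule-based sentence boundaries; a full pipeline could be used instead

def move_formula_punctuation(text):
    # "$x^2+1.$" -> "$x^2+1$." so the sentence splitter can see the punctuation.
    return re.sub(r"\$([^$]*?)([.,!?])\s*\$", r"$\1$\2", text)

def sentence_with_formula(paragraph, formula):
    cleaned = move_formula_punctuation(paragraph)
    for sent in nlp(cleaned).sents:
        if formula in sent.text:
            return sent.text
    return None

paragraph = "Let $f(x)=x^2+1.$ Is $f$ uniformly continuous on $I=(0,\\infty)$?"
print(sentence_with_formula(paragraph, "$I=(0,\\infty)$"))
```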
After obtaining the results with a context window size of one, we also consider the modified reciprocal rank fusion of this system with Tangent-CFT2TED as another additional post hoc run.

Table 1: DPRL runs for Formula Retrieval (Task 2) on ARQMath-1 (45 topics) and ARQMath-2 (58 topics) for topics used in training (i.e., test-on-train). The Data column indicates whether isolated math formulas, or both math formulas and surrounding text, are used in retrieval.

Run | Data | ARQMath-1 nDCG′ | ARQMath-1 MAP′ | ARQMath-1 P′@10 | ARQMath-2 nDCG′ | ARQMath-2 MAP′ | ARQMath-2 P′@10
LtR | Math | 0.733 | 0.532 | 0.518 | 0.550 | 0.333 | 0.491
Tangent-CFT2 | Math | 0.607 | 0.438 | 0.482 | 0.552 | 0.350 | 0.510
Tangent-CFT2TED | Math | 0.648 | 0.480 | 0.502 | 0.569 | 0.368 | 0.541
MathAMR | Both | 0.651 | 0.512 | 0.567 | 0.623 | 0.482 | 0.660
Tangent-CFT2TED+MathAMR | Both | 0.667 | 0.526 | 0.569 | 0.630 | 0.483 | 0.662

3.3. Experiment Results

This section describes the results of our runs on the ARQMath-1, -2, and -3 Task 2 topics.

ARQMath-1 and -2 Progress Test Results. Table 1 shows the results of our progress test runs on ARQMath-1 and -2 topics. Note that because some of our systems are trained using relevance judgments for ARQMath-1 and -2 topics, those results should be interpreted as training results rather than as a clean progress test, since some models (and in particular MathAMR) may be over-fit to this data. (This test-on-train condition was requested by the ARQMath organizers: all systems were to be run on ARQMath-1 and -2 topics in the same configuration as on ARQMath-3 topics.)

To compare isolated vs. contextual formula search, we look at results from the Tangent-CFT2TED and MathAMR runs. Using MathAMR can be helpful specifically for formulas for which context is important. For example, in the query $f(x) = \frac{1}{1+\ln^2 x}$ (B.300 from ARQMath-2), the formula is described as "is uniformly continuous on $I = (0, \infty)$". Similar formulas such as $g(x) = \frac{1}{x \ln^2 x}$, which in isolation are less similar to the query, are not ranked in the top-10 results from Tangent-CFT2TED. However, with MathAMR, because this formula has the text "is integrable on $[2, \infty)$" in its context, it was ranked 4th. The P′@10 for this query is 0.1 for Tangent-CFT2TED and 0.7 for MathAMR.

In contrast, there were cases where P′@10 was lower for MathAMR than for Tangent-CFT2TED. As an example, for the formula query B.206, appearing in the title of a question as "I'm confused on the limit of $(1 + \frac{1}{n})^n$", a low relevance formula appearing in a similar context, "Calculate limit of $(1 + \frac{1}{n^2})^n$", gets a higher rank with MathAMR than with Tangent-CFT2TED. Both the query and the candidate formula appear in question titles, and the surrounding text provides no additional useful information beyond the word 'limit'. We can therefore consider this another limitation of our current model: we do not distinguish between formula queries that are or are not dependent on the surrounding text, and no pruning is applied to the MathAMR to remove information that is not necessarily helpful.

ARQMath-3 Results. Table 2 shows the Task 2 results on ARQMath-3. Tangent-CFT2TED achieved the highest nDCG′ among our models, significantly better than the other representations, except Tangent-CFT2TED+MathAMR (p < 0.05, t-test with Bonferroni correction).
Table 2: DPRL runs for Formula Retrieval (Task 2) on ARQMath-3 (76 topics). Tangent-CFT2TED is our primary run.

Run | Data | nDCG′ | MAP′ | P′@10
Tangent-CFT2TED | Math | 0.694 | 0.480 | 0.611
Tangent-CFT2 | Math | 0.641 | 0.419 | 0.534
Tangent-CFT2TED+MathAMR | Both | 0.640 | 0.388 | 0.478
LtR | Math | 0.575 | 0.377 | 0.566
MathAMR | Both | 0.316 | 0.160 | 0.253
Additional Unofficial Post Hoc:
Tangent-CFT2TED+MathAMR | Both | 0.681 | 0.471 | 0.617
MathAMR | Both | 0.579 | 0.367 | 0.549

We compare our Tangent-CFT2TED model with MathAMR, looking at the effect of using context. One obvious pattern is that using MathAMR can help with topics for which variables are important. For example, in topic B.326, $F = P \oplus T$, P is a projective module and F is a free module. There are instances retrieved in the top-10 results by Tangent-CFT2TED, such as $V = A \oplus B$, where the variables refer to different concepts (in that case, a k-dimensional subspace). With MathAMR, formulas such as $P \oplus Q = F$ appearing in a post that specifically says "If P is projective, then $P \oplus Q = F$ for some module P and some free module F." (text similar to the topic) are ranked in the top-10 results.

For cases where Tangent-CFT2TED has better effectiveness, two patterns are observed. In the first pattern, the formula is specific and the variables do not have further specifications. In the second pattern, the context is not helpful for retrieval (it provides no useful information). For instance, for topic B.334, "logarithm proof for $a^{\log_a(b)} = b$", the formula on its own is informative enough. Low relevance formulas appearing in a context such as "When I tried out the proof, the final answer I ended up with was $a^{\log_b n}$" are ranked in the top-10 results because the context contains the word 'proof' and part of the formula.

Combining the results of Tangent-CFT2TED and MathAMR with our modified RRF provided better P′@10 than one of the individual systems for only 10% of the topics. For topic B.338, appearing in the title of a question as "Find all integer solutions of equation $y = \frac{a+bx}{b-x}$", both Tangent-CFT2TED and MathAMR had a P′@10 of 0.6; however, combining the results with modified RRF increases P′@10 to 0.9. Table 3 shows the top-10 results for Tangent-CFT2TED+MathAMR, along with the original ranks in the Tangent-CFT2TED and MathAMR runs. As can be seen, there are relevant formulas to which either Tangent-CFT2TED or MathAMR gave a lower rank, but the other system provided a better ranking, and combining the systems with our modified RRF improved the results.

Table 3: Top-10 formulas retrieved by Tangent-CFT2TED+MathAMR, along with their ranks in the original Tangent-CFT2TED and MathAMR runs, for topic B.338, appearing in a question post title as "Find all integer solutions of equation $y = \frac{a+bx}{b-x}$". For space, sentences for formula hits (used by MathAMR) are omitted.

Top-10 Formula Hits | Relevance Score | Tangent-CFT2TED Rank | MathAMR Rank
1. $y = \frac{a+bx}{c+x}$ | 2 | 1 | 10
2. $y = \frac{a+bx}{x+c}$ | 2 | 3 | 88
3. $y = \frac{a+x}{b+cx}$ | 2 | 8 | 8
4. $y = \frac{a+bx}{c+dx}$ | 2 | 2 | 30
5. $y = \frac{bx}{x-a}$ | 1 | 29 | 5
6. $y = \frac{ax+b}{cx+d}$ | 3 | 53 | 2
7. $y = \frac{b+dx}{1-b-dx}$ | 2 | 4 | 42
8. $g(x) = \frac{a+bx}{b+ax}$ | 2 | 7 | 31
9. $y = \lvert\frac{b+cx}{a+x}\rvert$ | 2 | 27 | 9
10. $y = \frac{b-x}{1-bx}$ | 2 | 19 | 14

4. Task 1: Answer Retrieval

The goal of the ARQMath answer retrieval task is to find relevant answers to mathematical questions in a collection of MathSE answer posts. These are new questions that were asked after the posts in the ARQMath collection (i.e., not during 2010-2018). Our team provided 5 runs for this task (ARQMath teams are limited to 5 submissions). Two of our runs considered only text and the LaTeX representation of the formulas. Two other runs used strings created by depth-first MathAMR tree traversals.
Our fifth run combined the retrieval results from two of these runs, one from each approach.

4.1. Raw Text Approaches

We submitted two runs that use the raw text, with formulas represented in LaTeX. In both runs, we first find questions similar to the given Task 1 question, then compile all the answers given to those questions and re-rank them based on their similarity to the question.

4.1.1. Candidate Selection by Question-Question Similarity

To identify questions similar to a topic, we started with a model pre-trained on the Quora question pairs dataset (https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs), and then fine-tuned that model to recognize question-question similarity in actual ARQMath questions. To obtain similar training questions we used the links in the ARQMath collection (i.e., the data from 2010-2018 that predates the topics that we are actually searching) to related and duplicate questions. Duplicate questions are marked by MathSE moderators as having been asked before, whereas related questions are marked as similar to, but not exactly the same as, the given question. We applied two-step fine-tuning: first using both related and duplicate questions, and then fine-tuning more strictly using only duplicate questions. We used 358,306 pairs in the first round and 57,670 pairs in the second round.

For training, we utilized a multi-task learning framework provided by Sentence-BERT, used previously for detecting duplicate Quora questions in the 'distilbert-base-nli-stsb-quora-ranking' model. This framework combines two loss functions. First, the contrastive loss [37] minimizes the distance between positive pairs and maximizes the distance between negative pairs, making it suitable for classification tasks. The second loss function is the multiple negatives ranking loss [38], which considers only positive pairs, minimizing the distance between positive pairs out of a large set of possible candidates, making it suitable for ranking tasks. We expected that with these two loss functions we could distinguish well between relevant and non-relevant candidates, and also rank the relevant candidates by the order of their relevance degrees. We set the batch size to 64 and the number of training epochs to 20. The maximum sequence length was set to 128. In our training, half of the samples were positive and the other half were negative, randomly chosen from the collection. In the first fine-tuning, a question title and body are concatenated. In the second fine-tuning, we used the same training process with three different inputs:
• Using the question title, with a maximum sequence length of 128 tokens.
• Using the first 128 tokens of the question body.
• Using the last 128 tokens of the question body.

To find similar questions, we used the three models to separately retrieve the top-1000 most similar questions. The results were combined by choosing the maximum similarity score for each candidate question across the three models.
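A condensed sketch of this multi-task fine-tuning with the sentence-transformers library; the training pairs shown are placeholders, and details such as warmup steps, evaluation, and the three-input variants are omitted.

```python
# Sketch: fine-tuning a Sentence-BERT bi-encoder for question-question similarity
# with two objectives (contrastive + multiple negatives ranking), as supported by
# the sentence-transformers multi-task training interface.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-nli-stsb-quora-ranking")
model.max_seq_length = 128

# Placeholder training data: duplicate/related question pairs from the collection.
contrastive_examples = [
    InputExample(texts=["How to prove sum of cos(2*pi*k/n) is 0?",
                        "Prove that the sum of cos(2*pi*n/N) equals 0"], label=1),
    InputExample(texts=["How to prove sum of cos(2*pi*k/n) is 0?",
                        "What is the derivative of x^x?"], label=0),
]
positive_pairs = [InputExample(texts=ex.texts) for ex in contrastive_examples if ex.label == 1]

contrastive_loader = DataLoader(contrastive_examples, shuffle=True, batch_size=64)
mnr_loader = DataLoader(positive_pairs, shuffle=True, batch_size=64)

model.fit(
    train_objectives=[
        (contrastive_loader, losses.ContrastiveLoss(model)),
        (mnr_loader, losses.MultipleNegativesRankingLoss(model)),
    ],
    epochs=20,
)
model.save("qq-sim-model")
```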
4.1.2. Candidate Ranking by Question-Answer Similarity

Having a set of candidate answers given to similar questions, we re-rank them differently in our two runs, as explained below.

QQ-QA-RawText. In our first run, we used QASim (Question-Answer Similarity) [25], which achieved our highest nDCG′ value in ARQMath-2. Our training procedure is the same as for our ARQMath-2 system, but this time we added ARQMath-2 training pairs to those from ARQMath-1. For questions, we used the concatenation of title and body, and for the answer we chose only the answer body. For both questions and answers, the first 256 tokens are used. For ranking, we compute the similarity score between the topic question and the answer, and the similarity score between the topic question and the answer's parent question; we multiply those two similarity scores to get the final similarity score. Our pre-trained model is Tiny-BERT, with 6 layers, trained on the "MS MARCO Passage Reranking" [39] task. The inputs are triplets of (Question, Answer, Relevance), where the relevance is a number between 0 and 1. In ARQMath, high and medium relevance degrees are considered relevant for precision-based measures. Based on this, for training, answers from ARQMath-1 and -2 assessed as high or medium relevance got a relevance label of 1, a label of 0.5 was given to those with low relevance, and 0 was given to non-relevant answers. For the system details refer to [25].

SVM-Rank (Primary Run). Previous approaches to the answer retrieval task have shown that information such as question tags and votes can be useful in finding relevant answers [33, 31]. We aimed to make use of these features and study their effect on retrieval. In this second run (which we designated as our primary Task 1 run), we considered 6 features:
• the Question-Question similarity (QQSim) score;
• the Question-Answer similarity (QASim) score;
• the number of comments on the answer;
• the answer's MathSE score (i.e., upvotes − downvotes);
• a binary field showing whether the answer is marked as selected by the asker (as the best answer to their question);
• the percentage of topic question post tags that the question associated with an answer post also contains (which we refer to as question tag overlap).

Note that we did not apply normalization to feature value ranges. We trained a ranking SVM model [30] using all the assessed pairs from ARQMath-1 and -2, calling the result SVM-Rank. After training, we found that QQSim, QASim, and tag overlap were the most important features, with weights 0.52, 2.42 and 0.05, respectively, while the weights for the other features were less than 0.01. Both our QQ-QA-RawText and SVM-Rank models have the same first-stage retrieval, using Sentence-BERT to find similar questions; the candidates are then ranked differently. While both approaches make use of Question-Question and Question-Answer similarity scores (using Sentence-BERT), the second approach considers additional features and learns weights for the features using ARQMath-1 and -2 topics.
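As an illustration of how a linear ranking model of this kind scores a candidate answer, the sketch below applies the three dominant weights reported above; the feature values are hypothetical, and the actual run was trained and applied with SVM-rank rather than re-implemented this way.

```python
# Sketch: linear scoring of a candidate answer with the dominant SVM-Rank weights
# reported in the text (QQSim 0.52, QASim 2.42, tag overlap 0.05). The feature
# values below are placeholders; in the real system they come from the
# Sentence-BERT models and MathSE post metadata.

WEIGHTS = {"qq_sim": 0.52, "qa_sim": 2.42, "tag_overlap": 0.05}

def svm_rank_score(features):
    """Linear combination of feature values with learned weights."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

candidate = {"qq_sim": 0.81, "qa_sim": 0.64, "tag_overlap": 0.5}   # hypothetical values
print(svm_rank_score(candidate))  # 0.52*0.81 + 2.42*0.64 + 0.05*0.5 ~= 1.995
```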
4.2. MathAMR Approaches

In our second category of approaches, we made use of our MathAMR representation, providing two runs. As in our raw text-based approach, retrieval comprises two stages: identifying candidates from answers to questions similar to a topic question, and then ranking candidate answers by comparing them with the topic question.

4.2.1. Candidate Selection by Question-Question Similarity

In our first step, we find similar questions for a given question in Task 1. For this, we focus only on the question title. Our intuition is that AMR was designed to capture meaning from a sentence; as question titles are usually just one sentence, we assume that similar questions can be found by comparing the AMR representations of their titles. Following our approach in Task 2, we generated a MathAMR for each question's title. The MathAMRs are then linearized using a depth-first traversal. We used the model that we trained on raw text for question-question similarity as our pre-trained model, although in this case we trained on the question titles. We used the known duplicates from the ARQMath collection (2010-2018) to fine-tune our model on the linearized AMRs of questions, using a process similar to that for raw text.

4.2.2. Candidate Ranking by Question-Answer Similarity

Answers to similar questions are ranked in two ways for our two Task 1 AMR runs.

QQ-MathSE-AMR. Using a question title's MathAMR, we find the top-1000 similar questions for each topic. Starting from the most similar question and moving down the list, we compile the answers given to the similar questions. The answers for each similar question are ranked based on their MathSE score (i.e., upvotes − downvotes). To determine the similarity score of a topic and an answer, we used the reciprocal of the answer's rank after obtaining the top-1000 answers. Note that this approach does not use any topics from ARQMath-1 or -2 for training.

QQ-QA-AMR. This run is similar to our QQ-QA-RawText run, but instead of raw text representations, we use MathAMR representations. For the similarity of questions, we only use the question titles, while for the similarity of a question and an answer we use the first 128 tokens of the linearized MathAMR from the post bodies of the question and the answer. We trained a Sentence-BERT model and performed retrieval similarly to our QASim model, with two differences: (1) we used 'all-distilroberta-v1' as the pre-trained model, and (2) instead of raw text we use linearized MathAMR. The Sentence-BERT parameters such as the number of epochs, batch size, and loss function are the same. Our Sentence-BERT design is similar to the QASim model we had used for raw text in ARQMath-2 [25]. We used both ARQMath-1 and -2 topics from Task 1 for training.

For our fifth run, we combined the results from our SVM-Rank model (from the raw text approaches) and QQ-QA-AMR (from the MathAMR approaches) using modified reciprocal rank fusion, naming that run RRF-AMR-SVM.

Additional Unofficial Post Hoc Runs. In ARQMath-2 (2021), we had two other runs using raw text representations that we also include here for ARQMath-3 topics, using post hoc scoring (i.e., without these runs having contributed to the judgement pools). One is our 'QQ-MathSE-RawText' run, which uses question-question (QQ) similarity to identify similar questions and then ranks answers associated with similar questions using MathSE scores (upvotes − downvotes). The similarity score was defined as:

$Relevance(Q_T, A) = QQSim(Q_T, Q_A) \cdot MathSE_{score}(A)$    (3)

where $Q_T$ is the topic question, $A$ is a candidate answer, and $Q_A$ is the question to which answer $A$ was given. The other is our 'RRF-QQ-MathSE-QA-RawText' run, which combines retrieval results from two systems, 'QQ-MathSE-RawText' and 'QQ-QA-RawText', using our modified reciprocal rank fusion. A third additional unofficial post hoc run that we scored locally is 'QQ-MathSE(2)-AMR'. To find similar questions, this model uses exactly the same model as 'QQ-MathSE-AMR'; however, for ranking the answers, instead of the ranking function used for 'QQ-MathSE-AMR', it uses the ranking function in Equation (3). Finally, we fixed an error in the 'QQ-QA-RawText' model and report corrected results. Because this model feeds two other runs, 'SVM-Rank' and 'RRF-AMR-SVM', we report corrected results for those systems as well.
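A brief sketch of the Equation (3) scoring used by these post hoc runs; the candidate records are placeholder values.

```python
# Sketch of Equation (3): rank candidate answers by the product of the
# question-question similarity and the answer's MathSE score (upvotes - downvotes).

def relevance(qq_sim, upvotes, downvotes):
    return qq_sim * (upvotes - downvotes)

# Hypothetical candidates: (answer_id, QQSim(topic, parent question), upvotes, downvotes)
candidates = [("a1", 0.92, 14, 1), ("a2", 0.75, 40, 2), ("a3", 0.97, 3, 0)]
ranked = sorted(candidates, key=lambda c: relevance(c[1], c[2], c[3]), reverse=True)
print([answer_id for answer_id, *_ in ranked])   # ['a2', 'a1', 'a3']
```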
4.3. Experiment Results

ARQMath-1 and -2 Results. Table 4 shows the results of our progress test runs for Task 1 on ARQMath-1 and -2 topics. As with our Task 2 progress test results, these should be interpreted as training results rather than as a clean progress test, since some models may be over-fit to this data. Note that the runs in each category (Raw Text and MathAMR) have the same set of candidates to rank, which may lead to similar effectiveness measures.

Table 4: DPRL progress test runs for Answer Retrieval (Task 1) on ARQMath-1 (71 topics) and ARQMath-2 (77 topics) for topics used in training (test-on-train). All runs use both text and math information. Stage-1 selects answer candidates that are then ranked in Stage-2. SVM-Rank is the primary run.

Run | Stage-1 Selection | Stage-2 Ranking | ARQMath-1 nDCG′ | ARQMath-1 MAP′ | ARQMath-1 P′@10 | ARQMath-2 nDCG′ | ARQMath-2 MAP′ | ARQMath-2 P′@10
QQ-QA-AMR | QQ-MathAMR | QQSIM x QASIM (MathAMR) | 0.276 | 0.180 | 0.295 | 0.186 | 0.103 | 0.237
QQ-MathSE-AMR | QQ-MathAMR | MathSE | 0.231 | 0.114 | 0.218 | 0.187 | 0.069 | 0.138
QQ-QA-RawText | QQ-RawText | QQSIM x QASIM (RawText) | 0.511 | 0.467 | 0.604 | 0.532 | 0.460 | 0.597
SVM-Rank | QQ-RawText | SVM-Rank | 0.508 | 0.467 | 0.604 | 0.533 | 0.460 | 0.596
RRF-AMR-SVM | — | — | 0.587 | 0.519 | 0.625 | 0.582 | 0.490 | 0.618

ARQMath-3 Results. Table 5 shows the DPRL Task 1 results on ARQMath-3 topics along with the Linked MSE posts baseline that our models aim to automate. Our highest nDCG′ and MAP′ are achieved by our additional unofficial 'QQ-MathSE-RawText' run, while our highest P′@10 is achieved by our unofficial corrected 'QQ-QA-RawText' run.

Table 5: DPRL runs for Answer Retrieval (Task 1) on ARQMath-3 (78 topics) along with the Linked MSE posts baseline. SVM-Rank is the primary run. For post hoc runs, (C) indicates a corrected run and (A) an additional run. Linked MSE posts is a baseline system provided by the ARQMath organizers.

Run | Stage-1 Selection | Stage-2 Ranking | nDCG′ | MAP′ | P′@10
Linked MSE posts | - | - | 0.106 | 0.051 | 0.168
SVM-Rank | QQ-RawText | SVM-Rank | 0.283 | 0.067 | 0.101
QQ-QA-RawText | QQ-RawText | QQSIM x QASIM (RawText) | 0.245 | 0.054 | 0.099
QQ-MathSE-AMR | QQ-MathAMR | MathSE | 0.178 | 0.039 | 0.081
QQ-QA-AMR | QQ-MathAMR | QQSIM x QASIM (MathAMR) | 0.185 | 0.040 | 0.091
RRF-AMR-SVM | - | - | 0.274 | 0.054 | 0.022
Post Hoc Runs:
QQ-QA-RawText (C) | QQ-RawText | QQSIM x QASIM (RawText) | 0.241 | 0.030 | 0.151
SVM-Rank (C) | QQ-RawText | SVM-Rank | 0.296 | 0.070 | 0.101
RRF-AMR-SVM (C) | - | - | 0.269 | 0.059 | 0.106
QQ-MathSE-RawText (A) | QQ-RawText | MathSE | 0.313 | 0.147 | 0.087
RRF-QQ-MathSE-QA-RawText (A) | - | - | 0.250 | 0.067 | 0.110
QQ-MathSE(2)-AMR (A) | QQ-MathAMR | MathSE(2) | 0.200 | 0.044 | 0.100

Comparing the QQ-QA models using MathAMR and raw text, raw text provided better P′@10 for 41% of topics, while MathAMR achieved a higher P′@10 for 21% of topics. In all categories of dependency (text, formula, or both), using raw text was on average more effective than MathAMR. The best effectiveness for MathAMR was when questions were text dependent, with an average P′@10 of 0.12 over the 10 assessed topics dependent on text. Considering topic types, for both computation and proof topics, P′@10 was 0.10 and 0.06 higher, respectively, using raw text rather than MathAMR. For concept topics, P′@10 was almost the same for the two techniques. Considering topic difficulty, only for hard questions did MathAMR do even slightly better numerically than raw text by P′@10, with just a 0.01 difference. Among those topics that did better at P′@10 using MathAMR, 94% were hard or medium difficulty topics.
To further analyze our approaches, we look at the effect of the different representations on individual topics. With both raw text and MathAMR, selecting candidates is done by first finding similar questions. Considering the titles of questions to find similar questions, there are cases where MathAMR can be more effective because it considers OPT representations. For example, this happens for topic A.328, with the title "Proving $\sum_{k=1}^{n} \cos\frac{2\pi k}{n} = 0$". Table 6 shows the titles of the top-5 similar questions for that topic. As seen in this table, MathAMR representations retrieved two similar questions (at ranks 3 and 4) that have similar formulas, whereas raw text failed to retrieve those formulas in its top-5 results. The P′@10 on that topic for the QASim model using MathAMR was 0.5, whereas with raw text it was 0.1.

Table 6: Titles of the top-5 most similar questions found with MathAMR and raw text, for the topic question with title "Proving $\sum_{k=1}^{n} \cos\frac{2\pi k}{n} = 0$".

Rank | MathAMR | RawText
1 | Prove that $\sum_{n=1}^{N} \cos(2\pi n/N) = 0$ | How to prove $\sum_{k=1}^{n} \cos(\frac{2\pi k}{n}) = 0$ for any $n > 1$?
2 | How to prove $\sum_{k=1}^{n} \cos(\frac{2\pi k}{n}) = 0$ for any $n > 1$? | How to prove that $\sum_{k=0}^{n-1} \cos(\frac{2\pi k}{n} + \varphi) = 0$
3 | Proving that $\sum_{x=0}^{n-1} \cos(k + x\frac{2\pi}{n}) = \sum_{x=0}^{n-1} \sin(k + x\frac{2\pi}{n}) = 0$. | Prove that $\sum_{n=1}^{N} \cos(2\pi n/N) = 0$
4 | $\sum_{k=0}^{n-1} \cos(\frac{2\pi k}{n}) = 0 = \sum_{k=0}^{n-1} \sin(\frac{2\pi k}{n})$ | Understanding a step in applying deMoivre's Theorem to $\sum_{k=0}^{n} \cos(k\theta)$
5 | $\sum \cos$ when angles are in arithmetic progression | $\sum \cos$ when angles are in arithmetic progression

5. Task 3: Open Domain Question Answering

Open domain question answering is a new pilot task introduced in ARQMath-3. The goal of this task is to provide answers to the math questions in any way, based on any sources. The Task 3 topics are the same as those used for Task 1. Our team created four runs for this task, each having the same architecture. All four of our runs use extractive summarization, where a subset of sentences is chosen from the answer to form a summary of it; this subset hopefully contains the important parts of the answer. The organizers provided one run using GPT-3 [40] from OpenAI as the baseline system.

We built our runs from two Task 1 runs, "SVM-Rank" and "QQ-QA-AMR", by simply truncating the result set for each topic after the first post, and then applying one of two BERT-based summarizers to the top-1 answer for each question in each run. For summarizers, we used one that we call BERT, which uses 'bert-large-uncased' [41], and a second called Sentence-BERT (SBERT) [4], implemented using an available Python library (https://pypi.org/project/bert-extractive-summarizer/) with its 'paraphrase-MiniLM-L6-v2' model. Both summarizers split answers into sentences and then embed each sentence using BERT or Sentence-BERT. The sentence vectors are then clustered into k groups using k-means clustering, after which the sentence closest to each of the k cluster centroids is returned unaltered, in order, in the generated response. We set k to 3, meaning that all sentences are returned for posts with up to three sentences, and exactly three sentences are returned for posts with four or more sentences.
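A compact sketch of that clustering-based extraction with sentence-transformers and scikit-learn; sentence splitting is assumed to have been done already, the input sentences are illustrative placeholders, and ties between centroids are not handled.

```python
# Sketch: extractive summarization by clustering sentence embeddings with k-means
# and keeping the sentence nearest to each centroid, in original order.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def extractive_summary(sentences, k=3, model_name="paraphrase-MiniLM-L6-v2"):
    if len(sentences) <= k:
        return sentences
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)                       # (n_sentences, dim)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    chosen = set()
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        chosen.add(int(np.argmin(distances)))
    return [sentences[i] for i in sorted(chosen)]              # keep original order

# Illustrative placeholder input (not a verbatim MathSE answer).
answer_sentences = [
    "No, we can find consecutive composites that are not of this form.",
    "The point of n! is just that it is a very divisible number.",
    "You can list many consecutive composite numbers explicitly this way.",
    "You can also use the least common multiple of the numbers between 1 and n.",
]
print(extractive_summary(answer_sentences, k=3))
```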
Results. The results of our Task 3 runs are reported in Table 7. As seen in this table, our results are not comparable to the baseline system. The highest Average Relevance (AR) and P@1 are achieved by using the Sentence-BERT summarizer to summarize the top-1 answer retrieved with the SVM-Rank model for Task 1.

Table 7: DPRL runs for Open Domain Question Answering (Task 3) on ARQMath-3 (78 topics). For this task, we submitted the top hit from each run (i.e., a MathSE answer post) that was then passed through a BERT-based summarizer. All runs use both math and text to retrieve answers. GPT-3 is provided by the organizers as the baseline system.

Run | Avg. Rel. | P@1
GPT-3 (Baseline) | 1.346 | 0.500
SBERT-SVMRank | 0.462 | 0.154
BERT-SVMRank | 0.449 | 0.154
SBERT-QQ-AMR | 0.423 | 0.128
BERT-QQ-AMR | 0.385 | 0.103

Answers extracted by BERT and Sentence-BERT from the top-1 SVM-Rank answers were only different for 13 of the 78 assessed topics, and P@1 was identical for each of those topics. For the models using AMR, P@1 differed between BERT and Sentence-BERT for 3 topics, although 19 topics had different sentences extracted. In two of those three cases, Sentence-BERT included examples in the extracted answer, resulting in a higher P@1 in both cases compared to BERT. Table 8 shows the answers extracted for topic A.325, which has the title "Find consecutive composite numbers", with the BERT and Sentence-BERT summarizers, where the answers are highly relevant and low relevant, respectively. The only case in which P@1 for the Sentence-BERT summarizer was lower than that of the BERT summarizer with the "QQ-QA-AMR" model was a case in which the answer extracted by Sentence-BERT was not rendered correctly and thus was not assessed, which in Task 3 was scored as non-relevant.

Table 8: Sentence-BERT vs. BERT extracted summaries of the first answer retrieved by the QQ-QA-AMR model for topic A.325 (topic title: "Find consecutive composite numbers").

BERT: "No, we can find consecutive composites that are not of this form. The point of $n!$ is just that it is a 'very divisible number'."
SBERT: "No, we can find consecutive composites that are not of this form. For example the numbers $n! \cdot 2 + 2, n! \cdot 2 + 4, \dots, n! \cdot 2 + n$ or $n! \cdot 3 + 2, n! \cdot 3 + 3, \dots, n! \cdot 3 + n$. Also $k n! + 2, k n! + 3, \dots, k n! + n$ works for all $k > 0 \in \mathbb{Z}$. You can also get smaller examples if instead of using $n!$ we use the least common multiple of the numbers between 1 and $n$."

6. Conclusion

This paper has described the DPRL runs for the ARQMath lab at CLEF 2022. Five runs were submitted for the Formula Retrieval task. These runs used isolated or contextual formula search models. Our model with tree-edit distance ranking had the highest effectiveness among the automatic runs. For the Answer Retrieval task, five runs were submitted using raw text or a new unified representation of text and math that we call MathAMR. While our models provided better effectiveness than the baseline model we were aiming to automate, their results were less effective than those of other participating teams. For the new Open Domain Question Answering task, four runs were submitted, each of which summarizes the first result from an Answer Retrieval run (using either MathAMR or raw text) with extractive summarization. The models using raw text were more effective.

Acknowledgments

We thank Heng Ji and Kevin Knight for helpful discussions about AMR and multimodal text representations. This material is based upon work supported by the Alfred P. Sloan Foundation under Grant No. G-2017-9827 and the National Science Foundation (USA) under Grant No. IIS-1717997.

References [1] B. Mansouri, A. Agarwal, D. W. Oard, R. Zanibbi, Advancing Math-Aware Search: The ARQMath-3 Lab at CLEF 2022, in: European Conference on Information Retrieval, Springer, 2022. [2] R.
Zanibbi, D. W. Oard, A. Agarwal, B. Mansouri, Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020. [3] B. Mansouri, A. Agarwal, D. Oard, R. Zanibbi, Advancing Math-Aware Search: The ARQMath-2 lab at CLEF 2021, in: European Conference on Information Retrieval, Springer, 2021. [4] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT- Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [5] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider, Abstract Meaning Representation for Sembanking, in: Proceed- ings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 2013. [6] R. Zanibbi, D. Blostein, Recognition and retrieval of mathematical expressions, Int. J. Doc- ument Anal. Recognit. 15 (2012) 331–357. URL: https://doi.org/10.1007/s10032-011-0174-4. doi:10.1007/s10032-011-0174-4. [7] K. Davila, R. Zanibbi, A. Kane, F. W. Tompa, Tangent-3 at the NTCIR-12 MathIR task, in: N. Kando, T. Sakai, M. Sanderson (Eds.), Proceedings of the 12th NTCIR Conference on Eval- uation of Information Access Technologies, National Center of Sciences, Tokyo, Japan, June 7-10, 2016, National Institute of Informatics (NII), 2016. URL: http://research.nii.ac.jp/ntcir/ workshop/OnlineProceedings12/pdf/ntcir/MathIR/06-NTCIR12-MathIR-DavilaK.pdf. [8] B. Mansouri, S. Rohatgi, D. W. Oard, J. Wu, C. L. Giles, R. Zanibbi, Tangent-CFT: An Embedding Model for Mathematical Formulas, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 2019. [9] P. R. Kingsbury, M. Palmer, From TreeBank to PropBank., in: LREC, 2002. [10] F. Liu, J. Flanigan, S. Thomson, N. Sadeh, N. A. Smith, Toward Abstractive Summarization Using Semantic Representations, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015. [11] K. Liao, L. Lebanoff, F. Liu, Abstract Meaning Representation for Multi-Document Sum- marization, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018. [12] P. Kapanipathi, I. Abdelaziz, S. Ravishankar, S. Roukos, A. Gray, R. F. Astudillo, M. Chang, C. Cornelio, S. Dana, A. Fokoue-Nkoutche, et al., Leveraging Abstract Meaning Repre- sentation for Knowledge Base Question Answering, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. [13] W. Xu, H. Zhang, D. Cai, W. Lam, Dynamic Semantic Graph Construction and Reasoning for Explainable Multi-hop Science Question Answering, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. [14] S. Garg, A. Galstyan, U. Hermjakob, D. Marcu, Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016. [15] Z. Zhang, H. Ji, Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. [16] J. Flanigan, S. Thomson, J. G. Carbonell, C. Dyer, N. A. 
Smith, A Discriminative Graph- based Parser for the Abstract Meaning Representation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014. [17] C. Wang, N. Xue, S. Pradhan, A transition-based algorithm for AMR parsing, in: Pro- ceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015. [18] M. Bevilacqua, R. Blloshmi, R. Navigli, One SPRING to Rule them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline, in: Proceedings of AAAI, 2021. [19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle- moyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Gener- ation, Translation, and Comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. [20] K. Knight, B. Badarau, L. Baranescu, C. Bonial, M. Bardocz, K. Griffitt, U. Hermjakob, D. Marcu, M. Palmer, T. O’Gorman, et al., Abstract Meaning Representation (AMR) Annotation Release 3.0 (2021). URL: https://catalog.ldc.upenn.edu/LDC2020T02. [21] K. Davila, R. Zanibbi, Layout and Semantics: Combining Representations for Mathematical Formula Search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017. [22] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, K. Davila, NTCIR-12 MathIR Task Overview., in: NTCIR, 2016. [23] J. May, J. Priyadarshi, Semeval-2017 Task 9: Abstract Meaning Representation Parsing and Generation, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017. [24] B. Mansouri, R. Zanibbi, D. W. Oard, A. Agarwal, Overview of ARQMath-2 (2021): Second CLEF Lab on Answer Retrieval for Questions on Math, in: International Conference of the Cross-Language Evaluation Forum for European Languages, LNCS, Springer, 2021. [25] B. Mansouri, D. W. Oard, R. Zanibbi, DPRL Systems in the CLEF 2021 ARQMath Lab: Sentence-BERT for Answer Retrieval, Learning-to-Rank for Formula Retrieval (2021). [26] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Transactions of the Association for Computational Linguistics 5 (2017). [27] J. Johnson, M. Douze, H. Jégou, Billion-Scale Similarity Search with GPUs, IEEE Transac- tions on Big Data (2019). [28] G. V. Cormack, C. L. Clarke, S. Buettcher, Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009. [29] B. Mansouri, R. Zanibbi, D. W. Oard, Learning to Rank for Mathematical Formula, in: Pro- ceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. [30] T. Joachims, Training Linear SVMs in Linear Time, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. [31] W. Zhong, X. Zhang, J. Xin, J. Lin, R. Zanibbi, Approach Zero and Anserini at the CLEF- 2021 ARQMath Track: Applying Substructure Search and BM25 on Operator Tree Path Tokens, CLEF, 2021. [32] W. Zhong, J. Lin, PYA0: A Python Toolkit for Accessible Math-Aware Search, in: Proceed- ings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. [33] Y. K. Ng, D. J. Fraser, B. Kassaie, F. W. 
Tompa, Dowsing for Math Answers, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2021. [34] D. Fraser, A. Kane, F. W. Tompa, Choosing Math Features for BM25 Ranking with Tangent-L, in: Proceedings of the ACM Symposium on Document Engineering 2018, 2018. [35] K. Krstovski, D. M. Blei, Equation embeddings, arXiv preprint arXiv:1803.09123 (2018). [36] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, Advances in Neural Information Processing Systems (2013). [37] S. Chopra, R. Hadsell, Y. LeCun, Learning a Similarity Metric Discriminatively, with Application to Face Verification, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE, 2005. [38] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, R. Kurzweil, Efficient Natural Language Response Suggestion for Smart Reply, arXiv preprint arXiv:1705.00652 (2017). [39] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, in: CoCo@ NIPS, 2016. [40] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language Models are Few-Shot Learners, 2020. [41] D. Miller, Leveraging BERT for Extractive Text Summarization on Lectures, arXiv preprint arXiv:1906.04165 (2019).