Overview of ARQMath-3 (2022): Third CLEF Lab on Answer Retrieval for Questions on Math (Working Notes Version)

Behrooz Mansouri1, Vít Novotný3, Anurag Agarwal1, Douglas W. Oard2 and Richard Zanibbi1
1 Rochester Institute of Technology, NY, USA
2 University of Maryland, College Park, USA
3 Faculty of Informatics, Masaryk University, Czech Republic

Abstract
This paper provides an overview of the third and final year of the Answer Retrieval for Questions on Math (ARQMath-3) lab, run as part of CLEF 2022. ARQMath has aimed to introduce test collections for math-aware information retrieval. ARQMath-3 has two main tasks, Answer Retrieval (Task 1) and Formula Search (Task 2), along with a new pilot task, Open Domain Question Answering (Task 3). Nine teams participated in ARQMath-3, submitting 33 runs for Task 1, 19 runs for Task 2, and 13 runs for Task 3. Tasks, topics, evaluation protocols, and results for each task are presented in this lab overview.

Keywords
Community Question Answering, Open Domain Question Answering, Mathematical Information Retrieval, Math-aware Search, Math Formula Search

CLEF'22: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
bm3302@rit.edu (B. Mansouri); witiko@mail.muni.cz (V. Novotný); axasma@rit.edu (A. Agarwal); oard@umd.edu (D. W. Oard); rxzvcs@rit.edu (R. Zanibbi)

1. Introduction

Math information retrieval (Math IR) aims at facilitating the access, retrieval and discovery of math resources, and is needed in many scenarios [1]. For example, many traditional courses and Massive Open Online Courses (MOOCs) release their resources (books, lecture notes and exercises, etc.) as digital files in HTML or XML. However, due to the specific characteristics of math formulae, classic search engines do not work well for indexing and retrieving math. Math-aware search systems can be beneficial for learning activities. Students can search for references to help solve problems, increase knowledge, reduce doubts, and clarify concepts. Instructors might also benefit from these systems by creating learning communities within a classroom. For example, a teacher can pool different digital resources to create the subject matter and then let students search through them for mathematical notation and terminology. Math-aware search engines can also help researchers identify potentially useful systems, fields, and collaborators. Good examples of this interdisciplinary approach benefiting physics include the AdS/CFT correspondence and holographic duality theories.

A key focus of mathematical searching is formulae. In contrast to simple words or other objects, a formula can have a well-defined set of properties, relations, applications, and often also a 'result'. There are many (mathematically) equivalent formulae which are structurally quite different. For example, it is of fundamental importance to ask what information a user wants when searching for $x^2 + y^2 = 1$: is it the value of the variables $x$ and $y$ that satisfy this equation, all indexed objects that contain this formula, all indexed objects containing $a^2 + b^2 = 1$, or the geometric figure that is represented by this equation?
This third Answer Retrieval for Questions on Math (ARQMath-3) lab at the Conference and Labs of the Evaluation Forum (CLEF) completes our development of test collections for Math IR from content found on Math Stack Exchange,1 a Community Question Answering (CQA) forum. This year, ARQMath continues its two main tasks: Answer Retrieval for Math Questions (Task 1) and Formula Search (Task 2). We also introduce a new pilot task, Open Domain Question Answering (Task 3). Using the question posts from Math Stack Exchange, participating systems are given a question (in Tasks 1 and 3) or a formula from a question (in Task 2), and asked to return a ranked list of either potential answers to the question (Task 1) or potentially useful formulae (Task 2). For Task 3, given the same questions as Task 1, the participating systems also provide an answer, but are not limited to searching the ARQMath collection to find that answer. Relevance is determined by the expected utility of each returned item. These tasks allow participating teams to explore leveraging math notation together with text to improve the quality of retrieval results.

1 https://math.stackexchange.com/

2. Related Work

Prior to ARQMath, three test collections were developed over a period of five years at the NII Testbeds and Community for Information Access Research (NTCIR) shared task evaluations. To the best of our knowledge, NTCIR-10 [2] was the first shared task on Math IR, considering three scenarios for searching:

• Formula Search: find similar formulae for the given formula query.
• Formula+Text Search: search the documents in the collection with a combination of keywords and formula queries.
• Open Information Retrieval: search the collection using text queries.

NTCIR-11 [3] considered the formula+text search task as the main task and introduced an additional Wikipedia open subtask, using the same set of topics with a different collection and different evaluation methods. Finally, in NTCIR-12 [4], the main task was formula+text search on two different collections. A second task was Wikipedia Formula Browsing (WFB), focusing on formula search. Formula similarity search (the simto task) was a third task, where the goal was to find formulae 'similar' (not identical) to the formula query.

An earlier effort to develop a test collection started with the Mathematical REtrieval Collection (MREC) [5], a set of 439,423 scientific documents that contained more than 158 million formulae. This was initially only a collection, with no shared relevance judgments (although the effectiveness of individual systems was measured by manually assessing a set of topics). The Cambridge University MathIR Test Collection (CUMTC) [6] subsequently built on MREC, adding 160 test queries derived from 120 MathOverflow discussion threads (although not all queries contained math). CUMTC relevance judgments were constructed using citations to MREC documents cited in MathOverflow answers.

To the best of our knowledge, ARQMath's Task 1 is the first Math IR test collection to focus directly on answer retrieval. ARQMath's Task 2 (formula search) extends earlier work on formula search, with several improvements:

• Scale. ARQMath has an order of magnitude more assessed topics than prior formula search test collections. There are 22 topics in NTCIR-10, and 20 in NTCIR-12 WFB (+20 variants with wildcards).
• Contextual Relevance. In the NTCIR-12 WFB task [4], there was less attention to context. ARQMath Task 2, by contrast, has evolved as a contextualized formula search task, where relevance is defined both by the query and retrieved formulae and also the contexts in which those formulae appear.
• Deduplication. NTCIR collections measured effectiveness using formula instances. In ARQMath we clustered visually identical formulae to avoid rewarding retrieval of multiple instances of the same formula.
• Balance. ARQMath balances formula query complexity, whereas prior collections were less balanced (reannotation shows low complexity topics dominate NTCIR-10 and high complexity topics dominate NTCIR-12 WFB [7]).

In ARQMath-3, we introduced a new pilot task, Open Domain Question Answering. The most similar prior work is the SemEval 2019 [8] math question answering task, which used question sets from Math SAT practice exams in three categories: Closed Algebra, Open Algebra and Geometry. A majority of the Math SAT questions were multiple choice, with some having numeric answers.

While we have focused on search and question answering tasks in ARQMath, there are other math information processing tasks that can be considered for future work. For example, extracting definitions for identifiers, math word problem solving, and informal theorem proving are active areas of research; for a survey of recent work in these areas, see Meadows and Freitas [9]. Summarization of mathematical texts, text/formula co-referencing, and the multimodal representation and linking of information in documents are some other examples.

3. The ARQMath Stack Exchange Collection

For ARQMath-3, we reused the collection2 from ARQMath-1 and -2.3 The collection was constructed using the March 1st, 2020 Math Stack Exchange snapshot from the Internet Archive.4 Questions and answers from 2010-2018 are included in the collection. The ARQMath test collection contains roughly 1 million questions and 28 million formulae. Formulae in the collection are annotated using XML elements with the class attribute math-container, and a unique integer identifier given in the id attribute. Formulae are also provided separately in three index files for different formula representations (LaTeX, Presentation MathML, and Content MathML), which we describe in more detail below.

2 By collection we mean the content to be searched. That content together with topics and relevance judgments is a test collection. There is only one ARQMath collection.
3 ARQMath-1 was built for CLEF 2020, ARQMath-2 was built for CLEF 2021. We refer to submitted runs or evaluation results by year, as ARQMath-2020 or ARQMath-2021. This distinction is important because ARQMath-2022 participants also submitted runs for both the ARQMath-1 and -2 test collections.
4 https://archive.org/download/stackexchange

During ARQMath-2021, participants identified three issues with the ARQMath collection that had not been noticed and corrected earlier. In 2022, we have made the following improvements to the collection:

1. Formula Representations. We found and corrected 65,681 formulae with incorrect Symbol Layout Tree (SLT) and Operator Tree (OPT) representations. This resulted from incorrect handling of errors generated by the LaTeXML tool that had been used for generating those representations.
2. Clustering Visually Distinct Formulae. Correcting SLT representations resulted in a need to adjust the clustering of formula instances. Each cluster of visually identical formulae was assigned a unique 'Visual ID'. Clustering had been performed using SLT where possible, and LaTeX otherwise. To correct the clustering, we split any cluster that now included formulae with different representations. In such cases, the partition with the largest number of instances retained its Visual ID; remaining formulae were assigned to another existing Visual ID (with the same SLT or LaTeX) or, if necessary, to a new Visual ID. To break ties, the partition with the largest cumulative ARQMath-2 relevance score retained its Visual ID or, failing that, the partition with the lowest Formula ID did. 29,750 new Visual IDs resulted.

3. XML Errors. In the XML files for posts and comments, the LaTeX for each formula is encoded as an XML element with the class attribute math-container. We found and corrected 108,242 formulae that had not been encoded in that way.

4. Spurious Formula Identifiers. The ARQMath collection includes an index file that includes Formula ID, Visual ID, Post ID, SLT, OPT, and LaTeX for each formula instance. However, there were also formulae in the index file that did not actually occur in any post or comment in the collection. This happened because formula extraction was initially done on the Post History file, which also contained some content that had later been removed. We added a new annotation to the formula index file to mark such cases.

The Math Stack Exchange collection was distributed to participants as XML files on Google Drive.5 To facilitate local processing, the organizers provided Python code on GitHub6 for reading and iterating over the XML data, and for generating the HTML question threads. All of the code to generate the corrected ARQMath collection is available from that same GitHub repository.

5 https://drive.google.com/drive/folders/1ZPKIWDnhMGRaPNVLi1reQxZWTfH2R4u3
6 https://github.com/ARQMath/ARQMathCode

4. Task 1: Answer Retrieval

The goal of Task 1 is to find and rank relevant answers to math questions. Topics are constructed from questions posted to Math Stack Exchange in 2021, and the collection to search is only the answers to earlier questions (from 2010-2018) in the ARQMath collection. System results ('runs') are evaluated using measures that characterize the extent to which answers judged by relevance assessors as having higher relevance come before answers with lower relevance in the system results (e.g., using nDCG′). In this section, we describe the Task 1 search topics, participant runs, baselines, pooling, relevance assessment, and evaluation measures, and we briefly summarize the results.

4.1. Topics

ARQMath-3 Task 1 topics were selected from questions posted to Math Stack Exchange in 2021. There were two strict criteria for selecting candidate topics: (1) any candidate must have at least one formula in the title or the body of the question, and (2) any candidate must have at least one known duplicate question (from 2010 to 2018) in the ARQMath collection. Duplicates have been annotated by Math Stack Exchange moderators as part of their ongoing work, and we chose to limit our candidates to topics for which a known duplicate question existed. We did this to avoid assessing topics with no relevant answers in the assessment pools or even the collection itself. In ARQMath-2 we had included 11 topics for which there were no known duplicates on an experimental basis. Of those 11, 9 had turned out to have no relevant answers found by any participating system or baseline.
We selected 139 candidate topics from among the 3313 questions that satisfied both of our strict criteria by applying additional soft criteria based on the number of terms and formulae in the title and body of the question, the question score that Math Stack Exchange users had assigned to the question, and the number of answers, comments, and views for the question. From those 139, we manually selected 100 topics in a way that balanced three desiderata: (1) a similar topic should not already be present in the ARQMath-1 or ARQMath-2 test collections, (2) we expected that our assessors would have (or be able to easily acquire) the expertise to judge relevance to the topic, and (3) the set of topics maximized diversity across four dimensions (question type, difficulty, dependence, and complexity).

In prior years, we had manually categorized topic type as computation, concept or proof, and we did so again for ARQMath-3. A disproportionately large fraction of Math Stack Exchange questions ask for proofs, so we sought to stratify the ARQMath-3 topics in a way that was somewhat better balanced. Of the 100 ARQMath-3 topics, 49 are categorized as proof, 28 as computation, and 23 as concept. Question difficulty also benefited from restratification. Our insistence that topics have at least one duplicate question in the collection injects a bias in favor of easier questions, and such a bias is indeed evident in the ARQMath-1 and ARQMath-2 test collections. We made an effort to better balance (manually estimated) topic difficulty for the ARQMath-3 test collection, ultimately resulting in 24 topics categorized as hard, 55 as medium, and 21 as easy. We also paid attention to the (manually estimated) dependency of topics on text, formulae, or both, but we did not restratify on that factor. Of the 100 ARQMath-3 topics, 12 are categorized as dependent on text, 28 on formulae, and 60 on both.

Task 1: Question Answering

    <Topics>
      ...
      <Topic number="A.384">
        <Title>What does this bracket notation mean?</Title>
        <Question>I am currently taking MIT6.006 and I came across this problem on the problem set. Despite the fact I have learned Discrete Mathematics before, I have never seen such notation before, and I would like to know what it means and how it works, Thank you: <span class="math-container" id="q_898">$$f_3(n) = \binom n2$$</span></Question>
        <Tags>discrete-mathematics, algorithms</Tags>
      </Topic>
      ...
    </Topics>

Task 2: Formula Retrieval

    <Topics>
      ...
      <Topic number="B.384">
        <Formula_Id>q_898</Formula_Id>
        <Latex>f_3(n) = \binom n2</Latex>
        <Title>What does this bracket notation mean?</Title>
        <Question>I am currently taking MIT6.006 and I came across this problem on the problem set. Despite the fact I have learned Discrete Mathematics before, I have never seen such notation before, and I would like to know what it means and how it works, Thank you: <span class="math-container" id="q_898">$$f_3(n) = \binom n2$$</span></Question>
        <Tags>discrete-mathematics, algorithms</Tags>
      </Topic>
      ...
    </Topics>

Figure 1: Example XML Topic Files. Formula queries in Task 2 are taken from questions for Task 1. Here, ARQMath-3 formula topic B.384 is a copy of ARQMath-3 question topic A.384 with two additional fields for the query formula: (1) identifier and (2) LaTeX.
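The organizers' Python utilities on GitHub (noted below) already handle topic loading; purely as a self-contained illustration of the layout in Figure 1, the following minimal sketch parses a topic file with the standard library. The file names are placeholders, and it assumes the topic files parse as well-formed XML.

    import xml.etree.ElementTree as ET

    def read_topics(path, task2=False):
        # Parse a topic file laid out as in Figure 1 (well-formed XML assumed).
        for topic in ET.parse(path).getroot().findall("Topic"):
            record = {
                "topic_id": topic.attrib.get("number"),      # e.g. "A.384" or "B.384"
                "title": "".join(topic.find("Title").itertext()),
                # itertext() flattens the question body, keeping the LaTeX inside
                # math-container spans while dropping the HTML tags themselves.
                "question": "".join(topic.find("Question").itertext()),
                "tags": topic.findtext("Tags"),
            }
            if task2:                                         # extra Task 2 fields
                record["formula_id"] = topic.findtext("Formula_Id")
                record["latex"] = topic.findtext("Latex")
            yield record

    # Hypothetical file name; substitute the distributed topic file.
    for t in read_topics("Topics_Task2.xml", task2=True):
        print(t["topic_id"], t["latex"])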
New this year, we also paid attention to whether a topic actually asks several questions rather than just one. For these multi-part topics, our relevance criteria require that a highly relevant answer provide relevant information for all parts of the question. Among ARQMath-3 topics, 14 are categorized as multi-part questions.

The topics were published in the XML file format illustrated in Figure 1. Each topic has a unique Topic ID, a Title, a Question (which is the body of the question post), and Tags provided by the asker of the question on the Math Stack Exchange. Notably, links to duplicate or related questions are not included. To facilitate system development, we provided Python code that participants could use to load the topics. As in the collection, the formulae in the topic file are placed in XML elements, with each formula instance represented by a unique identifier and its LaTeX representation. Similar to the collection, there are three Tab Separated Value (TSV) files, for the LaTeX, OPT and SLT representations of the formulae, in the same format as the collection's TSV files. The Topic IDs in ARQMath-3 start from 301 and continue to 400. In ARQMath-1, Topic IDs were numbered from 1 to 200, and in ARQMath-2, from 201 to 300.

Table 1
ARQMath-3: Submitted Runs. Baselines for Task 1 (5), Task 2 (1) and Task 3 (1) were generated by the organizers. Primary and alternate runs were pooled to different depths, as described in Section 4.4.

                                 Automatic              Manual
                             Primary  Alternate    Primary  Alternate
    Task 1: Answer Retrieval
    Baselines                    2        3
    Approach0                                          1        4
    DPRL                         1        4
    MathDowsers                  1        2
    MIRMU                        1        4
    MSM                          1        4
    SCM                          1        4
    TU_DBS                       1        4
    Totals (38 runs)             8       25            1        4

    Task 2: Formula Retrieval
    Baseline                     1
    Approach0                                          1        4
    DPRL                         1        4
    MathDowsers                  1        2
    JU_NITS                      1        2
    XY_PHOC_DPRL                 1        2
    Totals (20 runs)             5       10            1        4

    Task 3: Open Domain QA
    Baseline                     1
    Approach0                                          1        4
    DPRL                         1        3
    TU_DBS                       1        3
    Totals (14 runs)             3        6            1        4

4.2. Participant Runs

ARQMath participants submitted their runs on Google Drive. As in previous years, we expect all runs to be publicly available.7 A total of 33 runs were received from 7 teams. Of these, 28 runs were declared to be automatic, with no human intervention at any stage of generating the ranked list for each query. The remaining 5 runs were declared to be manual, meaning that there was some type of human involvement in at least one stage of retrieving answers. Manual runs were invited in ARQMath to increase the quality and diversity of the pool of documents that are judged for relevance, but it is important to note that they might not be fairly compared to automatic runs. The teams and submissions are shown in Table 1. For the details of each run, please see the participant papers in the working notes.

7 https://drive.google.com/drive/u/1/folders/1l1c2O06gfCk2jWOixgBXI9hAlATybxKv

4.3. Baseline Runs

For Task 1, five baseline systems were provided by the organizers.8 This year, the organizers included a new baseline system using PyTerrier [10] for the TF-IDF model. The other baselines were also run for ARQMath 2020 and 2021. Here is a description of our baseline runs.

1. TF-IDF. We provided two TF-IDF baselines. The first uses Terrier [11] with default parameters and raw LaTeX strings, as in prior years of the lab. One problem with this baseline is that Terrier removes some LaTeX symbols during tokenization. The second uses PyTerrier [10], with symbols in LaTeX strings first mapped to English words to avoid tokenization problems.
2. Tangent-S. This baseline is an isolated formula search engine that uses both SLT and OPT representations [12]. The target formula was selected from the question title if at least one existed, otherwise from the question body. If there were multiple formulae in the field, a formula with the largest number of symbols (nodes) in its SLT representation was chosen; if more than one had the largest number of symbols, we chose randomly between them.

3. TF-IDF + Tangent-S. Averaging normalized similarity scores from the TF-IDF (only from PyTerrier) and Tangent-S baselines. The relevance scores from both systems were normalized in [0,1] using min-max normalization, and then combined using an unweighted average.

4. Linked Math Stack Exchange Posts. Using duplicate post links from 2021 in Math Stack Exchange, this oracle system returns a list of answers from posts in the ARQMath collection that had been given to questions marked in Math Stack Exchange as duplicates to ARQMath-3 topics. These answers are ranked by descending order of their vote scores. Note that the links to duplicate questions were not available to the participants.

8 Source code and instructions for running the baselines are available from GitLab (Tangent-S: https://gitlab.com/dprl/tangent-s, PyTerrier: https://gitlab.com/dprl/pt-arqmath/) and Google Drive (Terrier: https://drive.google.com/drive/u/0/folders/1YQsFSNoPAFHefweaN01Sy2ryJjb7XnKF)

4.4. Relevance Assessment

Relevance judgments for Tasks 1 and 3 were performed together, with the results for the two tasks intermixed in the judgment pools.

Pooling. For each topic, participants were asked to rank up to 1,000 answer posts. We created pools for relevance judgments by taking the top-k retrieved answer posts from every participating system or baseline in Tasks 1 or 3. For Task 1 primary runs, the top 45 answer posts were included; for alternate runs the top 20 were included. These pooling depths were chosen based on assessment capacity, with the goal of identifying as many relevant answer posts as possible. Two Task 1 baseline runs, PyTerrier TF-IDF + Tangent-S and Linked Math Stack Exchange Posts, were pooled as primary runs (i.e., to depth 45); other baselines were pooled as alternate runs (i.e., to depth 20). All Task 3 run results (each of which is a single answer; see Section 5.6) were also included in the pools. After merging these top-ranked results, duplicate posts were deleted and the resulting pools were sorted randomly for display to assessors. On average, the judgment pools for Tasks 1 and 3 contain 464 answer posts per topic.
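As a minimal sketch of this pooling step (our own illustration, not the organizers' actual scripts), the following Python assumes each run has been reduced to per-topic ranked lists of answer post ids, with run types known in advance; the depths follow the 45/20 values given above.

    import random

    POOL_DEPTH = {"primary": 45, "alternate": 20}

    def build_pool(runs, topic_id, seed=0):
        """Merge top-k results per run for one topic, dedupe, and shuffle.

        `runs` is a list of (run_type, ranking) pairs, where run_type is
        "primary" or "alternate" and ranking maps a topic id to a ranked
        list of answer post ids (a simplification of the real run files).
        """
        pool, seen = [], set()
        for run_type, ranking in runs:
            for post_id in ranking.get(topic_id, [])[: POOL_DEPTH[run_type]]:
                if post_id not in seen:          # drop duplicate posts
                    seen.add(post_id)
                    pool.append(post_id)
        random.Random(seed).shuffle(pool)        # random display order for assessors
        return pool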
Table 2
Relevance Assessment Criteria for Tasks 1 and 2.

    Score  Rating        Definition
    Task 1: Answer Retrieval
    3      High          Sufficient to answer the complete question on its own
    2      Medium        Provides some path towards the solution. This path might come from clarifying the question, or identifying steps towards a solution
    1      Low           Provides information that could be useful for finding or interpreting an answer, or interpreting the question
    0      Not Relevant  Provides no information pertinent to the question or its answers. A post that restates the question without providing any new information is considered non-relevant
    Task 2: Formula Retrieval
    3      High          Just as good as finding an exact match to the query formula would be
    2      Medium        Useful but not as good as the original formula would be
    1      Low           There is some chance of finding something useful
    0      Not Relevant  Not expected to be useful

Relevance definition. The relevance definitions were the same as those defined for ARQMath-1 and -2. The assessors were asked to consider an expert (modeling a math professor) judging the relevance of each answer to the topics. This was intended to avoid the ambiguity that might result from guessing the level of math knowledge of the actual posters of the original Math Stack Exchange question. The definitions of the four levels of relevance are shown in Table 2. In judging relevance, ARQMath assessors were asked not to consider any link outside the ARQMath collection. For example, if there is a link to a Wikipedia page which provides relevant information, the information in the Wikipedia page should not be considered to be a part of the answer.

4.5. Assessor Selection

Paid ARQMath-3 assessors were recruited over email at the Rochester Institute of Technology. 44 students expressed interest, 11 were invited to perform 3 sample assessment tasks, and 9 students specializing in mathematics or computer science were then selected, based on an evaluation of their judgments by an expert mathematician. Of those, 6 were assigned to Tasks 1 and 3; the others performed assessment for Task 2.

Assessment tool. As with ARQMath-1 and ARQMath-2, we used Turkle, a system similar to Amazon Mechanical Turk. As shown in Figure 2, there are two panes, one having the question topic (left pane) and the other having a candidate answer from the judgment pool (right pane). For each topic, the title and question body are provided for the assessors. To familiarize themselves with the topic question, assessors can click on the Thread link for the question, which shows the question and the answers given to it (i.e., answers posted in 2021, which were not available to task participants), along with other information such as tags and comments. Another Thread link is also available for the answer post being assessed. By clicking on that link, the assessor can see a copy of the original question thread on Math Stack Exchange in which the candidate answer was given, as recorded in the March 2020 snapshot used for the ARQMath test collection. Note that these Thread links are provided to help the assessors gain just-in-time knowledge that they might need for unfamiliar concepts, but the content of the threads is neither a part of the topic nor of the answer being assessed, and thus it should have no effect on their judgement beyond serving as reference information.

In the right pane, below the candidate answer, assessors can indicate the relevance degree. In addition to the four relevance degrees, there are two additional choices: 'System failure' to indicate system issues such as unintelligible rendering of formulae, and 'Do not know', which can be used if, after possibly consulting external sources such as Wikipedia or viewing the Threads, the assessor is simply not able to decide the relevance degree. We asked the assessors to leave a comment in the event of a 'System failure' or 'Do not know' selection.

Assessor Training. All training was done remotely, over Zoom, in four sessions, with some individual assessment practice between each Zoom session.
As in ARQMath-1 and -2, in the first session the task and relevance criteria were explained. A few examples were then shown to the assessors and they were asked for their opinions on relevance, which were then discussed with an expert assessor (a math professor). Then, three rounds of training were conducted, with each round consisting of assessment of small judgment pools for four sample topics from ARQMath-2. For each topic, 5-6 answers with different ground truth relevance degrees (from the ARQMath-2 qrels) were chosen. After each round, we held a Zoom session to discuss their relevance judgements, with the specific goal of clarifying their understanding of the relevance criteria. The assessors discussed the reasoning for their choices, with organizers (always including the math professor) sharing their own judgments and their supporting reasoning. The primary goal of training was to help assessors make self-consistent annotations, as topic interpretations will vary across individuals. Some of the topics involve issues that are not typically covered in regular undergraduate courses, and some such cases required the assessors to get a basic understanding of those issues before they could do the assessment. The assessors found the question Threads made available in the Turkle interface helpful in this regard (see Figure 2).

Figure 2: Turkle Assessment Interface. Shown are hits for Formula Retrieval (Task 2). In the left pane, the formula query is highlighted. In the right pane, two answer posts containing the same retrieved formula are shown. For Task 1, the same interface was used, but without formula highlighting, and presenting only one answer post at a time.

Figure 3: Inter-annotator agreement for 6 assessors during training sessions for Task 1 (mean Cohen's kappa), with four-way classification in gray, and two-way classification (H+M binarized) in black. Left-to-right: agreements for rounds 1, 2, and 3.
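A minimal sketch of how such pairwise agreement can be computed is shown below; it assumes scikit-learn, that each assessor's labels are aligned lists over the same items, and that H+M binarization maps grades 2-3 to relevant. The function name is ours, not part of the lab's tooling.

    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    def mean_pairwise_kappa(labels_by_assessor, binarize=False):
        """labels_by_assessor: dict assessor_id -> list of 0-3 labels for the
        same items. Returns each assessor's mean kappa against the others."""
        def prep(labels):
            # H+M binarization: high (3) and medium (2) count as relevant.
            return [int(x >= 2) for x in labels] if binarize else labels
        kappas = {a: [] for a in labels_by_assessor}
        for a, b in combinations(labels_by_assessor, 2):
            k = cohen_kappa_score(prep(labels_by_assessor[a]),
                                  prep(labels_by_assessor[b]))
            kappas[a].append(k)
            kappas[b].append(k)
        return {a: sum(v) / len(v) for a, v in kappas.items()}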
Figure 3 shows average Cohen's kappa coefficients for agreement between each assessor and all others during training. Collapsing relevance to binary by considering only high and medium as relevant (henceforth "H+M binarization") yielded better agreement among the assessors.9 The agreement values in the second round are unusually low, but the third round agreement is in line with what we had seen at the end of training in prior years.

9 H+M binarization corresponds to the definition of relevance usually used in the Text Retrieval Conference (TREC).

Assessment Results. Among 80 topics assessed, two (A.335 and A.367) had only one answer assessed as high or medium; these two topics were removed from the collection, as score quantization for MAP′ can be quite substantial when only a single relevant document contributes to the computation. For the remaining 78 topics, an average of 446.8 answers were assessed, with an average assessment time of 44.1 seconds per answer post. The average number of answers labeled with any degree of relevance (high, medium, or low; henceforth "H+M+L binarization") over those 78 topics was 100.8 per question (twice as high as that seen in ARQMath-2), with the highest number being 295 (for topic A.317) and the lowest being 11 (for topic A.385).

Post Assessment. After assessments of 80 topics for Task 1 were done, each of the assessors for this task assessed one topic that had been assessed by another assessor.10 Using Cohen's kappa coefficient, a kappa of 0.24 was achieved on the four-way assessment task, and with H+M binarization, the average kappa value was 0.25.

10 One assessor (with id 8) was not able to continue assessment.

4.6. Evaluation Measures

While this is the third year of the ARQMath lab, with several relatively mature systems participating, it is still possible that many relevant answers may remain undiscovered. To support fair comparisons with future systems that may find different documents, we have adopted evaluation measures that ignore unjudged answers, rather than adopting the more traditional convention of treating unjudged answers as not relevant. Specifically, the primary evaluation measure for Task 1 is the nDCG′ (read as "nDCG-prime") introduced by Sakai and Kando [13]. nDCG′ is simply the nDCG@1000 that would be computed after removing unjudged documents from the ranked list. This measure has shown better discriminative power and somewhat better system ranking stability (with judgement ablation) compared to the bpref [14] measure that had been adopted for experiments using the NTCIR Math IR collections for similar reasons [12, 15]. Moreover, nDCG′ yields a single-valued measure with graded relevance, whereas bpref, Precision@k, and Mean Average Precision (MAP) all require binarized relevance judgments. As secondary measures, we compute Mean Average Precision (MAP@1000) with unjudged posts removed (MAP′) and Precision at 10 with unjudged posts removed (P′@10). For MAP′ and P′@10 we used H+M binarization. Note that answers assessed as "System failure" or "Do not know" were not considered for evaluation, and thus can be viewed as answers that are not assessed.
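The official scoring uses trec_eval after filtering out unjudged documents; purely as an illustration of the "prime" convention, a simplified per-topic nDCG′ could be computed as follows, with the graded relevance value used directly as the gain. The qrels and run formats here are our own simplification.

    import math

    def ndcg_prime(ranked_posts, qrels, depth=1000):
        """nDCG' for one topic: drop unjudged posts, then compute nDCG@depth.
        qrels maps post_id -> graded relevance (0-3); unjudged posts are absent."""
        judged = [p for p in ranked_posts if p in qrels][:depth]
        dcg = sum(qrels[p] / math.log2(i + 2) for i, p in enumerate(judged))
        ideal = sorted(qrels.values(), reverse=True)[:depth]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0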
4.7. Results

Progress Testing. In addition to their submissions on the ARQMath-3 topics, we asked each participating team to also submit results from exactly the same systems on ARQMath-1 and ARQMath-2 topics for progress testing. Note, however, that ARQMath-3 systems could be trained on topics from ARQMath-1 and -2; together, there were 158 topics (77 from ARQMath-1, 81 from ARQMath-2) that could be used for training. The progress test results thus need to be interpreted with this train-on-test potential in mind. Progress test results are provided in Table 3.

ARQMath-3 Results. Table 3 also shows results for ARQMath-3 Task 1. This table shows baselines first, followed by teams, and within teams their systems, ranked by nDCG′. As seen in the table, the manual primary run of the approach0 team achieved the best results, with 0.508 nDCG′. Among automatic runs, the highest nDCG′, 0.504, was achieved by the MSM team. Note that the highest possible nDCG′ and MAP′ values are 1.0, but because fewer than 10 assessed relevant answers (with H+M binarization) were found in the pools for some topics, the highest possible P′@10 value in ARQMath-3 Task 1 is 0.95.

5. Task 2: Formula Search

The goal of the formula search task is to find a ranked list of formula instances from both questions and answers in the collection that are relevant to a formula query. The formula queries are selected from the questions in Task 1. One formula was selected from each Task 1 question topic to produce Task 2 topics. For cases in which suitable formulae were present in both the title and the body of the Task 1 question, we selected the Task 2 formula query from the title. For each query, a ranked list of 1,000 formula instances was returned by their identifiers in the XML elements and the accompanying TSV LaTeX formula index file, along with their associated post identifiers.

While in Task 1 the goal was to find relevant answers for the questions, in Task 2 the goal is to find relevant formulae that are associated with information that can help to satisfy an information need. The post in which a formula is found need not be relevant to the question post in which the formula query originally appeared for a formula to be relevant to a formula query, but those post contexts inform the interpretation of each formula (e.g., by defining operations and identifying variable types). A second difference is that the retrieved formula instances in Task 2 can be found in either question posts or answer posts, whereas in Task 1, only answer posts were retrieved. Finally, in Task 2, we distinguish visually distinct formulae from instances of those formulae, and systems are evaluated by the ranking of the visually distinct formulae they return. The same formula can appear in different posts, and we call these individual occurrences formula instances. A visually distinct formula is a formula associated with a set of instances that are visually identical when viewed in isolation. For example, $x^2$ is a formula, $x \cdot x$ is a different (i.e., visually distinct) formula, and each time $x^2$ appears, it is an instance of the visually distinct formula $x^2$. Although systems in Task 2 rank formula instances in order to support the relevance judgment process, the evaluation measure for Task 2 is based on the ranking of visually distinct formulae. As shown by Mansouri et al. (2021) [7], using visually distinct formulae for evaluation can result in a different preference order between systems than would evaluation on formula instances.
5.1. Topics

Each formula query was selected from a Task 1 topic. Similarly to Task 1, Task 2 topics were provided in XML in the format shown in Figure 1. Differences are:

1. Topic Id. Task 2 topic ids are in the form "B.x" where x is the topic number. There is a correspondence between topic ids in Tasks 1 and 2. For instance, topic id "B.384" indicates the formula is selected from topic "A.384" in Task 1, and both topics include the same question post (see Figure 1).
2. Formula Id. This added field specifies the unique identifier for the query formula instance. There may be other formulae in the Title or Body of the same question post, but the formula query is only the formula instance specified by this Formula_Id.
3. LaTeX. This added field is the LaTeX representation of the query formula instance, as found in the question post.

As the query formulae are selected from Task 1 questions, the same LaTeX, SLT and OPT TSV files that were provided for the Task 1 topics can be used when SLT or OPT representations for a query formula are needed.

Formulae for Task 2 were manually selected using a heuristic approach to stratified sampling over two criteria: complexity and elements. Formula complexity was labeled low, medium or high by the third author. For example, $[x, y] = x$ is low complexity, $\int \frac{1}{(x^2+1)^n}\,dx$ is medium complexity, and $\frac{\sqrt{1-p^2}}{2\pi(1-2p\sin(\phi)\cos(\phi))}$ is high complexity. These annotations, available in an auxiliary file, can be useful as a basis for fine-grained result analysis, since formula queries of differing complexity may result in different preference orders between systems [16]. For elements, our intuition was to make sure that we have formula queries that contain different elements and math phenomena such as integrals, limits, and matrices.

5.2. Participant Runs

A total of 19 runs were received for Task 2 from a total of five teams, as shown in Table 1. Among the participating runs, 5 were annotated as manual and the others were automatic. Each run retrieved up to 1,000 formula instances for each formula query, ranked by relevance to that query. For each retrieved formula instance, participating teams provided the formula_id and the associated post_id for that formula. Please see the participant papers in the working notes for descriptions of the systems that generated these runs.

5.3. Baseline Run: Tangent-S

Tangent-S [12] is the baseline system for ARQMath-3 Task 2. That system accepts a formula query without using any associated text from its associated question post. Since a single formula is specified for each Task 2 query, the formula selection step in the Task 1 Tangent-S baseline is not needed for Task 2. Timing was similar to that of Tangent-S in ARQMath-1 and -2 (i.e., with an average retrieval time of around six seconds per query).

5.4. Assessment

Pooling. For each topic, participants were asked to rank up to 1,000 formula instances. However, the pooling was done using visually distinct formulae. The visual ids, which were provided beforehand for the participants, were used for clustering formula instances. Pooling was done by going down each ranked list until $k$ visually distinct formulae were found. For primary runs (and the baseline system), the first 25 visually distinct formulae were pooled; for alternate runs, the first 15 visually distinct formulae were pooled.
The visual ids used for clustering retrieval results were determined by the SLT representation when possible, and the LaTeX representation otherwise. When SLT was available, we used Tangent-S [12] to create a string representation using a depth-first traversal of the SLT, with each SLT node and edge generating a single item in the SLT string. Formula instances with identical SLT strings were then considered to be the same formula. For formula instances with no Tangent-S SLT string available, we removed the white space from their LaTeX strings and grouped formula instances with identical LaTeX strings. This process is simple and appears to be reasonably robust, but it is possible that some visually identical formula instances were not captured due to LaTeXML conversion failures, or where different LaTeX strings produce visually identical formulae (e.g., if subscripts and superscripts appear in a different order in LaTeX).

Task 2 assessment was done on formula instances. For each visually distinct formula, at most five instances were selected for assessment. As in ARQMath-2 Task 2, formula instances to be assessed were chosen in a way that prefers highly-ranked instances and that prefers instances returned in multiple runs. This was done using a simple voting protocol, where each instance votes by the sum of its reciprocal ranks within each run, breaking ties randomly. For each query, on average there were 154.35 visually distinct formulae to be assessed, and only 6% of visually distinct formulae had more than 5 instances.
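A minimal sketch of that voting step is given below; it is our own illustration with hypothetical data structures rather than the organizers' code. Each run is a ranked list of formula instance ids, and visual_id_of maps an instance to its visually distinct formula.

    import random
    from collections import defaultdict

    def select_instances(runs, visual_id_of, per_formula=5, seed=0):
        """Score each retrieved instance by the sum of its reciprocal ranks
        across runs, then keep the top `per_formula` instances for each
        visually distinct formula (ties broken randomly)."""
        votes = defaultdict(float)
        for ranking in runs:                       # ranking: list of instance ids
            for rank, inst in enumerate(ranking, start=1):
                votes[inst] += 1.0 / rank
        rng = random.Random(seed)
        by_formula = defaultdict(list)
        for inst, score in votes.items():
            by_formula[visual_id_of[inst]].append((score, rng.random(), inst))
        return {
            vid: [inst for _, _, inst in sorted(cands, reverse=True)[:per_formula]]
            for vid, cands in by_formula.items()
        }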
Relevance definition. To distinguish between different relevance degrees, we relied on the definitions in Table 2. The usefulness is defined as the likelihood of the candidate formula being associated with information (text) that can help a searcher to accomplish their task. In our case, the task is answering the question from which a query formula is taken. To judge the relevance of a candidate formula instance, the assessor was given the candidate formula (highlighted) along with the (question or answer) post in which it had appeared. They were then asked to decide on relevance by considering the definitions provided. For each visually distinct formula, up to 5 instances were shown to assessors and they would assess the instances individually. For assessment, they could look at the formula's associated post in an effort to understand factors such as variable types, the interpretation of specific operators, and the area of mathematics it concerns. As in Task 1, assessors could also follow Thread links to increase their knowledge by examining the thread in which the query formula had appeared, or in which a candidate formula had appeared.

Assessment tool. As in Task 1, we used Turkle for the Task 2 assessment process, as illustrated in Figure 2. There are two panes, the left pane showing the formula query ($\|A\|_2 = \sqrt{\rho(A^T A)}$ in this case) highlighted in yellow inside its question post, and the right pane showing the (in this case, two) candidate formula instances of a single visually distinct formula. For each topic, the title and question body are provided for the assessors. Thread links can be used by the assessors just for learning more about mathematical concepts in the posts. For each formula instance, the assessment is done separately. As in Task 1, the assessors can choose between different relevance degrees, they can choose 'System failure' for issues with Turkle, or they can choose 'Do not know' if they are not able to decide on a relevance degree.

Assessor Training. Three paid undergraduate and graduate mathematics and computer science students from RIT were selected to perform relevance judgments. As in Task 1, all training sessions were done remotely, over Zoom. There were four Task 2 training sessions. In the first meeting, the task and relevance criteria were explained to assessors and then a few examples were shown, followed by discussion about relevance level choices. In each subsequent training round, assessors were asked to first assess four ARQMath-2 Task 2 topics, each with 5-6 visually distinct formula candidates with a variety of relevance degrees. Organizers then met with the assessors to discuss their choices and clarify relevance criteria.

Figure 4: Annotator agreement for 3 assessors during training for Task 2 (mean Cohen's kappa). Four-way classification is shown in gray, and two-way (H+M binarized) classification in black. Left-to-right: agreements for rounds 1, 2, and 3.

Figure 4 shows the average agreement (kappa) of each assessor with the others during training. As can be seen, agreement had improved considerably by round three, reaching levels comparable to that seen in prior years of ARQMath.

Assessment Results. Among 76 assessed topics, all have at least two relevant visually distinct formulae with H+M binarization, so all 76 topics were retained in the ARQMath-3 Task 2 test collection. An average of 152.3 visually distinct formulae were assessed per topic, with an average assessment time of 26.6 seconds per formula instance. The average number of visually distinct formulae with H+M+L binarization was 63.2 per query, with the highest number being 143 (topic B.305) and the lowest being 2 (topic B.333).

Post Assessment. After Task 2 assessments were done, each of the three assessors assessed two topics that had each been assessed by the other two assessors. Using Cohen's kappa coefficient, a kappa of 0.44 was achieved on the four-way assessment task (higher than ARQMath-1 and -2), and with H+M binarization the average kappa value was 0.51.

5.5. Evaluation Measures

As in Task 1, the primary evaluation measure for Task 2 is nDCG′, with MAP′ and P′@10 also reported. Participants submitted ranked lists of formula instances used for pooling, but with evaluation measures computed over visually distinct formulae. The ARQMath-2 Task 2 evaluation script replaces each formula instance with its associated visually distinct formula, and then deduplicates from the top of the list downward, producing a ranked list of visually distinct formulae, from which our "prime" evaluation measures are then computed using trec_eval, after removing unjudged visually distinct formulae. For visually distinct formulae with multiple instances, the maximum relevance score of any judged instance was used as the visually distinct formula's relevance score. This reflects a goal of having at least one instance that provides useful information. Similar to Task 1, formulas assessed as "System failure" or "Do not know" were treated as not being assessed.
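A compact sketch of that instance-to-formula collapse is shown below (our illustration; the official pipeline performs this step before handing a deduplicated list to trec_eval). The resulting ranking and qrels could then be scored with a prime measure such as the simplified nDCG′ sketch shown earlier.

    def collapse_to_visually_distinct(ranked_instances, visual_id_of, inst_qrels):
        """Map each instance to its visually distinct formula, dedupe from the
        top of the list down, and give each formula the maximum relevance of
        its judged instances (unjudged formulae are simply absent from qrels)."""
        ranking, seen = [], set()
        for inst in ranked_instances:
            vid = visual_id_of[inst]
            if vid not in seen:
                seen.add(vid)
                ranking.append(vid)
        qrels = {}
        for inst, rel in inst_qrels.items():
            vid = visual_id_of[inst]
            qrels[vid] = max(qrels.get(vid, 0), rel)
        return ranking, qrels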
5.6. Results

Progress Testing. As with Task 1, we asked Task 2 teams to run their ARQMath-3 systems on ARQMath-1 and -2 topics for progress testing (see Table 4). Some progress test results may represent a train-on-test condition: there were 70 topics from ARQMath-2 and 74 topics from ARQMath-1 available for training. Note also that while the relevance definition stayed the same for ARQMath-1, -2, and -3, the assessors were instructed differently in ARQMath-1 on how to handle the specific case in which two formulae were visually identical. In ARQMath-1, assessors were told such cases are always highly relevant, whereas ARQMath-2 and ARQMath-3 assessors were told that from context they might recognize cases in which a visually identical formula would be less relevant, or not relevant at all (e.g., where identical notation is used with very different meaning). Assessor instruction did not change between ARQMath-2 and -3.

ARQMath-3 Results. Table 4 also shows results for ARQMath-3 Task 2. In that table, the baseline is shown first, followed by teams and then their systems ranked by nDCG′ on ARQMath-3 Task 2 topics. As shown, the highest nDCG′ was achieved by the manual primary run from the approach0 team, with an nDCG′ value of 0.720. Among automatic runs, the highest nDCG′ value was achieved by the DPRL primary run, with an nDCG′ of 0.694. Note that 1.0 is a possible score for nDCG′ and MAP′, but that the highest possible P′@10 value is 0.93 because (with H+M binarization) 10 visually distinct formulae were not found in the pools for some topics.

6. Task 3: Open Domain Question Answering

The new pilot task developed for ARQMath-3 (Task 3) is Open Domain Question Answering. Unlike Task 1, system answers are not limited to content from any specific source. Rather, answers can be extracted from anywhere, automatically generated, or even written by a person. For example, suppose that we ask a Task 3 system the question "What does it mean for a matrix to be Hermitian?" An extractive system might first retrieve an article about Hermitian matrices from Wikipedia and then extract the following excerpt as the answer: "In mathematics, a Hermitian matrix (or self-adjoint matrix) is a complex square matrix that is equal to its own conjugate transpose." By contrast, a generative system such as GPT-3 can directly construct an answer such as: "A matrix is Hermitian if it is equal to its transpose conjugate." For a survey of open-domain question answering, see Zhu et al. [17].

In this section, we describe the Task 3 search topics, runs from participant and baseline systems, assessment and evaluation procedures, and results.

6.1. Topics and Participant Runs

The topics for Task 3 are the Task 1 topics, with the same content provided (title, question body, and tags). A total of 13 runs were received from 3 teams. Each run consists of a single result for each topic. 9 runs from the TU_DBS and DPRL teams were declared to be automatic and 5 runs from the approach0 team were declared as manual. The 4 automatic runs from the TU_DBS team used generative systems, whereas the remaining 9 runs from the DPRL and approach0 teams used extractive systems. The teams and their submissions are listed in Table 1.

6.2. Baseline Run: GPT-3

The ARQMath organizers provided one baseline run for this task using GPT-3. This baseline system uses the text-davinci-002 model of GPT-3 [18] from OpenAI.
First, the system prompts the model with the text "Q: " followed by the text and the LaTeX formulae of the question, two newline characters, and the text "A:" as follows:

    Q: What does it mean for a matrix to be Hermitian?

    A:

Then, GPT-3 completes the text and produces an answer of up to 570 tokens:

    Q: What does it mean for a matrix to be Hermitian?

    A: A matrix is Hermitian if it is equal to its transpose conjugate.

If the answer is longer than the maximum of 1,200 Unicode characters, the system retries until the model has produced a sufficiently short answer. To provide control over how creative an answer is, GPT-3 resmooths the output layer $L$ using the temperature $\tau$ as follows: softmax$(L/\tau)$ [19]. A temperature close to zero ensures deterministic outputs on repeated prompts, whereas higher temperatures allow the model's decoder to consider many different answers. Our system uses the default temperature $\tau = 0.7$.
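As an illustration only (not the organizers' code), the prompt-and-retry loop of such a baseline could be written as follows with the legacy (pre-1.0) OpenAI Python client that exposed text-davinci-002; the helper name and API-key handling are placeholders.

    import openai  # legacy (pre-1.0) client interface

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def gpt3_answer(question_text, max_chars=1200, temperature=0.7):
        """Prompt text-davinci-002 as described above and retry until the
        completion fits within the Task 3 answer-length limit."""
        prompt = "Q: " + question_text + "\n\nA:"
        while True:
            response = openai.Completion.create(
                model="text-davinci-002",
                prompt=prompt,
                max_tokens=570,
                temperature=temperature,
            )
            answer = response["choices"][0]["text"].strip()
            if len(answer) <= max_chars:
                return answer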
If we are to avoid the need to keep assessors around forever, we need automatic evaluation measures that can be used to compare participating Task 3 systems with future Task 3 systems. With that goal in mind, we also report Task 3 results using the following evaluation measures: 1. Lexical Overlap (LO) Following SQuAD and CoQA [20, Section 6.1], we represent answers as a bag of tokens, where tokens are produced by the MathBERTa12 tokenizer. 11 For ranked lists of depth 1 there is no discounting or accumulation, and in ARQMath the relevance value is used directly as the gain. 12 https://huggingface.co/witiko/mathberta For every topic, we compute the token 𝐹1 score between the system’s answer and each known relevant Task 3 answer (using H+M binarization). The score for a topic is the maximum across these 𝐹1 scores. The final score is the average across all topics of those per-topic maximum 𝐹1 scores. 2. Contextual Similarity (CS) Although lexical overlap can account for answers with high surface similarity, it cannot recognize answers that use different tokens with similar meaning. For context similarity, we use BERTScore [21] with the MathBERTa language model. As with our computation of lexical overlap, for BERTScore we also compute a token 𝐹1 score, but instead of exact matches, we match tokens with the most similar contextual embeddings and interpret their similarity as fractional membership. For every topic, we compute 𝐹1 score between the system’s answer and each known relevant answer (with H+M binarization). The score for a topic is the maximum across these 𝐹1 scores. The final score is the average across all topics of those per-topic maximum 𝐹1 scores. When computing the automatic measures for a participating system, we exclude relevant answers uniquely contributed to the pools by systems from the same team. This ablation avoids the perfect overlap scores that systems contributing to the pools would otherwise get from matching their own results. 6.4. Results Task 3 runs were assessed together with Task 1 runs, using the same relevance definitions, although after that assessment was complete, we also did some additional annotation that was specific to Task 3. Here we present results for the baseline and submitted runs using manual and automatic measures, along with additional analysis that we performed using the additional annotation. 6.4.1. Manual Evaluation Measures Table 5 shows ARQMath-3 results for Task 3 systems. This table shows baselines first, followed by teams ordered by their best Average Recall (AR), and within teams their runs are ordered by AR. As seen in the table, the automatic generative baseline run using GPT-3 achieved the best results, with 1.346 AR. Note that uniquely among ARQMath evaluation measures, AR is not bounded between 0 and 1; rather, it is bounded between 0 and 3.13 Among manual extractive non-baseline runs, the highest AR was achieved by a run from the approach0 team, with 1.282 AR. Among automatic extractive non-baseline runs, the highest AR was achieved by a run from the DPRL team, with 0.462 AR. Among automatic generative non-baseline runs, the highest AR was achieved by the TU_DBS team, with 0.325 AR. No manual generative non-baseline runs were submitted to ARQMath-3 Task 3. Table 7 shows ARQMath-3 Task 3 results for Task 1 systems. Similarly to Table 5, Table 7 shows baselines first, followed by teams ordered by their best AR, and within teams their runs are ordered by AR. 
6.4. Results

Task 3 runs were assessed together with Task 1 runs, using the same relevance definitions, although after that assessment was complete, we also did some additional annotation that was specific to Task 3. Here we present results for the baseline and submitted runs using manual and automatic measures, along with additional analysis that we performed using the additional annotation.

6.4.1. Manual Evaluation Measures

Table 5 shows ARQMath-3 results for Task 3 systems. This table shows baselines first, followed by teams ordered by their best Average Relevance (AR); within teams, runs are ordered by AR. As seen in the table, the automatic generative baseline run using GPT-3 achieved the best results, with 1.346 AR. Note that, uniquely among ARQMath evaluation measures, AR is not bounded between 0 and 1; rather, it is bounded between 0 and 3.13 Among manual extractive non-baseline runs, the highest AR was achieved by a run from the approach0 team, with 1.282 AR. Among automatic extractive non-baseline runs, the highest AR was achieved by a run from the DPRL team, with 0.462 AR. Among automatic generative non-baseline runs, the highest AR was achieved by the TU_DBS team, with 0.325 AR. No manual generative non-baseline runs were submitted to ARQMath-3 Task 3.

13 Because some topics have no highly relevant answers, the actual maximum value of AR on the Task 3 topics is 2.346.

Table 7 shows ARQMath-3 Task 3 results for Task 1 systems. Similarly to Table 5, Table 7 shows baselines first, followed by teams ordered by their best AR; within teams, runs are ordered by AR. As seen in the table, the Linked MSE posts baseline achieved the best result, with 1.608 AR. Among non-baseline runs, the highest AR was achieved by a run from the approach0 team, with 1.377 AR. Among automatic runs, the highest AR was achieved by a run from the TU_DBS team, with 1.192 AR. Compared to the ARQMath-3 Task 1 results in Table 3, the TU_DBS team’s best run did relatively better, swapping order with the best runs from the MSM and MIRMU teams. Within teams, the fusion_alpha05 run from approach0, which achieved the highest nDCG′ on Task 1, did not do as well as that team’s rerank_nostemer system when both were scored using Task 3 measures. The RRF-AMR-SVM run from DPRL, which achieved the second highest nDCG′ score among DPRL runs on Task 1, received the lowest AR and P@1 among Task 1 systems. These differences result from the exclusive focus of Task 3 measures on the single highest-ranked result.

6.4.2. Automatic Evaluation Measures

At least one participating system produced a relevant answer (with H+M binarization) for 66 of the 78 Task 3 topics. However, automated evaluation can only be computed with ablation of each team’s contributions if two or more of the three teams produced a relevant answer; there were only 35 such topics. We therefore expanded the set of references for the automatic Task 3 measures to also include relevant answers (with H+M binarization) that were produced for ARQMath-3 topics by Task 1 systems, but only for relevant answers that were no longer than 1,200 Unicode characters.

As one measure of the suitability of our automatic evaluation measures for the evaluation of future systems, we report paired pointwise correlations between our automatic and manual measures, using Pearson’s 𝑟 to characterize the linear relationship between the measures, and Kendall’s 𝜏 to characterize differences in how the evaluation measures rank systems.

Table 5 also shows results for the automatic evaluation measures. The automatic generative baseline run using GPT-3, which achieved the best result using manual measures, scored below extractive runs from the approach0 and DPRL teams on both automatic measures. We theorize that this is because we used relevant answers produced by Task 1 systems in our automatic measures, which favors extractive systems over generative systems, because identical hits may be retrieved by extractive systems. Both automatic measures maintained the ordering of teams given by the manual measures.

Table 6 shows pointwise correlations between the manual and automatic measures. Both automatic measures show a strong linear relationship to the manual measures, with lexical overlap (LO) and average relevance (AR) having Pearson’s 𝑟 of 0.837, and contextual similarity (CS) and AR having Pearson’s 𝑟 of 0.839. LO is better able to maintain the ordering of results given by the manual measures, having Kendall’s 𝜏 with AR of 0.736, compared to CS, which has Kendall’s 𝜏 with AR of 0.670. Furthermore, LO is also more easily interpretable than CS, because it only considers exact matches between tokens and is independent of a specific BERT language model, which may have to be replaced in the future. This suggests that LO may be preferable as an automatic measure for evaluating future Math OpenQA systems.
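The correlations reported in Table 6 can be reproduced from per-run scores with standard statistics routines; the sketch below assumes parallel lists holding one score per submitted run (hypothetical variable names).

```python
# Minimal sketch of the Table 6 correlation analysis, assuming parallel
# lists that hold one score per submitted run.
from scipy.stats import kendalltau, pearsonr


def measure_agreement(manual_scores, automatic_scores):
    """Pearson's r: linear relationship between the two measures.
    Kendall's tau: agreement in how the two measures rank the runs."""
    r, _ = pearsonr(manual_scores, automatic_scores)
    tau, _ = kendalltau(manual_scores, automatic_scores)
    return r, tau


# Example usage (hypothetical lists of per-run AR and LO scores):
# r, tau = measure_agreement(ar_per_run, lo_per_run)
```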
6.4.3. Characterizing Answers

The answers for Task 3 were assessed together with the Task 1 results, using the same relevance definitions. We also provided a sample of Task 1 and Task 3 answers to assessors, and asked them to annotate:

1. Whether answers were machine-generated
2. Whether answers contained information unrelated to the topic question

In Tasks 1 and 3, answers are considered relevant if any part of the answer is relevant to the question. Annotating unrelated information allows us to determine whether extractive systems stuff answers with unrelated information, perhaps in the hope that some of it will be relevant, and whether generative systems generate off-topic content together with on-topic content. To support that analysis, assessors were asked to differentiate between undesirable answer stuffing and the possibly desirable inclusion of background information that is related to the question or to relevant part(s) of the answer.

We report the answers to these questions using two measures:

1. Machine-Generated (MG). The fraction of answers assessed as machine-generated. Ideally this would always be zero, but in practice we are interested in whether it is larger for generative systems than for extractive systems.

2. Unrelated Information (UI). The fraction of answers assessed as containing information unrelated to the question. Again, ideally this would be zero.

We report these measures as averages over 73 of the 78 Task 3 topics because one assessor was unable to complete this post-evaluation assessment process.14

14 The five topics for which results were not characterized in this way are A.301, A.314, A.322, A.324, and A.350.

Table 5 includes results for these measures. The manual extractive run of approach0 produced the smallest fraction of answers annotated as machine-generated (11%). Among generative runs, the automatic baseline run using GPT-3 produced the fewest answers annotated as machine-generated (28.8%). With the exception of the automatic extractive SBERT-QQ-AMR run from DPRL, which had 34.2% of answers annotated as machine-generated, the generative runs are linearly separable from the extractive runs (by MG > 0.26). This suggests that even though people would perform worse than chance at identifying individual answers from systems such as GPT-3 as machine-generated, they would often be able to differentiate between extractive and generative systems after seeing many answers from a system.

We also see that UI has a strong inverse correlation with AR, with Pearson’s 𝑟 of −0.97 and Kendall’s 𝜏 of −0.88. Moreover, 90.43% of the answers that were annotated as containing information unrelated to the question had been assessed as not relevant (with H+M binarization), whereas only 79.03% of all answers had been assessed as not relevant (with H+M binarization). This suggests that answer stuffing was not a serious problem in our evaluation.
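The MG and UI measures, and the answer-stuffing check above, reduce to simple proportions over the per-answer annotations; a minimal sketch with hypothetical field names follows.

```python
# Minimal sketch of the MG and UI measures and the answer-stuffing check,
# assuming per-answer annotation records with hypothetical field names.
def characterize_answers(annotations):
    """annotations: list of dicts with Boolean fields 'machine_generated',
    'unrelated_info', and 'relevant_hm' (relevance under H+M binarization)."""
    n = len(annotations)
    mg = sum(a["machine_generated"] for a in annotations) / n
    ui = sum(a["unrelated_info"] for a in annotations) / n
    # Fraction of unrelated-information answers judged not relevant,
    # compared with the not-relevant fraction over all answers.
    unrelated = [a for a in annotations if a["unrelated_info"]]
    not_relevant_given_ui = sum(not a["relevant_hm"] for a in unrelated) / len(unrelated)
    not_relevant_overall = sum(not a["relevant_hm"] for a in annotations) / n
    return mg, ui, not_relevant_given_ui, not_relevant_overall
```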
7. Conclusion

Over the course of three years, ARQMath has created test collections for three tasks that together include relevance judgments for hundreds of topics for two of those tasks, and 78 topics for the third. Coming as it did at the dawn of the neural age in information retrieval, the lab has seen considerable innovation in methods throughout its three years. ARQMath has also included substantial innovation in evaluation design, including better contextualized definitions for graded relevance and the piloting of a new task on open domain question answering. Having achieved our twin goals of building a new test collection from Math Stack Exchange posts and bringing together a research community around that test collection, the time has now come to end this lab at CLEF. We expect, however, that both that collection and that community will continue to contribute to advancing the state of the art in Math IR for years to come.

Acknowledgements

We thank our student assessors from RIT: Duncan Brickner, Jill Conti, James Hanby, Gursimran Lnu, Megan Marra, Gregory Mockler, Tolu Olatunbosun, and Samson Zhang. This material is based upon work supported by the National Science Foundation (USA) under Grant No. IIS-1717997 and by the Alfred P. Sloan Foundation under Grant No. G-2017-9827.

References

[1] B. Mansouri, R. Zanibbi, D. W. Oard, Characterizing Searches for Mathematical Concepts, in: Joint Conference on Digital Libraries (JCDL), 2019.
[2] A. Aizawa, M. Kohlhase, I. Ounis, NTCIR-10 Math Pilot Task Overview, in: Proceedings of the 10th NTCIR, 2013.
[3] A. Aizawa, M. Kohlhase, I. Ounis, NTCIR-11 Math-2 Task Overview, in: Proceedings of the 11th NTCIR, 2014.
[4] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, K. Davila, NTCIR-12 MathIR Task Overview, in: Proceedings of the 12th NTCIR, 2016.
[5] M. Líška, P. Sojka, M. Růžička, P. Mravec, Web Interface and Collection for Mathematical Retrieval WebMIaS and MREC, 2011.
[6] Y. Stathopoulos, S. Teufel, Retrieval of Research-level Mathematical Information Needs: A Test Collection and Technical Terminology Experiment, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015.
[7] B. Mansouri, D. W. Oard, A. Agarwal, R. Zanibbi, Effects of Context, Complexity, and Clustering on Evaluation for Math Formula Retrieval, arXiv preprint arXiv:2111.10504, 2021.
[8] M. Hopkins, R. Le Bras, C. Petrescu-Prahova, G. Stanovsky, H. Hajishirzi, R. Koncel-Kedziorski, SemEval-2019 Task 10: Math Question Answering, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019.
[9] J. Meadows, A. Freitas, A Survey in Mathematical Language Processing, arXiv preprint arXiv:2205.15231, 2022.
[10] C. Macdonald, N. Tonellotto, Declarative Experimentation in Information Retrieval using PyTerrier, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 2020.
[11] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, D. Johnson, Terrier Information Retrieval Platform, in: European Conference on Information Retrieval, Springer, 2005.
[12] K. Davila, R. Zanibbi, Layout and Semantics: Combining Representations for Mathematical Formula Search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.
[13] T. Sakai, N. Kando, On Information Retrieval Metrics Designed for Evaluation with Incomplete Relevance Assessments, Information Retrieval, 2008.
[14] C. Buckley, E. M. Voorhees, Retrieval Evaluation with Incomplete Information, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.
[15] B. Mansouri, S. Rohatgi, D. W. Oard, J. Wu, C. L. Giles, R. Zanibbi, Tangent-CFT: An Embedding Model for Mathematical Formulas, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR), 2019.
[16] B. Mansouri, R. Zanibbi, D. W. Oard, Learning to Rank for Mathematical Formula Retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.
[17] F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, T.-S. Chua, Retrieving and Reading: A Comprehensive Survey on Open-Domain Question Answering, arXiv preprint arXiv:2101.00774v3, 2021.
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language Models are Few-Shot Learners, 2020.
[19] J. Ficler, Y. Goldberg, Controlling Linguistic Style Aspects in Neural Language Generation, in: Proceedings of the Workshop on Stylistic Variation, Association for Computational Linguistics, 2017.
[20] S. Reddy, D. Chen, C. D. Manning, CoQA: A Conversational Question Answering Challenge, Transactions of the Association for Computational Linguistics, 2019.
[21] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT, arXiv preprint arXiv:1904.09675, 2019.

Table 3
ARQMath 2022 Task 1 (CQA) results. P: primary run, M: manual run, (✓): baseline pooled as a primary run. For MAP′ and P′@10, H+M binarization was used. (D)ata indicates use of (T)ext, (M)ath, (B)oth text and math, or link structure (*L). A1, A2, and A3 denote the ARQMath-1 (77 topics), ARQMath-2 (71 topics), and ARQMath-3 (78 topics) topic sets.

| Team | Run | D | P | M | nDCG′ (A1) | MAP′ (A1) | P′@10 (A1) | nDCG′ (A2) | MAP′ (A2) | P′@10 (A2) | nDCG′ (A3) | MAP′ (A3) | P′@10 (A3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | TF-IDF(Terrier) | B | | | 0.204 | 0.049 | 0.073 | 0.185 | 0.046 | 0.063 | 0.272 | 0.064 | 0.124 |
| | TF-IDF(PyTerrier)+Tangent-S | B | (✓) | | 0.249 | 0.059 | 0.081 | 0.158 | 0.035 | 0.072 | 0.229 | 0.045 | 0.097 |
| | TF-IDF(PyTerrier) | B | | | 0.218 | 0.079 | 0.127 | 0.120 | 0.029 | 0.055 | 0.190 | 0.035 | 0.065 |
| | Tangent-S | M | | | 0.158 | 0.033 | 0.051 | 0.111 | 0.027 | 0.052 | 0.159 | 0.039 | 0.086 |
| | Linked MSE posts | *L | (✓) | | 0.279 | 0.194 | 0.384 | 0.203 | 0.120 | 0.282 | 0.106 | 0.051 | 0.168 |
| approach0 | fusion_alpha05 | B | ✓ | ✓ | 0.462 | 0.244 | 0.321 | 0.460 | 0.226 | 0.296 | 0.508 | 0.216 | 0.345 |
| | fusion_alpha03 | B | | ✓ | 0.460 | 0.246 | 0.312 | 0.450 | 0.221 | 0.278 | 0.495 | 0.203 | 0.317 |
| | fusion_alpha02 | B | | ✓ | 0.455 | 0.243 | 0.309 | 0.443 | 0.217 | 0.266 | 0.483 | 0.195 | 0.305 |
| | rerank_nostemer | B | | ✓ | 0.382 | 0.205 | 0.322 | 0.385 | 0.187 | 0.276 | 0.418 | 0.172 | 0.309 |
| | a0porter | B | | ✓ | 0.373 | 0.204 | 0.270 | 0.383 | 0.185 | 0.241 | 0.397 | 0.159 | 0.271 |
| MSM | Ensemble_RRF | B | ✓ | | 0.422 | 0.172 | 0.197 | 0.381 | 0.119 | 0.152 | 0.504 | 0.157 | 0.241 |
| | BM25_system | B | | | 0.332 | 0.123 | 0.168 | 0.285 | 0.082 | 0.116 | 0.396 | 0.122 | 0.194 |
| | BM25_TfIdf_system | B | | | 0.332 | 0.123 | 0.168 | 0.286 | 0.083 | 0.116 | 0.396 | 0.122 | 0.194 |
| | TF-IDF | B | | | 0.238 | 0.074 | 0.117 | 0.169 | 0.040 | 0.076 | 0.280 | 0.064 | 0.081 |
| | CompuBERT22 | B | | | 0.115 | 0.038 | 0.099 | 0.098 | 0.030 | 0.090 | 0.130 | 0.025 | 0.059 |
| MIRMU | MiniLM+RoBERTa | B | ✓ | | 0.466 | 0.246 | 0.339 | 0.487 | 0.233 | 0.316 | 0.498 | 0.184 | 0.267 |
| | MiniLM+MathRoBERTa | B | | | 0.466 | 0.246 | 0.339 | 0.484 | 0.227 | 0.310 | 0.496 | 0.181 | 0.273 |
| | MiniLM_tuned+MathRoBERTa | B | | | 0.470 | 0.240 | 0.335 | 0.472 | 0.221 | 0.309 | 0.494 | 0.178 | 0.262 |
| | MiniLM_tuned+RoBERTa | B | | | 0.466 | 0.246 | 0.339 | 0.487 | 0.233 | 0.316 | 0.472 | 0.165 | 0.244 |
| | MiniLM+RoBERTa | T | | | 0.298 | 0.124 | 0.201 | 0.277 | 0.104 | 0.180 | 0.350 | 0.107 | 0.159 |
| MathDowsers | L8_a018 | B | ✓ | | 0.511 | 0.261 | 0.307 | 0.510 | 0.223 | 0.265 | 0.474 | 0.164 | 0.247 |
| | L8_a014 | B | | | 0.513 | 0.257 | 0.313 | 0.504 | 0.220 | 0.265 | 0.468 | 0.155 | 0.237 |
| | L1on8_a030 | B | | | 0.482 | 0.241 | 0.281 | 0.507 | 0.224 | 0.282 | 0.467 | 0.159 | 0.236 |
| TU_DBS | math_10 | B | ✓ | | 0.446 | 0.268 | 0.392 | 0.454 | 0.228 | 0.321 | 0.436 | 0.158 | 0.263 |
| | Khan_SE_10 | B | | | 0.437 | 0.254 | 0.357 | 0.437 | 0.214 | 0.309 | 0.426 | 0.154 | 0.236 |
| | base_10 | B | | | 0.438 | 0.252 | 0.369 | 0.434 | 0.209 | 0.299 | 0.423 | 0.154 | 0.228 |
| | roberta_10 | B | | | 0.438 | 0.254 | 0.372 | 0.446 | 0.224 | 0.309 | 0.413 | 0.150 | 0.226 |
| | math_10_add | B | | | 0.421 | 0.264 | 0.405 | 0.566 | 0.445 | 0.589 | 0.379 | 0.149 | 0.278 |
| DPRL | SVM-Rank | B | ✓ | | 0.508 | 0.467 | 0.604 | 0.533 | 0.460 | 0.596 | 0.283 | 0.067 | 0.101 |
| | RRF-AMR-SVM | B | | | 0.587 | 0.519 | 0.625 | 0.582 | 0.490 | 0.618 | 0.274 | 0.054 | 0.022 |
| | QQ-QA-RawText | B | | | 0.511 | 0.467 | 0.604 | 0.532 | 0.460 | 0.597 | 0.245 | 0.054 | 0.099 |
| | QQ-QA-AMR | B | | | 0.276 | 0.180 | 0.295 | 0.186 | 0.103 | 0.237 | 0.185 | 0.040 | 0.091 |
| | QQ-MathSE-AMR | B | | | 0.231 | 0.114 | 0.218 | 0.187 | 0.069 | 0.138 | 0.178 | 0.039 | 0.081 |
| SCM | interpolated_text+positional_word2vec_tangentl | B | ✓ | | 0.254 | 0.102 | 0.182 | 0.197 | 0.059 | 0.149 | 0.257 | 0.060 | 0.119 |
| | joint_word2vec | B | | | 0.247 | 0.105 | 0.187 | 0.183 | 0.047 | 0.106 | 0.249 | 0.059 | 0.106 |
| | joint_tuned_roberta | B | | | 0.248 | 0.104 | 0.187 | 0.184 | 0.047 | 0.109 | 0.249 | 0.059 | 0.105 |
| | joint_positional_word2vec | B | | | 0.247 | 0.105 | 0.190 | 0.184 | 0.047 | 0.109 | 0.248 | 0.059 | 0.105 |
| | joint_roberta_base | T | | | 0.135 | 0.048 | 0.101 | 0.099 | 0.023 | 0.060 | 0.188 | 0.040 | 0.077 |

Table 4
ARQMath 2022 Task 2 (Formula Retrieval) results. P: primary run, M: manual run, (✓): baseline pooled as a primary run. MAP′ and P′@10 use H+M binarization. Data indicates sources used by systems: (M)ath, or (B)oth math and text. A1, A2, and A3 denote the ARQMath-1 (45 topics), ARQMath-2 (58 topics), and ARQMath-3 (76 topics) topic sets. Baseline results are in parentheses.

| Team | Run | Data | P | M | nDCG′ (A1) | MAP′ (A1) | P′@10 (A1) | nDCG′ (A2) | MAP′ (A2) | P′@10 (A2) | nDCG′ (A3) | MAP′ (A3) | P′@10 (A3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | Tangent-S | M | (✓) | | 0.691 | 0.446 | 0.453 | 0.492 | 0.272 | 0.419 | 0.540 | 0.336 | 0.511 |
| approach0 | fusion_alph05 | M | ✓ | ✓ | 0.647 | 0.507 | 0.529 | 0.652 | 0.471 | 0.612 | 0.720 | 0.568 | 0.688 |
| | fusion_alph03 | M | | ✓ | 0.644 | 0.513 | 0.520 | 0.649 | 0.470 | 0.603 | 0.720 | 0.565 | 0.665 |
| | fusion_alph02 | M | | ✓ | 0.633 | 0.502 | 0.513 | 0.646 | 0.469 | 0.597 | 0.715 | 0.558 | 0.659 |
| | a0 | M | | ✓ | 0.582 | 0.446 | 0.477 | 0.573 | 0.420 | 0.588 | 0.639 | 0.501 | 0.615 |
| | fusion02_ctx | B | | ✓ | 0.575 | 0.448 | 0.496 | 0.575 | 0.417 | 0.590 | 0.631 | 0.490 | 0.611 |
| DPRL | TangentCFT2ED | M | ✓ | | 0.648 | 0.480 | 0.502 | 0.569 | 0.368 | 0.541 | 0.694 | 0.480 | 0.611 |
| | TangentCFT2 | M | | | 0.607 | 0.438 | 0.482 | 0.552 | 0.350 | 0.510 | 0.641 | 0.419 | 0.534 |
| | T-CFT2TED+MathAMR | B | | | 0.667 | 0.526 | 0.569 | 0.630 | 0.483 | 0.662 | 0.640 | 0.388 | 0.478 |
| | LTR | M | | | 0.733 | 0.532 | 0.518 | 0.550 | 0.333 | 0.491 | 0.575 | 0.377 | 0.566 |
| | MathAMR | B | | | 0.651 | 0.512 | 0.567 | 0.623 | 0.482 | 0.660 | 0.316 | 0.160 | 0.253 |
| MathDowsers | latex_L8_a040 | M | | | 0.657 | 0.460 | 0.516 | 0.624 | 0.412 | 0.524 | 0.640 | 0.451 | 0.549 |
| | latex_L8_a035 | M | | | 0.659 | 0.461 | 0.516 | 0.619 | 0.410 | 0.522 | 0.640 | 0.450 | 0.549 |
| | L8 | M | ✓ | | 0.646 | 0.454 | 0.509 | 0.617 | 0.409 | 0.510 | 0.633 | 0.445 | 0.549 |
| XYPhoc | xy7o4 | M | | | 0.492 | 0.316 | 0.433 | 0.448 | 0.250 | 0.435 | 0.472 | 0.309 | 0.563 |
| | xy5 | M | | | 0.419 | 0.263 | 0.403 | 0.328 | 0.168 | 0.391 | 0.369 | 0.211 | 0.518 |
| | xy5IDF | M | ✓ | | 0.379 | 0.241 | 0.374 | 0.317 | 0.156 | 0.391 | 0.322 | 0.180 | 0.461 |
| JU_NITS | formulaL | M | ✓ | | 0.238 | 0.151 | 0.208 | 0.178 | 0.078 | 0.221 | 0.161 | 0.059 | 0.125 |
| | formulaO | M | | | 0.007 | 0.001 | 0.009 | 0.182 | 0.101 | 0.367 | 0.016 | 0.008 | 0.001 |
| | formulaS | M | | | 0.000 | 0.000 | 0.000 | 0.142 | 0.070 | 0.159 | 0.000 | 0.000 | 0.000 |

Table 5
ARQMath 2022 Task 3 (Open Domain QA) results for Task 3 systems. P: primary run, M: manual run, G: generative system, (✓): baseline pooled as a primary run. All runs use (B)oth math and text. P@1 uses H+M binarization. AR: Average Relevance. LO: Lexical Overlap. CS: Contextual Similarity. MG: ratio of answers assessed as Machine-Generated. UI: ratio of answers with Unrelated Information. AR, P@1, LO, and CS use all 78 Task 3 topics (the same topics as Task 1); MG and UI use a subset of 73 topics. Baseline results are in parentheses.
| Team | Run | Data | P | M | G | AR | P@1 | LO | CS | MG | UI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | GPT-3 | B | (✓) | | ✓ | (1.346) | (0.500) | 0.317 | 0.851 | 0.288 | (0.466) |
| approach0 | run1 | B | | ✓ | | 1.282 | 0.436 | 0.509 | 0.886 | 0.110 | 0.562 |
| | run4 | B | | ✓ | | 1.231 | 0.397 | 0.515 | 0.886 | 0.123 | 0.616 |
| | run3 | B | | ✓ | | 1.179 | 0.372 | 0.467 | 0.879 | 0.247 | 0.658 |
| | run2 | B | | ✓ | | 1.115 | 0.321 | 0.427 | 0.868 | 0.164 | 0.616 |
| | run5 | B | ✓ | ✓ | | 0.949 | 0.282 | 0.444 | 0.873 | 0.151 | 0.671 |
| DPRL | SBERT-SVMRank | B | | | | 0.462 | 0.154 | 0.330 | 0.846 | 0.205 | 0.767 |
| | BERT-SVMRank | B | ✓ | | | 0.449 | 0.154 | 0.329 | 0.846 | 0.178 | 0.808 |
| | SBERT-QQ-AMR | B | | | | 0.423 | 0.128 | 0.325 | 0.852 | 0.342 | 0.877 |
| | BERT-QQ-AMR | B | | | | 0.385 | 0.103 | 0.323 | 0.851 | 0.260 | 0.863 |
| TU_DBS | amps3_se1_hints | B | | | ✓ | 0.325 | 0.078 | 0.263 | 0.835 | 0.833 | 0.931 |
| | se3_len_pen_10 | B | | | ✓ | 0.244 | 0.064 | 0.248 | 0.806 | 0.877 | 0.890 |
| | amps3_se1_len_pen_20_sample_hint | B | | | ✓ | 0.231 | 0.051 | 0.254 | 0.813 | 0.959 | 0.932 |
| | shortest | B | ✓ | | ✓ | 0.205 | 0.026 | 0.239 | 0.820 | 0.849 | 0.918 |

Table 6
ARQMath 2022 Task 3 (Open Domain QA) correlations between automatic and manual evaluation measures from Table 5. P@1 uses H+M binarization. AR: Average Relevance. LO: Lexical Overlap. CS: Contextual Similarity. Task 3 topics are the same as Task 1 topics.

(a) Pearson's 𝑟

| | AR | P@1 | LO | CS |
|---|---|---|---|---|
| AR | 1.000 | 0.989 | 0.837 | 0.839 |
| P@1 | 0.989 | 1.000 | 0.787 | 0.802 |
| LO | 0.837 | 0.787 | 1.000 | 0.952 |
| CS | 0.839 | 0.802 | 0.952 | 1.000 |

(b) Kendall's 𝜏

| | AR | P@1 | LO | CS |
|---|---|---|---|---|
| AR | 1.000 | 0.994 | 0.736 | 0.670 |
| P@1 | 0.994 | 1.000 | 0.729 | 0.674 |
| LO | 0.736 | 0.729 | 1.000 | 0.805 |
| CS | 0.670 | 0.674 | 0.805 | 1.000 |

Table 7
ARQMath 2022 Task 3 (Open Domain QA) results for Task 1 systems over the 78 Task 3 topics. P: primary run, M: manual run, (✓): baseline pooled as a primary run. P@1 uses H+M binarization. AR: Average Relevance. (D)ata indicates use of (T)ext, (M)ath, (B)oth text and math, or link structure (*L). Baseline results are in parentheses.

| Team | Run | Data | P | M | AR | P@1 |
|---|---|---|---|---|---|---|
| Baselines | Linked MSE posts | *L | (✓) | | (1.608) | (0.541) |
| | TF-IDF(Terrier) | B | | | 0.590 | 0.154 |
| | TF-IDF(PyTerrier)+Tangent-S | B | (✓) | | 0.513 | 0.167 |
| | Tangent-S | M | | | 0.410 | 0.128 |
| | TF-IDF(PyTerrier) | B | | | 0.333 | 0.051 |
| approach0 | rerank_nostemer | B | | ✓ | 1.377 | 0.481 |
| | fusion_alpha05 | B | ✓ | ✓ | 1.247 | 0.468 |
| | fusion_alpha03 | B | | ✓ | 1.077 | 0.385 |
| | fusion_alpha02 | B | | ✓ | 0.974 | 0.346 |
| | a0porter | B | | ✓ | 0.885 | 0.321 |
| TU_DBS | math_10 | B | ✓ | | 1.192 | 0.372 |
| | math_10_add | B | | | 1.128 | 0.321 |
| | Khan_SE_10 | B | | | 1.103 | 0.333 |
| | base_10 | B | | | 1.038 | 0.295 |
| | roberta_10 | B | | | 0.910 | 0.269 |
| MIRMU | MiniLM+RoBERTa | B | ✓ | | 1.143 | 0.377 |
| | MiniLM_tuned+RoBERTa | B | | | 1.141 | 0.372 |
| | MiniLM+MathRoBERTa | B | | | 1.013 | 0.338 |
| | MiniLM_tuned+MathRoBERTa | B | | | 0.974 | 0.308 |
| | MiniLM+RoBERTa | T | | | 0.679 | 0.205 |
| MSM | Ensemble_RRF | B | ✓ | | 1.026 | 0.295 |
| | BM25_system | B | | | 0.718 | 0.218 |
| | BM25_TfIdf_system | B | | | 0.705 | 0.218 |
| | TF-IDF | B | | | 0.423 | 0.141 |
| | CompuBERT22 | B | | | 0.256 | 0.051 |
| MathDowsers | L8_a018 | B | ✓ | | 1.038 | 0.333 |
| | L1on8_a030 | B | | | 0.936 | 0.308 |
| | L8_a014 | B | | | 0.910 | 0.282 |
| DPRL | QQ-QA-RawText | B | | | 0.577 | 0.179 |
| | QQ-QA-AMR | B | | | 0.526 | 0.179 |
| | SVM-Rank | B | ✓ | | 0.474 | 0.128 |
| | QQ-MathSE-AMR | B | | | 0.423 | 0.128 |
| | RRF-AMR-SVM | B | | | 0.064 | 0.013 |
| SCM | interpolated_text+positional_word2vec_tangentl | B | ✓ | | 0.551 | 0.179 |
| | joint_word2vec | B | | | 0.551 | 0.154 |
| | joint_tuned_roberta | B | | | 0.551 | 0.154 |
| | joint_positional_word2vec | B | | | 0.551 | 0.154 |
| | joint_roberta_base | T | | | 0.333 | 0.077 |