    ARQMath Lab: An Incubator for Semantic Formula
             Search in zbMATH Open?

                   Philipp Scharpf1, Moritz Schubotz2,3, André Greiner-Petter2,
                       Malte Ostendorff1, Olaf Teschke3, and Bela Gipp2

                     1 University of Konstanz, Konstanz, Germany, {first.last}@uni-konstanz.de
                     2 University of Wuppertal, Wuppertal, Germany, andre.greiner-petter@zbmath.org, {last}@uni-wuppertal.de
                     3 FIZ Karlsruhe, Karlsruhe, Germany, {first.last}@fiz-karlsruhe.de



        Abstract. The zbMATH database contains more than 4 million bibliographic en-
        tries. We aim to provide easy access to these entries. Therefore, we maintain dif-
        ferent index structures, including a formula index. To optimize the findability of
        the entries in our database, we continuously investigate new approaches to satisfy
        the information needs of our users. We believe that the findings from the
        ARQMath evaluation will generate new insights into which index structures are
        most suitable to satisfy mathematical information needs. Search engines, recom-
        mender systems, plagiarism checking software, and many other added-value ser-
        vices acting on databases such as the arXiv and zbMATH need to combine natu-
        ral and formula language. One initial approach to address this challenge is to
        enrich the mostly unstructured document data via Entity Linking. The ARQMath
        Task at CLEF 2020 aims to tackle the problem of linking newly posted questions
        from Math Stack Exchange (MSE) to existing ones that were already answered
        by the community. To deeply understand MSE information needs, answer-, and
        formula types, we performed manual runs for Tasks 1 and 2. Furthermore, for
        Task 2, we explored several formula retrieval methods, such as fuzzy string
        search, k-nearest neighbors, and our recently introduced approach to retrieve
        Mathematical Objects of Interest (MOI) with textual search queries. The task re-
        sults show that neither our automated methods nor our manual runs achieved good
        scores in the competition. However, the perceived quality of the hits returned by
        the MOI search particularly motivates us to conduct further research on MOI.


        Keywords: Information Retrieval, Mathematical Information Retrieval,
        Question Answering, Semantic Search, Machine Learning, Mathematical
        Objects of Interest, ARQMath Lab


1       Introduction

In 2013, the first prototype of formula search in zbMATH was announced [1]; it has
since become an integral part of the zbMATH interface. At the beginning of 2021,
zbMATH will transform its business model from a subscription-based service to a pub-
licly funded open service. In this context, we evaluate novel approaches to include
mathematical formulae as first-class citizens in our mathematical information retrieval
infrastructure. In addition to the standard search that targets abstract, review, and publication
metadata, zbMATH also traces incoming links from the Question Answering platform
MathOverflow and provides backlinks from scientific articles to the MathOverflow posts
mentioning the publication [1]. We hypothesize that federating information from
zbMATH and MathOverflow will enhance the zbMATH search experience signifi-
cantly. The ARQMath Lab at CLEF 2020 aims to tackle the problem of linking newly
posted questions from Math Stack Exchange to existing ones that were already an-
swered by the community [2]. Using question postings from a test collection (extracted
by the ARQMath organizers from an MSE Internet Archive Snapshot 1 until 2018) as
queries, the goal is to retrieve relevant answer posts containing both text and at least
one formula. The test collection created for the task is intended to be used by research-
ers as a benchmark for mathematical retrieval tasks that involve both natural and math-
ematical language. The ARQMath Lab consists of two separate subtasks. Task 1 – An-
swer poses the challenge to retrieve relevant community answer posts given a question
from Math Stack Exchange (MSE). Task 2 – Formulas poses the challenge to retrieve
relevant formulas from question and answer posts. Specifically, the aim of Task 1 is to
be able to find old answers to new questions to speed up the community answer process.
The aim of Task 2 is to find a ranked list of relevant formulae in old questions and
answers to match a query formula from the new question. This task design seems to
be a good fit for our research interest, since the information needs are related. Moreover,
MathOverflow and math.stackexchange use the same data format, which enables us to
reuse software developed during this competition and to transform it into production
software later on. On the other hand, the mathematical level of questions on Math Stack
Exchange is less sophisticated and thus not all relevant rankings might be suitable for
our use-case.


1.1    ARQMath Lab


The ARQMath lab was motivated by the fact that Mansouri et al. discovered “that 20%
of the mathematical queries in general-purpose search engines were expressed as well-
formed questions” [2], [3]. Furthermore, with the increasing public interest in Commu-
nity Question Answering sites such as MSE2 and MathOverflow3 , it will be beneficial
to develop computational methods to support human answerers. Particularly, the “time-
to-answer” should be shortened by linking to related answers already provided on the
platform, which can potentially lead to the answer more quickly. This will be of great
help since questions are often urgent and related – sometimes even exactly matching –
existing answers are already available. However, the task is challenging because both

1
  https://archive.org/download/stackexchange
2
  https://math.stackexchange.com
3
  https://mathoverflow.net
questions and answers can be a combination of natural and mathematical language,
involving words and formulae. ARQMath lab at CLEF 2020 will be the first in a three-
year sequence through which the organizers “aim to push the state of the art in evalua-
tion design for math-aware IR” [2]. The task starts with the domain of mathematics
involving formula language. The goal is to later extend the task to other domains (e.g.,
chemistry or biology), which employ other types of special notation.


1.2       Math Stack Exchange

Stack Exchange is an online platform with a host of Q&A forums [4]. The Stack Ex-
change network consists of 177 Q&A communities including Stack Overflow, which
claims to be “the largest, most trusted online community for developers to learn and
share their knowledge”2. The different topic sites include Q&A on computer issues,
math, physics, photography, etc. Users can rank questions and answers by voting them
up or down according to their quality assessment. Stack Exchange makes its content
publicly available in XML format under a Creative Commons license [4]. The Math
Stack Exchange collection for the ARQMath lab tasks comprises Q&A postings extracted
from data dumps from the Internet Archive 4 . Currently, over 1 million questions are
included [2].


2         Related Work

2.1       Mathematical Question Answering

Already in 1974, Smith [5] describes a project investigating the understanding of natu-
ral language by computers. He develops a theoretical model of natural language pro-
cessing (NLP) and algorithmically implements his theory. Specifically, he chooses the
domain of elementary mathematics to construct a Q&A system for unrestricted natural
language input. However, for a long time afterward, there was little interest and progress in
the field of mathematical question answering. In 2012, Nguyen et al. [6] present a math-
aware search engine for a math question answering system. Their system handles both
textual keywords as well as mathematical expressions. The math feature extraction is
designed to encode the semantics of math expressions via a Finite State Machine model.
They tested their approach against three classical information retrieval strategies on
math documents crawled from Math Overflow, claiming to outperform them by more
than 9%. In 2017, Bhattacharya et al. [7] publish a survey of question answering for
math and science problems. They explore the current achievements towards the goal of
making computers smart enough to pass math and science tests. They conclude that
“the smartest AI could not pass high school”. In 2018, Gunawan et al. [8] pre-
sent an Indonesian question answering system for solving arithmetic word problems
using pattern matching. Their approach is integrated into a physical humanoid robot.
For auditive communication with the robot, the user’s Indonesian question must be


4
    https://archive.org
translated into English text. They employ NLP using the NLTK toolkit 5 , specifically
co-referencing, question parsing, and preprocessing. They report that the
Q&A system achieves an accuracy between 80% and 100%. However, they state that
the response time is rather slow, with an average of more than one minute. Also in 2018,
Schubotz et al. [9] present MathQA6, an open-source math-aware question answering
system based on Ask Platypus7. The system returns a single mathematical formula
for a natural language question in English or Hindi. The formulae are fetched from the
open knowledge-base Wikidata8. With numeric values for constants loaded from Wik-
idata, the user can do computations using the retrieved formula. It is claimed that the
system outperforms a popular computational mathematical knowledge-engine by 13%.
In 2019, Hopkins et al. [10] report on the SemEval-2019 task on math question answer-
ing. They derived a question set from Math SAT practice exams, including 2778 training
questions and 1082 test questions. According to their study, the top system correctly
answered 45% of the test questions, with a random guessing baseline at 17%. Beyond
the domain of math Q&A, Pineau [11] and Abdi et al. [12] present first approaches to
answer questions on physics.


2.2    Mathematical Document Subject Class Classification
For open-domain question redirection, it is beneficial to classify a given mathematical
question by its domain, e.g. geometry, calculus, set theory, physics, etc. There have
been several approaches to perform categorization or subject class classification for
mathematical documents. In 2017, Suzuki and Fujii [13] test classification methods on
collections built from MathOverflow9 and the arXiv 10 paper preprint repository. The
user tags include both keywords for math concepts and categories from the Mathematics
Subject Classification (MSC) 2010 11 top- and second-level subjects. In 2020,
Scharpf et al. [9] investigate how combining encodings of natural and mathematical
language affects the classification and clustering of documents with mathematical con-
tent. They employ sets of documents, sections, and abstracts from the arXiv 10, labeled
by their subject class (mathematics, computer science, physics, etc.) to compare differ-
ent encodings of text and formulae and to evaluate the performance and runtimes of se-
lected classification and clustering algorithms. Also in 2020, Schubotz et al. [14] ex-
plore whether it is feasible to automatically assign a coarse-grained primary classifica-
tion using the MSC scheme with multi-class classification algorithms. They claim to
achieve a precision of 81% for the automatic article classification. We conclude that
for math Q&A systems, the classification needs to be performed at the sentence level.
If MSE questions contain several sentences, the problem could potentially also be
framed as an abstract classification problem.
5
  https://www.nltk.org
6
  http://mathqa.wmflabs.org
7
  https://askplatyp.us
8
  https://www.wikidata.org
9
  https://mathoverflow.net
10
   https://arxiv.org
11
   http://msc2010.org
2.3    Connecting Natural and Mathematical Language

For mathematical question answering, mathematical information needs to be connected
to natural language queries. Yang & Ko [15] present a search engine for formulae in
MathML12 using a plain word query. Mansouri et al. [3] investigate how queries for
mathematical concepts are performed in search engines. They conclude “that math
search sessions are typically longer and less successful than general search sessions”.
For non-mathematical queries, search engines like Google 13 or DuckDuckGo 14 already
provide entity cards with a short encyclopedic description of the searched concept [16].
For mathematical concepts, however, there is an urgent need to connect a natural lan-
guage query to a formula representing the keyword. Dmello [16] proposes integrating
entity cards into the math-aware search interface MathSeer15 . Scharpf et al. [17] pro-
pose a Formula Concept Retrieval challenge for Formula Concept Discovery (FCD)
and Formula Concept Recognition (FCR) tasks. They present first machine learning
based approaches for retrieving formula concepts from the NTCIR 11/12 arXiv da-
taset 16 .


2.4    Semantic Annotations

To connect mathematical formulae and symbols to natural language keywords, seman-
tic annotations are an effective means. So far there are only a few annotation systems
available for mathematical documents. Dumitru et al. [18] present a browser-based an-
notation tool (“KAT system”) for linguistic/semantic annotations in structured
(XHTML5) documents. Scharpf et al. [19] present “AnnoMathTeX”, a recommender
system for formula and identifier annotation of Wikipedia articles using Wikidata 17
QID item tags. The annotations can be integrated into the MathML markup using
MathML Wikidata Content Dictionaries18 [20], [21], [22].


3      Summary of Our Approach

We tackle the ARQMath lab tasks (Task 1 – answer retrieval, Task 2 – formula re-
trieval) using manual run selection benchmarking. Therefore, we create, populate, and
employ a Wiki19 with pages for normal (Task 1) and formula (Task 2) topics. The main
objective of our experiments was to explore methods to enable automatic answer as-
signment recommendations to question postings on Mathematics Stack Exchange
(MSE). We tested the following approaches or methods: 1) manual run annotation using

12
   https://www.w3.org/TR/MathML3
13
   https://www.google.com
14
   https://duckduckgo.com
15
   https://www.cs.rit.edu/~dprl/mathseer
16
   http://ntcir-math.nii.ac.jp
17
   https://www.wikidata.org
18
   https://www.openmath.org
19
   https://arq20.formulasearchengine.com
Google and MSE search, 2) formula TF-IDF or Doc2vec20 encodings [23] using the
Python libraries Scikit-learn 21 [24] and Gensim 22 [25], 3) fuzzy string comparison or
matching using rapidfuzz 23 , 4) k-nearest neighbors algorithm, and 5) discovering of
Mathematical Objects of Interest (MOI) with textual search queries [26].
As a result, we obtained relevant MSE answer IDs for each query in the sample of
Task 1, and a ranked list of the most relevant formulae for each query in the sample of Task
2 (if available). Finally, we analyzed our results using a manual consistency and quality
check.


4      Workflow of Our Approach

The workflow of our approach is illustrated in Fig. 1. It can be logically divided into
three stages: 1) the creation of a Wiki with pages for normal and formula topics, 2)
methods to tackle Task 1, and 3) methods to tackle Task 2.


• Wiki: retrieval of URLs using Google and MSE search; creation of the Wiki at arq20.formulasearchengine.com; creation of Wiki pages for normal and formula topics.

• Task 1: insert links to math.stackexchange.com/questions/xxx on the Wiki page; manual run selection of the most suitable answer; insert links to https://math.stackexchange.com/a/xxx as the “relevant answers” property on the Wikidata item for normal topics.

• Task 2: manual run selection of the most suitable formula(e); LaTeX string as the “defining formula” property, a subproperty of “relevant answers”, on the Wikidata item for formula topics.

Fig. 1. Workflow of our approach to retrieve answer and formula candidates for Tasks 1 and 2.

In the following, we describe the stages with their subtasks in more detail.
4.1     Setup Wiki Framework

The initial preparation step for our approach to tackle Task 1 and 2 was to create, pop-
ulate, and employ a MediaWiki environment connected to a mathoid [27] rendering


20
   Also known as “Paragraph Vectors”, as introduced in [23].
21
   https://scikit-learn.org
22
   https://radimrehurek.com/gensim
23
   https://github.com/maxbachmann/rapidfuzz
service with pages for normal and formula topics. For each query, there is a Wikibase
item with the following properties: ‘math-stackexchange-category’ (P10), ‘topic-id’
(P12), ‘post-type’ (P9), ‘math stackexcange post id’ (P5), and ‘relevant answers’ (P14).
Having set up the Wiki, we manually retrieved the question URLs using Google and
MSE search and inserted them as values for the ‘math stackexchange post id’ on the
respective question pages. Unfortunately, by doing so, some post-2019 post-IDs were
entered because we did not check the date carefully enough. The ‘math-stackexchange-
category’ values were automatically retrieved from the question tags. The ‘topic-id’
(e.g., A.50) was transferred from the task dataset, the ‘post-type’ set to “Question”.
Unfortunately, as we discovered later, the use of Google and MSE search led to results
outside the task dataset. This means that the answer that was accepted as the best answer
by the questioner was often not included in the task dataset. However, our aim was to
establish the “correct” answer as a semantic reference in our MediaWiki.


4.2    Populate Topic Answers (Task 1)

The first part in our experimental pipeline was a manual run selection of the most suit-
able answer from the MSE question posting page (preferably the one selected by the
questioner, if available). Subsequently, we inserted links to the answers, i.e.,
math.stackexchange.com/a/xxx, into the ‘relevant answers’ property of the query item
on the normal topics page.


4.3    Populate Formula Answers (Task 2)


The second part in our experimental pipeline was a manual run selection of the most
suitable formula per question or answer. The chosen formula was considered to answer
the given question as concisely as possible. Thus, we interpreted Task 2 as having to
find formula answers to the question, and not only similar formulae. We inserted the
extracted LaTeX string into the ‘defining formula’ property, as a subproperty of ‘relevant
answers’, on the Wikidata item for formula topics.


4.4    Preparing Data for Experiments and Submission

After having populated our Wiki database, we used a SPARQL query (Fig. 2) to have
an overview of its content. The query fetches all Wikidata question items, displaying
their ‘topic-id’ (e.g. A.1 or B.1), ‘post-id’ (e.g., 3063081), and the formula LaTeX
string. With the list of normal and formula topic insertions, we performed a quality
check, correcting wrong or missing values.
Fig. 2. SPARQL query to retrieve our manually inserted data containing topic answer links (Task
1 - Answer) and formula LaTeX strings (Task 2 - Formulas). The query properties are ‘math-
stackexchange-category’ (P10), ‘topic-id’ (P12), ‘post-type’ (P9), ‘math stackexcange post id’
(P5), and ‘relevant answers’ (P14).
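Since the query itself is only shown as a figure, the following is a rough Python sketch of how such a query could be issued against the Wiki's SPARQL endpoint. The endpoint URL and the prefix handling are assumptions about the Wikibase setup; only the property IDs (P5, P12, P14) are taken from the caption above, and the query is not the exact one shown in Fig. 2.

```python
# Minimal sketch (not the query from Fig. 2): fetch topic items with their
# post IDs and relevant answers from the Wiki's SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://arq20.formulasearchengine.com/query/sparql"  # assumed endpoint URL
sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
SELECT ?item ?topicId ?postId ?relevantAnswer WHERE {
  ?item wdt:P12 ?topicId ;                      # 'topic-id', e.g., A.1 or B.1
        wdt:P5  ?postId .                       # 'math stackexcange post id'
  OPTIONAL { ?item wdt:P14 ?relevantAnswer . }  # 'relevant answers'
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["topicId"]["value"], row["postId"]["value"])
```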


4.5       Discovering Mathematical Objects of Interest

The previously developed MOI search engine [26] allows us to search for meaningful
mathematical expressions with a given textual search query. This workflow can be used
to solve Task 2, but it requires some substantial updates. Essentially, Task 2 requests
relevant formula IDs for a given input formula ID. Each formula ID is mapped to the
corresponding post ID. Hence, we can take the entire post of a formula ID as the input
for our MOI search engine. However, there are two main problems with the existing
approach: (i) the MOI search engine was developed and tested only to search for key-
words, thus entering entire posts at once may harm the accuracy, and (ii) every re-
trieved MOI is by design a subexpression and thus probably has no designated formula
ID. To overcome these issues, we need to understand the current system. The MOI
search system retrieves MOIs in two steps. The first step retrieves relevant documents
from an elasticsearch 24 instance for the input query. Hence, we first indexed all
ARQMath posts in elasticsearch. To index the content of each post appropriately, we
set up the standard English stemmer, stopword filtering, and HTML stripping (filters out
HTML tags but preserves the content of each tag), and enabled ASCII folding (converts
alphabetic, numeric, and symbolic characters to their ASCII equivalents, e.g., ‘á’ is
replaced by ‘a’). For the search query, we used the standard match query but
boosted every mathematical expression in the input. This tells elasticsearch to focus
more on the math expressions in a search query than on the actual text. With this
setup, we overcome the mentioned issue (i) and can search for relevant posts by enter-
ing the entire content of a post. In the second step of the MOI search engine, the engine



24
     https://www.elastic.co
disassembles all formulae in the retrieved documents and calculates the mBM25 score
[26] for each of these subexpressions (MOI)
$$ s(t,d) := \frac{(k+1)\,\mathrm{IDF}(t)\,\mathrm{ITF}(t,d)\,\mathrm{TF}(t,d)}{\max_{t' \in d \mid c(t)} \mathrm{TF}(t',d) + k\left(1 - b + b\,\frac{|d|}{\mathrm{AVGDL}\cdot\mathrm{AVGC}}\right)}, $$

$$ \mathrm{mBM25}(t,D) := \max_{d \in D} s(t,d), $$
where mBM25 (𝑡, 𝐷) is a modified version of the BM25 relevance score [28] with 𝐷 as
the entire ARQMath corpus, IDF(𝑡 ) is the inverse document frequency of the term 𝑡,
TF(𝑡, 𝑑 ) the term frequency of the term 𝑡 in the document 𝑑 ∈ 𝐷, ITF (𝑡, 𝑑 ) the inverse
term frequency (calculated the same way as IDF(𝑡) but on the document level for the
document 𝑑), AVGDL the average document length of 𝐷 and AVGC the average com-
plexity of 𝐷 (see [26] for a more detailed description). The top-scored expressions will
be returned. The mBM25 score requires the global term and document frequencies of
every subexpression. Hence, we first calculated these global values for every subex-
pression of every formula in the ARQMath dataset. Table 1 shows the statistics of this
MOI database in comparison to the previously generated databases for arXiv and
zbMATH. A document in ARQMath is a post from MSE. The dataset only includes
MathML representations. The complexity of a formula is the maximum depth of the
Presentation MathML representation of the formula. As Table 1 shows, the ARQMath
database can be interpreted as a hybrid between the full research papers in arXiv and
the relatively short review discussions in zbMATH (mainly containing reviews of mathe-
matical articles).

Table 1. The MOI database statistics of ARQMath compared to the existing databases for arXiv
and zbMATH. The document length is the number of subexpressions.
                                              arXiv          zbMATH             ARQMath
                Documents                   841,008          1,349,297          2,058,866
                 Formulae               294,151,288         11,747,860         26,074,621
            Subexpressions            2,508,620,512         61,355,307        143,317,218
     Unique Subexpressions              350,206,974          8,450,496         16,897,129
          Avg. Doc. Length                 2,982.87              45.47              69.69
          Avg. Complexity                       5.01              4.89               5.00
          Max. Complexity                       218                 26                188


Table 2 lists the machine specification for the MOI retrieval and runtime for example
query B.1.

        Table 2. Machine hardware specification and example runtime for query B.1.
 Machine              Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz - 4 Cores / 8 Threads
 RAM                  32GB 2133 MHz
 Disk                 1TB SSD
 Required Diskspace   7.8 GB (Posts) + 3 GB (MOIs) = 10.8 GB
 Runtime              6.0 s / query (average over all queries)
Considering that every formula in the ARQMath dataset has its own ID and the system
needs to preserve the ID during computation, we need to attach the ID to every gener-
ated MOI. However, this would result in a massive overload. For example, the single
identifier 𝑥 appears 7.6 million times in ARQMath and would thus have millions of dif-
ferent formula IDs. The entire ARQMath dataset has 16.8 million unique MOIs. Handling
this number of different IDs is impractical. Hence, we chose a different approach to
get the formula IDs for every MOI. Since the search engine retrieves the relevant doc-
uments first, we only need to consider formula IDs that exist in these retrieved docu-
ments. To achieve this, we attached the formula IDs to every post in the elasticsearch
database rather than to the MOIs themselves. A single document in elasticsearch now contains
the post ID, the textual content, and a list of MOIs with local term frequencies (how
often the MOI appears in the corresponding post) and formula IDs. Note that most MOIs
still have multiple formula IDs, since a subexpression may appear multiple times in a
single post, but the number of different IDs is reduced drastically. Since the IDs are now
attached to each post but are not used in the search query, the performance of retrieving
relevant documents from elasticsearch stays the same. With this approach, we may cal-
culate multiple different mBM25 scores for a single formula ID, since a single
unique formula ID can be attached to multiple MOIs. To calculate the final score for a
formula ID, we calculated the average of all mBM25 scores for a formula ID. For ex-
ample, suppose we retrieve the document with the ID 2759760. This post contains the
formula with ID 25466124, $e/x^{6}$, which would be disassembled into its subexpressions
$e$, $x^{6}$, and $x$. Hence, we would calculate three mBM25 scores for $e/x^{6}$. The average
of these scores would be the score for the formula ID.
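To illustrate the indexing scheme and the boosted search query described above, the following is a minimal sketch using the elasticsearch Python client. The index name, field names, analyzer composition, and boost value are illustrative assumptions and not the exact production configuration.

```python
# Sketch only: index ARQMath posts with stemming, stopword removal, HTML
# stripping, and ASCII folding; query with boosted mathematical expressions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

es.indices.create(index="arqmath-posts", body={
    "settings": {"analysis": {"analyzer": {"post_analyzer": {
        "type": "custom",
        "char_filter": ["html_strip"],                       # drop HTML tags, keep content
        "tokenizer": "standard",
        "filter": ["lowercase", "stop", "porter_stem", "asciifolding"],
    }}}},
    "mappings": {"properties": {
        "post_id": {"type": "keyword"},
        "content": {"type": "text", "analyzer": "post_analyzer"},
        # each MOI with its local term frequency and the formula IDs it occurs in
        "moi": {"type": "nested", "properties": {
            "expression": {"type": "keyword"},
            "tf": {"type": "integer"},
            "formula_ids": {"type": "keyword"},
        }},
    }},
})

def build_query(text_tokens, math_tokens, boost=2.0):
    """Standard match query; math expressions are boosted over the plain text."""
    return {"query": {"bool": {"should": [
        {"match": {"content": {"query": " ".join(text_tokens)}}},
        {"match": {"content": {"query": " ".join(math_tokens), "boost": boost}}},
    ]}}}

hits = es.search(index="arqmath-posts",
                 body=build_query(["series", "converges"], ["\\sum_{n}"]))
```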

We used this updated MOI search engine to retrieve results for Task 2. Note that the
approach might be a bit unorthodox, since the MOI search engine takes the entire post
of the given formula ID rather than the formula ID alone. We interpreted Task 2 as
retrieving answer formulae for a given question formula, rather than retrieving visually
or semantically similar formulae. Based on this interpretation, it makes sense to use the
entire post of a formula ID to search for relevant answers. In other words, we interpreted
Task 2 as an extension and math specific version of Task 1. In summary, the key steps
of the MOI search engine to solve Task 2 were the following:
      1. Take the entire post of the given formula ID.
      2. Search for posts similar to the retrieved post in step 1.
      3. Extract all MOI from all retrieved posts in step 2.
      4. Calculate mBM25 scores for all MOIs of step 3.
      5. Group the MOIs by their associated formula IDs (every formula ID has now
         multiple mBM25 scores).
      6. Average the mBM25 scores for each formula ID.
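A compact Python sketch of steps 4–6 is given below; the data layout and the placeholder scoring function `score_moi` are illustrative assumptions and stand in for the full mBM25 computation described above.

```python
# Sketch of steps 4-6: score each MOI of the retrieved posts and average the
# scores per formula ID; `score_moi` stands in for the mBM25 computation.
from collections import defaultdict
from statistics import mean

def rank_formula_ids(retrieved_posts, score_moi):
    """retrieved_posts: list of dicts with an 'moi' list, each MOI carrying the
    subexpression, its local term frequency, and its formula IDs."""
    scores_per_formula = defaultdict(list)
    for post in retrieved_posts:
        for moi in post["moi"]:
            s = score_moi(moi, post)              # mBM25-style score (step 4)
            for fid in moi["formula_ids"]:        # group by formula ID (step 5)
                scores_per_formula[fid].append(s)
    # average the mBM25 scores for each formula ID (step 6)
    ranking = {fid: mean(vals) for fid, vals in scores_per_formula.items()}
    return sorted(ranking.items(), key=lambda kv: kv[1], reverse=True)
```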
For Task 2, we retrieved 107,476 MOIs. We used the provided annotation dataset to
evaluate the retrieved results. For a better comparison, we calculated the nDCG′p
(nDCG-prime) score, as the task organizers did [29]. Note that the nDCG′p removes un-
judged documents before calculating the score. Since these were post-experiment cal-
culations, there is little overlap between the retrieved MOI documents and the
judged formula IDs. We found 179 formula IDs that were retrieved by our MOI engine
and carried a judgment by the annotators of the ARQMath task. Based on these 179
judgments, we obtained an nDCG′p value of 0.374, which is in the midrange compared to
the other competitors.


4.6     Data Integration of Query and Pool Formulae

We tested two other approaches for Task 2: formula pool retrieval via k-nearest neigh-
bors and fuzzy string matching. For both methods, we first needed to integrate the pool
of formulae (the task dataset) with our query set, consisting of the formulae that we
‘manually’ chose from the candidate answers to be a formula answer to the question
asked.

• Data integration of query & pool formulae
• K-nearest neighbors retrieval
• Fuzzy string candidates retrieval

Fig. 3. Workflow for Task 2 – formula answer candidate retrieval. Manually selected ‘query’
formulae must be integrated with the task dataset pool before testing k-nearest neighbors or fuzzy
string formula candidate retrieval.
• DI: load TSV files for query and pool formulae; retrieve formula symbols (identifiers, operators) from the MathML tags ('ci', 'mi', 'co', 'mo'), together with the formula LaTeX string; integrate all formulae with IDs and save the dictionary to a Python pickle file.

• kNN: encode formula LaTeX strings via TF-IDF and Doc2Vec; retrieve distances and the k nearest formula candidates via the kNN algorithm.

• fuzzy: calculate pairwise fuzzy string partial ratios (matching percentages); rank all percentages for each formula to identify the closest candidates.

Fig. 4. Workflow of the data integration (DI) and formula candidate retrieval via k-nearest neigh-
bors (kNN) and (fuzzy) string similarity matching for Task 2.

In our integrated formula dictionary, each query formula has the following properties:
    -     order 'ord', e.g., '1',
    -     entity URL 'item', e.g., 'https://arq20.formulasearchengine.com/entity/Q1023',
    -     question ID 'd', e.g., 'B.1',
    -     the ‘manually’ retrieved relevant answer MSE ID 'val', e.g., '3063081',
    -     MathML string including the LaTeX formula string 'mml', e.g., '',
    -     identifiers list retrieved from MathML 'identifiers', e.g., ['c'],
    -     operators list retrieved from MathML 'operators', e.g., [∂], and
    -     LaTeX formula string retrieved from MathML 'LaTeX', e.g., '{\displaystyle
          c>{\frac {25}{64}}}'.
The properties are retrieved from the Wiki SPARQL query.

In our integrated formula dictionary, each pool formula has the following properties:
    -     formula ID 'id', e.g., '1',
    -     'post_id', e.g., '9',
    -     'thread_id', e.g. '5',
    -     'type', e.g., 'comment',
    -     MathML string 'formula', e.g., '" 𝜋"',
    -     identifiers list retrieved from MathML 'identifiers', e.g., ['c'],
    -     operators list retrieved from MathML 'operators', e.g., [∂], and
    -     LaTeX formula string retrieved from MathML 'LaTeX', e.g., '\\pi'.
The properties are retrieved from the task dataset TSV files. For the identifiers and oper-
ators lists, the symbols are retrieved from the MathML string using the identifier and
operator tags listed above ('ci', 'mi', 'co', 'mo'). The formula LaTeX string is retrieved from the
'alttext' attribute of the '<math>' tag. Finally, the formula dictionary is serialized to a
pickle file. It is utilized in the following steps (formula encoding, kNN and fuzzy string
similarity retrieval).
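A rough sketch of this data integration step for the pool formulae is shown below. It assumes the TSV columns and MathML layout described above; the file names and column headers are placeholders and may differ from the actual task files.

```python
# Sketch: extract identifiers, operators, and the LaTeX string from a MathML
# snippet and store the integrated dictionary as a pickle file.
import csv
import pickle
from xml.etree import ElementTree

def parse_mathml(mathml):
    root = ElementTree.fromstring(mathml)
    def local(tag):                          # drop namespace, e.g. '{...}mi' -> 'mi'
        return tag.split("}")[-1]
    identifiers = [e.text for e in root.iter() if local(e.tag) in ("mi", "ci") and e.text]
    operators   = [e.text for e in root.iter() if local(e.tag) in ("mo", "co") and e.text]
    latex = root.attrib.get("alttext", "")   # LaTeX from the 'alttext' attribute of <math>
    return identifiers, operators, latex

pool = {}
with open("formulas.tsv", newline="", encoding="utf-8") as f:       # placeholder file name
    for row in csv.DictReader(f, delimiter="\t"):                   # assumed column headers
        ident, ops, latex = parse_mathml(row["formula"])
        pool[row["id"]] = {"post_id": row["post_id"], "thread_id": row["thread_id"],
                           "type": row["type"], "identifiers": ident,
                           "operators": ops, "LaTeX": latex}

with open("formula_dictionary.pkl", "wb") as f:
    pickle.dump(pool, f)
```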


4.7    Formula LaTeX String Encoding via TF-IDF and Doc2Vec

After the LaTeX formula has been retrieved from the MathML string, it is encoded by jointly
feeding its identifier and operator tokens (UTF-8) into the TfidfVectorizer from the Py-
thon package Scikit-learn [24] and the Doc2Vec encoder from Gensim [25]. For the
TfidfVectorizer, an ngram range of (1,1) is used. The Doc2Vec distributed bag of words
(PV-DBOW) model is trained for 10 iterations.
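A minimal sketch of the two encodings follows; apart from the parameters stated above (ngram range (1,1), PV-DBOW, 10 iterations), all values and the toy token strings are illustrative.

```python
# Sketch: encode formula token strings with TF-IDF (Scikit-learn) and
# PV-DBOW Doc2Vec (Gensim).
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one token string per formula, built from its identifiers and operators (toy data)
formula_tokens = ["c < frac 25 64", "k < 6.64", "1 / 64"]

# TF-IDF with unigrams only
tfidf = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vectors = tfidf.fit_transform(formula_tokens)

# Doc2Vec distributed bag of words (dm=0 selects PV-DBOW), trained for 10 epochs
tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(formula_tokens)]
d2v = Doc2Vec(vector_size=100, dm=0, min_count=1, epochs=10)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
d2v_vectors = [d2v.dv[i] for i in range(len(tagged))]
```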


4.8    Formula Pool Retrieval via K-Nearest-Neighbors

The two resulting formula encoding vector spaces are subsequently fed into the
NearestNeighbors algorithm from Scikit-learn. In Table 3, some illustrative examples of the
top 3 results are displayed. In all cases, the retrieved formulae are structurally similar,
sometimes equivalent, sometimes even “visually” identical. Having generated the for-
mula encodings, the kNN method is very fast compared to classical text matching. The
vector computations can be carried out faster than text processing.

                      Table 3. Illustrative short examples of top 3 kNN results.

 Query (Task 2 ID): c < 25/64 (B.1)
 Results (Task 2 Formula ID): 1: k < 6.64… (77098), 2: 1/64 (144990), 3: 7/64 (95528)
 Comment: similar but wrong number; no inequation

 Query (Task 2 ID): 5^{2}\equiv 1({\text{mod}} (B.8)
 Results (Task 2 Formula ID): 1: (a-b)^{n}\equiv 0 (\text{mod} n) (54185), 2: a^{p-1}\equiv 1 (\text{mod} p) (94320), 3: 2^{p-1}\equiv 1 (\text{mod} p) (198801)
 Comment: structurally similar but containing variables instead of constants

 Query (Task 2 ID): {{\frac{a+bi}{\infty}}=0} (B.29)
 Results (Task 2 Formula ID): 1: a+bi (272260), 2: z=a+bi (218917), 3: a+bi (272255)
 Comment: the complex number a+bi is detected and retrieved, \infty missing

 Query (Task 2 ID): {p_{1}\dots p_{n}+1} (B.52)
 Results (Task 2 Formula ID): 1: p_{1}\dots p_{k}+1 (2203), 2: p_{1}+p_{2}+\dots p_{n}=1 (76726), 3: p_{1}=p_{2}=\dots=p_{6} (76715)
 Comment: formula 1 equivalent, using index k instead of n; formula 2 equivalent, with additional information (=1)

 Query (Task 2 ID): {\sum_{k=0}^{n}k{\binom{n}{k}}=n2^{n-1}=2^{n-1+\log_{2}n}} (B.86)
 Results (Task 2 Formula ID): 1: \sum^{k}_{m=0}\binom{k}{m}=2^{k} (280774), 2: \sum^{k}_{m=0}\binom{k}{m}=2^{k} (280771), 3: \sum_{k=0}^{n}\binom{n}{k}k=2^{n}\sum_{k=1}^{n}\frac{2^{k-1}}{2^{k}}=…
 Comment: formulae 1 and 2 are identical and almost equivalent to the query; formula 3 starts the summation index at k=1
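A minimal sketch of the retrieval step with Scikit-learn's NearestNeighbors is given below; the variables `pool_vectors`, `query_vectors`, `pool_formula_ids`, and `query_ids`, as well as the choice of k and metric, are illustrative assumptions.

```python
# Sketch: index the pool encodings and look up the k nearest formulae
# for each query encoding.
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=3, metric="cosine")
knn.fit(pool_vectors)                       # e.g., TF-IDF matrix of the pool formulae
distances, indices = knn.kneighbors(query_vectors)

for q, (dists, idxs) in enumerate(zip(distances, indices)):
    candidates = [(pool_formula_ids[i], d) for i, d in zip(idxs, dists)]
    print(query_ids[q], candidates)         # ranked formula candidates per query
```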
4.9          Formula Pool Retrieval via Fuzzy String Search

Apart from the NearestNeighbors prediction using TF-IDF and Doc2Vec encoded La-
TeX formula strings, we also tested fuzzy string matching to retrieve similar formulae.
For each ‘manually’ selected query formula, we calculated the fuzzy partial ratio simi-
larity with all pool formulae and ranked them by descending overlap. The top 10
candidates were then submitted. Compared to the kNN approach, the fuzzy string
search has the advantage of not requiring an encoding index. Thus, new formula in-
stances can easily be added without having to retrain the vector encodings of the
whole corpus.
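A minimal sketch of this fuzzy retrieval with rapidfuzz is shown below; the helper name and the toy pool are illustrative.

```python
# Sketch: rank pool formulae by fuzzy partial-ratio similarity to a query
# LaTeX string and keep the top 10 candidates.
from rapidfuzz import fuzz

def top_candidates(query_latex, pool, k=10):
    """pool: dict mapping formula ID -> LaTeX string."""
    scored = [(fid, fuzz.partial_ratio(query_latex, latex)) for fid, latex in pool.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

print(top_candidates("{\\frac{a+bi}{\\infty}}=0", {"272260": "a+bi", "218917": "z=a+bi"}))
```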


5            Classification of Question and Answer Types

To assess the relative relevance of the specific question, answer, and formula types, we
carried out a human multi-label classification for each set respectively. Our approach
was inductive, meaning that we did not specify the classes upfront but derived them
while examining the questions, answers, and formulae as they occurred.


5.1          Example Questions and Answers

To illustrate our classification operation mode, we will first give some examples.

In question A.1, the user asks to find the value of a parameter contained within a func-
tion, given an interval constraint. We classified this question with the label “calculate /
compute / find value”. Our manually selected answer25 for A.1 was labeled “numeric
value / fraction”, and “inequality”.

In question A.50, the user asks whether a series containing a fraction of powers and a
trigonometric function converges or diverges. We classified this question with the la-
bels “power / exponential / logarithmic”, “trigonometry”, and “sequence / summation”.
Our manually selected formula for B.50, $S_\varepsilon \le -\frac{\log\left(\frac{C'}{3}\right) + (1+\nu)\log\varepsilon}{C'\varepsilon^{\nu}}$, was labeled “inequality” and “powers / exponentials / logarithms”.


5.2          Question Types

We labeled the question types as shown in Table 4.

                              Table 4. Question type labels for Task 1.
     Label               Questions



25
       https://math.stackexchange.com/questions/3062860/finding-value-of-c-such-that-the-range-
      of-the-rational-function-fx-frac/3063081#3063081
 Value / fraction     A 1, 4
 Complex              A 12, 24, 29
 numbers
 Parameter            A 10, 28
 Probability          A5
 Modulus              A 7, 21, 47
 Pow / exp / log      A 16, 18, 27, 39, 48, 49, 50, 51, 65, 75, 79
 Integral             A 10, 13, 16, 17, 26, 46, 82, 95
 Trigonometry         A 17, 26, 27, 28, 43, 45, 50, 58, 70, 82, 95
 Approximation        A3
 Solve equation       A 2, 14, 26, 30, 43, 55, 58, 60, 67, 70, 71, 77, 86, 87, 89, 90
 Limes                A 8, 18, 26
 Algorithmic trans-   A 8, 9, 17, 26, 30, 43
 formations
 Show / prove         A 32, 36, 47
 Seq / sum            A 4, 9, 15, 22, 43, 46, 49, 50, 51, 59, 60, 71, 73, 83, 86
 Metrics              A 11
 Function             A 23, 25, 33, 34, 37, 40, 41, 42, 57, 63, 64, 74, 82, 84, 88, 91, 92, 95
 Sets                 A 20, 38, 44, 45, 46, 47, 49, 52, 54, 57, 59, 61, 62, 63, 64, 68, 69, 74, 81, 83, 84,
                      88, 89, 92, 94, 96, 97, 98
 Inequalities         A 21, 48, 52, 60, 61, 65, 74, 79, 86, 87, 95
 Derivative           A 33, 35
 Vectors /            A 44, 67, 90, 93, 97, 98
 matrices
 Interval             A 46
 Binomial             A 49
 Logic                A 32, 36, 47, 52, 56, 57, 62, 64, 68, 81, 97


The occurrence statistics of the individual question types are shown in Fig. 5. Appar-
ently, the major part of the questions involved “sets” of numbers. This is partly caused
by the set symbols for natural numbers ℕ or rational numbers ℚ appearing frequently
in definitions that are included in the question. The second-highest ranked label is
“function”. This is not surprising considering that functions are a heavily used notion
or concept in mathematics. To obtain this label, it was sufficient that a function identi-
fier appears in the question. The third-highest ranked label is “solve equations – alge-
braic or differential”. In many cases, provided enough information, the question can be
answered by using a computer algebra system (CAS) connected to the question answer-
ing engine.
               Fig. 5. Question type distribution of the ARQMath task question selection.


5.3      Question Subject Classes

Classifying the question subject classes, we see that almost all questions are pure math-
ematics, except A 33, which is from the math-stackexchange-category physics. Employing
subject class classifications can help to redirect questions and reduce the answer
space. Open-domain QA systems can then be modularized into distinct closed-domain
parts that handle different QA types differently. For example, a geometry question such
as “What is the surface area of a sphere?” can be parsed and answered differently than
an algebraic question such as “How to solve 𝑥 + 1 = 2?”. While the former could be
passed to a database containing properties of geometric objects, the latter could be
passed to a computer algebra system. On the other hand, physics questions often rely
heavily on the semantics of identifier names. As an example, the question “What is the
relationship between mass and energy?” should yield formulae such as 𝐸 = 𝑚 𝑐 2 or
𝐸 = ½ 𝑚 𝑣 2 . Without having annotated identifier names contained within the formu-
lae, the question cannot be answered.


5.4      Answer Types

We labeled our manually retrieved answer types as shown in Table 5.

                             Table 5. Answer type labels for Task 1.
 Label                Answers for Questions
 Value / fraction     A1
 Probability          A 5, 85
 Binomial             A 7, 41, 49, 51, 69, 86
 Pow / exp / log      A 7, 16, 18, 35, 39, 41, 43, 44, 47, 48, 51, 65, 73, 75, 85, 98
 Interval             A 3, 10, 46, 74, 82, 91
 Seq / sum            A 10, 13, 15, 18, 20, 22, 24, 26, 30, 41, 45, 49, 50, 51, 59, 69, 76, 94
 Set                  A 5, 19, 34, 37, 38, 40, 41, 42, 47, 49, 50, 52, 54, 57, 59, 62, 69, 75, 76, 80, 81, 83,
                      84, 87, 92, 94, 96, 97
 Inequality           A 1, 29, 35, 46, 48, 50, 65, 74, 83, 87, 96, 98
 Differential         A 14
 Integral             A 2, 10, 16, 17, 18, 26, 45, 82, 86
 Trigonometry         A 12, 17, 24, 26, 27, 28, 43, 45, 50, 58, 70, 79, 82, 95
 Function             A 3, 20, 23, 25, 40, 42, 46, 47, 57, 59, 63, 64, 68, 84, 88, 91
 Algebraic            A 12, 13, 14, 15, 16, 18, 20, 22, 25, 39, 48, 55, 58, 67, 69, 70, 71, 77, 79, 83, 85,
 transformation       88, 90
 Vector / matrix      A 11, 40, 44, 67, 90, 93, 97
 Logic                A 32, 36, 38, 46, 52, 54, 56, 62, 68
 Modulus              A 19, 21
 Complex numbers      24, 27, 29
 Limes                A 29, 75, 95
 Deriv                A 33, 86
 Cases                A 46


The occurrence statistics of the individual answer types are shown in Fig. 6. As for the
question types, “set” is still the most frequent label. However, “function” is here only
ranked fourth. The label “algebraic transformation” is ranked second. Some of the
transformations can be done using computer algebra systems. Apparently, the answer
and question categories differ. This means, for example, that given a short question, the
potentially longer answer (proof or other) can involve more categories.




       Fig. 6. Answer type distribution of the ‘manually’ retrieved MSE answer candidates.


5.5      Formula Types

We labeled the formula types as shown in Table 6.
                             Table 6. Formula type labels for Task 2.
 Label                Formulae
 Simple expressions   B 77, 81, 89, 90
 Number / fraction    B 1, 18
 Complex numbers      B 12, 24, 27, 29, 55
 Interval / range     B 10
 Parameter            B 89
 Inequality           B 10, 34, 48, 50, 65, 74, 75, 79, 87, 95, 96, 98
 Function             B 2, 14, 15, 25, 40, 46, 57, 59, 63, 64, 68
 Metrics              B 84
 Derivative           B 33
 Integral             B 2, 10, 16, 17, 45, 46, 82
 Binomial             B 4, 41, 69, 86
 Modulus              B 5, 6, 47
 Pow / exp / log      B 5, 34, 43, 47, 48, 50, 60, 65, 73, 75, 76, 79, 80, 86, 92, 98
 Trigonometry         B 12, 24, 27, 28, 43, 45, 57, 58, 70, 79
 Limes                B 17, 60, 75
 Cases                B 45
 Sets                 B 76, 92
 Approximations       B8
 Algebraic            B 4, 9, 11, 13, 16, 40, 55, 71, 74, 75, 85, 86, 88
 transformations
 Sequence / sum       B 4, 9, 13, 20, 30, 43, 52, 54, 60, 69, 71, 75, 83, 86, 87, 94, 96, 97
 Vectors / matrices   B 11, 33, 93, 94
 Logic                B 36, 56


The occurrence statistics of the individual formula types are shown in Fig. 7. Algebraic
transformations and functions are still ranked high. All in all, the most frequent ques-
tion, answer, and formula types involve sets, sequences, sums, powers, exponentials,
logarithms, trigonometric functions, inequalities, and algebraic transformations or
equation solving. In the future, one could explore whether the question classification
label enhances answer retrieval.
Fig. 7. Formula answer type distribution of the ‘manually’ retrieved LaTeX candidates’ strings.


6      Discussion of Challenges

Table 7 shows the results of our submission in the ARQMath lab. For Task 1, the re-
ported nDCG' score for our manual run is strikingly low. Hence, we tried to inves-
tigate the reasons for this low score. We identified one critical issue in our manual run.
We linked the posts from the ARQMath dataset with the real posts in MSE, which
makes it easier to crawl for relevant answers manually. However, this approach leads
to the problem that some of our reported answers do not exist in the ARQMath dataset.
Nonetheless, the nDCG' removes non-judged documents prior to evaluation. Hence, a
relatively high number of answers that do not exist in the dataset should not harm our
score dramatically. We can report an nDCG' score of 0.504 for our submitted run. This
is significantly higher than the score reported in the ARQMath results paper [29]. We
calculated the nDCG’ score as formulated in [30] and [31]
$$ \mathrm{nDCG}'_p = \frac{\mathrm{DCG}'_p}{\mathrm{IDCG}'_p}, $$
where
$$ \mathrm{DCG}'_p = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{IDCG}'_p = \sum_{i=1}^{|\mathrm{REL}_p|} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, $$
and $\mathrm{rel}_i$ is the given relevance score for the $i$-th element, and $\mathrm{REL}_p$ is the list of relevant
documents ordered by their relevance up to position $p$. In other words, the nDCG′p score
is the DCG′p score divided by the IDCG′p score, i.e., the DCG′p score for the ideal ordering
of relevant hits. The nDCG′p is calculated for every query in the test set. The overall score
is therefore calculated as the mean value of nDCG′p over all queries.
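For illustration, a minimal Python sketch of this nDCG′p computation is shown below; it assumes that unjudged documents have already been removed and that the relevance scores are passed in rank order (the toy numbers are illustrative).

```python
# Sketch: nDCG'_p for a single query. `ranked_relevances` are the relevance
# scores of our returned (already judged) hits in rank order; `judged_relevances`
# are all judged relevance scores available for that query.
from math import log2

def dcg(relevances):
    # position i is 1-based in the formula, hence log2(i + 2) with 0-based enumerate
    return sum((2 ** rel - 1) / log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_prime(ranked_relevances, judged_relevances, p=1):
    ideal = sorted(judged_relevances, reverse=True)[:p]   # ideal ordering up to position p
    idcg_p = dcg(ideal)
    return dcg(ranked_relevances[:p]) / idcg_p if idcg_p > 0 else 0.0

# overall score: mean over all queries for which an answer was returned
per_query = [ndcg_prime([2], [3, 2, 1]), ndcg_prime([1], [2, 1, 0])]
print(sum(per_query) / len(per_query))
```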
We identified two possible issues that could explain the mismatch between our calcu-
lated score and the reported one. The nDCG′p score is calculated for a fixed number 𝑝
of retrieved top hits. If 𝑝 is larger than the number of retrieved documents, it would
reduce the score. We assume that most contestants reported a list of relevant hits for
each query. Since we performed a manual run, we only reported the actual answer. This
means that for our reported answers it only makes sense to set 𝑝 = 1.

Moreover, we did not report valid answers for some queries (in case the answer ID did
not exist in the dataset, we had no valid answer in total for that particular query). If
these queries were considered when calculating the mean nDCG′p over all queries, it
would also explain a significantly lower score. The nDCG′p is designed not to take
unjudged documents into account. Similarly, it makes sense to ignore queries with no
returned answers when calculating the overall nDCG′p over all queries. Following these
rules, we calculated an nDCG′p of 0.504 for our manual run. Table 10 in the Appendix
shows our DCG′1 and IDCG′1 scores for all queries of Task 1 for which
we retrieved answers in our manual run and which were ranked by the ARQMath reviewers.
The final average score for nDCG′1 is 0.504.

In addition to the problematic score calculation, we found incomprehensible relevance
scores on multiple occasions. A possible reason for this is the subjectiveness of rele-
vance. While we found the reported answers highly relevant, the annotators provided a
relevance score of 0. Table 8 summarizes the identified problematic annotations. In
five out of nine of these cases, our reported answers were marked as correct by the
questioner at MSE (last column in Table 8) but annotated as non-relevant by the
ARQMath annotator. This seems to indicate that the relevance scores for ARQMath
tasks 1 and 2 are very subjective, even though the reported Kappa coefficient for inter-
annotator agreement was reasonably high at around 0.34.


        Table 7. Results of the zbMATH participation submission at the ARQMath Lab.

 RUN                DATA              nDCG'             MAP'              P@10
 zbMATH             Text & Math       0.101             0.053             0.030


Table 8. Topic and Post IDs that are marked as non-relevant by the ARQMath task reviewers
[29] but annotated as correct / helpful by the questioner in the Math Stack Exchange forum.
 Topic ID        Post ID        Relevance      MSE Marked as Correct
 A.17            5322           0              Yes
 A.21            65456          0              Yes
 A.35            170589         0              No
 A.42            331468         0              No
 A.50            110019         0              No
 A.68            188661         0              Yes
 A.75            2146297        0              Yes
 A.93            311354         0              Yes
 A.96            893752         0              No


6.1     Linking Text and Formulae

In the process of manual annotation and answer retrieval, we noticed several challenges
for IR systems. First, the question and answer features are obviously very heterogene-
ous data types (text and formulae). It remains to be explored how to combine both in a
suitable way. Recent studies [32] investigated the impact of different encoding combi-
nations on the classification accuracy and cluster purity on the NTCIR-11/12 arXiv
dataset [33]. They called out for a “formula encoding challenge” to exploit the formula
information for machine learning tasks. A successful encoding should, e.g., improve
the text classification accuracy. The aim is motivated by the observation that there is
little correlation between text and formula similarity, at least using the cosine measure
on tf-idf and doc2vec encodings. We need to somehow connect text and math such that
there is a synergy between their semantics. In the case of the mathematical question
answering task, this could be achieved by transforming the mathematical formula ele-
ments to textual entities. Consider for example the ARQMath task question A.29. The ques-
tion asks for a recipe to divide complex numbers by infinity (title: “Dividing Complex
Numbers by Infinity”). For this question, we manually retrieved the formula $\frac{a+bi}{\infty}=0$
from the answer that was selected by the questioner on MSE. One way to connect the
question to possible answer formulae would be to annotate both textual elements. Table
9 shows how linking to items of the semantic knowledge-base Wikidata 8 [20], [21] can
provide a connection via the joint QIDs Q1226939, Q11567, and Q205. A joint seman-
tic vector representation of both the title text and the formula could then be a concate-
nation of the Wikidata item embeddings, as proposed in [34].

Table 9. Possible semantic annotations of the question A.29 “Dividing Complex Numbers by
Infinity” to link text and formulae using Wikidata8 QIDs.
 Question text annotation                     Formula answer annotation
 “Dividing”: “division” (Q1226939)            𝑎+𝑏𝑖
                                                   : “division” (Q1226939)
                                               ∞
 “Adding”: “addition” (Q32043)                𝑎 + 𝑏𝑖: “addition” (Q32043)
 N/A                                          𝑎 + 𝑏𝑖: “complex number” (Q11567)

 N/A                                          𝑎: “real number” (Q12916)
 N/A                                          𝑏𝑖: “complex number” (Q9165172)
 “Infinity”: “infinity” (Q205)                ∞: “infinity” (Q205)

This example illustrates how linking Formula Concepts [16], [17] can be very benefi-
cial for mathematical question answering (on MSE, arXiv, Wikipedia , etc.). However,
this requires the semantic annotation of textual and formula elements, which can be
done, e.g., using the “AnnoMathTeX” 26 system [19] hosted by Wikimedia. In the fu-
ture, we should be able to automatically link text and formula entities to Wikidata items
and Wikipedia articles. It remains a challenging problem for mathematical formula en-
tity linking to exhaustively and unambiguously identify the important semantic parts of
a formula. In the future, annotation guidelines should be developed to tackle this prob-
lem.


6.2       Formula Search and Retrieval

For Task 2, we used the MOI search engine to retrieve relevant mathematical expres-
sions from the dataset. Since the MOI engine does not handle entire mathematical ex-
pressions by itself but disassembles formulae into their subexpressions, the concept of
linking retrieved MOIs back to a formula ID was challenging. Furthermore, the ap-
proach we used to calculate the formula ID of an MOI has some drawbacks. First, the
MOI engine retrieves relevant documents from elasticsearch with a textual search
query. In the second step, the MOIs are scored based on the retrieved documents. Thus,
the retrieved MOIs (and the corresponding formula IDs) are only as good as the retrieved
documents in the first step. When the retrieved documents are not relevant, none of the
retrieved MOIs can be relevant. Hence, the search results are quite sensitive to the set-
tings that were used to retrieve relevant documents. Nonetheless, the approach per-
formed reasonably well compared to the results of other competitors, with an nDCG′p
score of 0.374.


7         Outlook and Future Work

We are excited to employ our approaches and the approaches of other task participants
to retrieve relevant formulae on zbMATH datasets. However, as discussed before, we
are uncertain if the computed performance numbers are a suitable indicator to predict
the usefulness of the approaches to zbMATH users. We will, therefore, consider sug-
gesting a mathematical literature retrieval task in the future. However, as a prerequisite,
we see the need to research math-specific deterministic evaluation metrics that eliminate
task-specific human annotators in the loop. In contrast, we believe that objectively veri-
fiable or almost provable semantic enhancement techniques can significantly benefit
from a human review. While “relevance” (to an information need) is not yet a well-estab-
lished term among working mathematicians, definitions, equivalences, examples, sub-
stitutions, theorems, and proofs are well established. While formal mathematics is not
(yet) able to automatically map mathematical named entities to formal concepts, work-
ing mathematicians are generally able to create such a mapping with a very high inter-
reviewer agreement. Therefore, we aim to explore how employing our “AnnoMath-
TeX” formula annotation recommender system [19] on MSE questions and answers
can promote answer retrieval.



26
     annomathtex.wmflabs.org
To summarize the marginal results of our contribution: the kNN method can be employed
as a fast search engine, provided formulae are indexed as vector encodings. The fuzzy
string search is slower but has the advantage that no index is needed. The MOI search
returns results that are less strictly tied to existing expressions, since it considers
all subexpressions in the entire dataset; this helps to retrieve meaningful expressions
rather than exact matches.
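The following sketch contrasts the two lookup strategies. The toy formula list and the random stand-in vectors are hypothetical; a real setup would index precomputed formula encodings (e.g., Doc2Vec vectors) instead.

```python
# Sketch contrasting kNN search over formula encodings with index-free fuzzy
# string search. Formulae and vectors below are placeholders.
import numpy as np
from difflib import SequenceMatcher
from sklearn.neighbors import NearestNeighbors

formulas = [r"\int_0^1 x^2 \, dx", r"\sum_{n=1}^\infty \frac{1}{n^2}", r"a^2 + b^2 = c^2"]
vectors = np.random.rand(len(formulas), 64)  # stand-in for real formula encodings

# kNN: fast lookups, but the vector index must be built (and stored) beforehand.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
_, neighbor_ids = knn.kneighbors(np.random.rand(1, 64))
print("kNN candidates:", [formulas[i] for i in neighbor_ids[0]])

# Fuzzy string search: no index required, but every query scans all formulae.
query = r"a^2 + b^2 = c^2"
best_match = max(formulas, key=lambda f: SequenceMatcher(None, query, f).ratio())
print("Fuzzy match:", best_match)
```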


8      Acknowledgments

This work was supported by the German Research Foundation (DFG grant GI-1259-1).


References
[1] F. Müller and O. Teschke, “Full text formula search in zbMATH,” Eur. Math. Soc.
    Newsl, vol. 102, p. 51, 2016.
[2] B. Mansouri, A. Agarwal, D. Oard, and R. Zanibbi, “Finding Old Answers to New
    Math Questions: The ARQMath Lab at CLEF 2020,” in Advances in Information
    Retrieval, vol. 12036, J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro,
     M. J. Silva, and F. Martins, Eds. Cham: Springer International Publishing, 2020,
    pp. 564–571.
[3] B. Mansouri, R. Zanibbi, and D. W. Oard, “Characterizing Searches for Mathe-
    matical Concepts,” in 2019 ACM/IEEE Joint Conference on Digital Libraries
     (JCDL), Champaign, IL, USA, Jun. 2019, pp. 57–66, doi: 10.1109/JCDL.2019.00019.
[4] H. Karbasian and A. Johri, “Insights for Curriculum Development: Identifying
    Emerging Data Science Topics through Analysis of Q&A Communities,” in Pro-
    ceedings of the 51st ACM Technical Symposium on Computer Science Education,
    Portland OR USA, Feb. 2020, pp. 192–198, doi: 10.1145/3328778.3366817.
[5] N. W. Smith, “A Question-Answering System for Elementary Mathematics,” Apr.
     1974, Accessed: Jun. 22, 2020. [Online]. Available:
    https://eric.ed.gov/?id=ED093703.
[6] T. T. Nguyen, K. Chang, and S. C. Hui, “A math-aware search engine for math
    question answering system,” in Proceedings of the 21st ACM international con-
    ference on Information and knowledge management - CIKM ’12, Maui, Hawaii,
    USA, 2012, p. 724, doi: 10.1145/2396761.2396854.
[7] A. Bhattacharya, “A Survey of Question Answering for Math and Science Prob-
    lem,” Computing Research Repository (CoRR), May 2017, Accessed: Jun. 08,
    2020. [Online]. Available: http://arxiv.org/abs/1705.04530.
[8] A. A. S. Gunawan, P. R. Mulyono, and W. Budiharto, “Indonesian Question An-
    swering System for Solving Arithmetic Word Problems on Intelligent Humanoid
    Robot,” Procedia Computer Science, vol. 135, pp. 719–726, 2018, doi:
    10.1016/j.procs.2018.08.213.
[9] M. Schubotz, P. Scharpf, K. Dudhat, Y. Nagar, F. Hamborg, and B. Gipp, “Intro-
    ducing MathQA -- A Math-Aware Question Answering System,” Information
     Discovery and Delivery, vol. 46, no. 4, pp. 214–224, Nov. 2018, doi:
     10.1108/IDD-06-2018-0022.
[10] M. Hopkins, R. Le Bras, C. Petrescu-Prahova, G. Stanovsky, H. Hajishirzi, and R.
     Koncel-Kedziorski, “SemEval-2019 Task 10: Math Question Answering,” in Pro-
      ceedings of the 13th International Workshop on Semantic Evaluation, Minneap-
     olis, Minnesota, USA, 2019, pp. 893–899, doi: 10.18653/v1/S19-2153.
[11] D. C. Pineau, “Math-Aware Search Engines: Physics Applications and Overview,”
     Computing Research Repository (CoRR), Sep. 2016, Accessed: Jun. 21, 2020.
     [Online]. Available: http://arxiv.org/abs/1609.03457.
[12] A. Abdi, N. Idris, and Z. Ahmad, “QAPD: an ontology-based question answering
     system in the physics domain,” Soft Comput, vol. 22, no. 1, pp. 213–230, Jan.
     2018, doi: 10.1007/s00500-016-2328-2.
[13] T. Suzuki and A. Fujii, “Mathematical Document Categorization with Structure of
     Mathematical Expressions,” in 2017 ACM/IEEE Joint Conference on Digital Li-
     braries (JCDL), Toronto, ON, Canada, Jun. 2017, pp. 1–10, doi:
     10.1109/JCDL.2017.7991566.
[14] M. Schubotz, P. Scharpf, O. Teschke, A. Kühnemund, C. Breitinger, and B. Gipp,
     “AutoMSC: Automatic Assignment of Mathematics Subject Classification La-
     bels,” Proceedings of the CICM Conference 2020, May 2020, Accessed: Jun. 21,
     2020. [Online]. Available: http://arxiv.org/abs/2005.12099.
[15] S. Yang and Y. Ko, “Mathematical Formula Search using Natural Language Que-
     ries,” AECE, vol. 14, no. 4, pp. 99–104, 2014, doi: 10.4316/AECE.2014.04015.
[16] A. Dmello, “Representing Mathematical Concepts Associated With Formulas Us-
     ing Math Entity Cards,” Rochester Institute of Technology (RIT) Scholar Works,
     p. 167.
[17] P. Scharpf, M. Schubotz, H. S. Cohl, and B. Gipp, “Towards Formula Concept
     Discovery and Recognition,” Proceedings of the 4th BIRNDL Workshop at the
     42nd ACM SIGIR Conference 2019, p. 8.
[18] M. A. Dumitru, D. Ginev, M. Kohlhase, V. Merticariu, S. Mirea, and T. Wiesing,
     “System Description: KAT an Annotation Tool for STEM Documents,” Proceed-
     ings of the CICM Conference 2016, p. 4.
[19] P. Scharpf, I. Mackerracher, M. Schubotz, J. Beel, C. Breitinger, and B. Gipp,
     “AnnoMathTeX - a formula identifier annotation recommender system for STEM
     documents,” in Proceedings of the 13th ACM Conference on Recommender Sys-
      tems, Copenhagen, Denmark, Sep. 2019, pp. 532–533, doi:
     10.1145/3298689.3347042.
[20] P. Scharpf, M. Schubotz, and B. Gipp, “Representing Mathematical Formulae in
      Content MathML using Wikidata,” Proceedings of the 3rd BIRNDL Workshop at
     the 41st ACM SIGIR Conference 2018, p. 14.
[21] M. Schubotz, “Generating OpenMath Content Dictionaries from Wikidata,” Pro-
     ceedings of the CICM Conference 2018, p. 8.
[22] M. Schubotz, A. Greiner-Petter, P. Scharpf, N. Meuschke, H. S. Cohl, and B. Gipp,
     “Improving the Representation and Conversion of Mathematical Formulae by
     Considering their Textual Context,” in Proceedings of the 18th ACM/IEEE on
     Joint Conference on Digital Libraries, Fort Worth Texas USA, May 2018, pp.
     233–242, doi: 10.1145/3197026.3197058.
[23] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Docu-
     ments,” Proceedings of the ICML Conference 2014, p. 9.
[24] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Ma-
      chine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] R. Řehůřek and P. Sojka, Software Framework for Topic Modelling with Large
     Corpora. University of Malta, 2010.
[26] A. Greiner-Petter et al., “Discovering Mathematical Objects of Interest—A Study
     of Mathematical Notations,” in Proceedings of The Web Conference 2020, Taipei
     Taiwan, Apr. 2020, pp. 1445–1456, doi: 10.1145/3366423.3380218.
[27] M. Schubotz and G. Wicke, “Mathoid: Robust, Scalable, Fast and Accessible Math
     Rendering for Wikipedia,” in Intelligent Computer Mathematics - International
     Conference, CICM 2014, Coimbra, Portugal, July 7-11, 2014. Proceedings, 2014,
     vol. 8543, pp. 224–235, doi: 10/ggv8pz.
[28] S. Robertson and H. Zaragoza, “The Probabilistic Relevance Framework: BM25
     and Beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, Apr. 2009, doi:
     10.1561/1500000019.
[29] R. Zanibbi, D. W. Oard, A. Agarwal, and B. Mansouri, “Overview of ARQMath
     2020: CLEF Lab on Answer Retrieval for Questions on Math,” p. 25.
[30] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR tech-
     niques,” ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, Oct. 2002, doi:
     10.1145/582415.582418.
[31] C. Burges et al., “Learning to rank using gradient descent,” in Proceedings of the
     22nd international conference on Machine learning, Bonn, Germany, Aug. 2005,
     pp. 89–96, doi: 10.1145/1102351.1102363.
[32] P. Scharpf, M. Schubotz, A. Youssef, F. Hamborg, N. Meuschke, and B. Gipp,
     “Classification and Clustering of arXiv Documents, Sections, and Abstracts, Com-
     paring Encodings of Natural and Mathematical Language,” Proceedings of the
     JCDL Conference 2020, May 2020, doi: 10.1145/3383583.3398529.
[33] R. Zanibbi, A. Aizawa, and M. Kohlhase, “NTCIR-12 MathIR Task Overview,”
     Proceedings of the 12th NTCIR Conference on Evaluation of Information Access
     Technologies 2016, p. 10.
[34] A. Lerer et al., “PyTorch-BigGraph: A Large-scale Graph Embedding System,”
     Proceedings of the MLSys Conference 2019, Apr. 2019, Accessed: Jul. 16, 2020.
     [Online]. Available: http://arxiv.org/abs/1903.12287.
9          Appendix
Table 10. Results for the DCG′1, IDCG′1, and nDCG′1 scores for all Task 1 queries for which
we retrieved answers in our manual run and which were ranked by the ARQMath reviewers. The
final average nDCG′1 score is 0.504. The metrics rel1 and REL1 refer to the formulae in
Section 6 on page 19. A short sketch of the score computation follows the table.
    Topic ID   Post ID     Relevance 𝐫𝐞𝐥𝟏   Best Relevance 𝐫𝐞𝐥𝟏 in 𝐑𝐄𝐋𝟏   𝐃𝐂𝐆′𝟏   𝐈𝐃𝐂𝐆′𝟏   𝐧𝐃𝐂𝐆′𝟏
    A.12       44410       2            3                3        7          0.43
    A.13       1115317     2            3                3        7          0.43
    A.14       2248783     3            3                7        7          1
    A.16       408304      1            3                1        7          0.14
    A.17       5322        0            3                0        7          0
    A.19       1348396     3            3                7        7          1
    A.20       23977       2            3                3        7          0.43
    A.21       65456       0            3                0        7          0
    A.30       2721623     3            3                7        7          1
    A.35       170589      0            3                0        7          0
    A.37       11442       3            3                7        7          1
    A.41       334435      3            3                7        7          1
    A.42       331468      0            3                0        7          0
    A.45       422348      3            3                7        7          1
    A.47       2326614     2            3                3        7          0.43
    A.50       110019      0            3                0        7          0
    A.52       632129      1            3                1        7          0.14
    A.54       39285       3            3                7        7          1
    A.56       412396      2            3                3        7          0.43
    A.59       194715      3            3                7        7          1
    A.60       381303      2            3                3        7          0.43
    A.62       659332      3            3                7        7          1
    A.63       319310      2            2                3        3          1
    A.67       75362       2            3                3        7          0.43
    A.68       188661      0            3                0        7          0
    A.69       1490891     3            3                7        7          1
    A.74       705071      2            3                3        7          0.43
    A.75       2146297     0            3                0        7          0
    A.85       364135      2            3                3        7          0.43
    A.93       311354      0            3                0        7          0
A.96   893752    0   3   0   7   0
A.98   1580068   3   3   7   7   1
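The per-topic values in Table 10 are consistent with the gain function 2^rel − 1 of cumulated-gain measures [30] evaluated at cut-off 1, where the logarithmic rank discount equals 1. The following short sketch only makes that assumed computation explicit; it is not the official evaluation script.

```python
# Sketch reproducing the per-topic scores in Table 10, assuming the gain 2^rel - 1
# at cut-off 1 (the rank-1 logarithmic discount is 1, so DCG'_1 equals the gain).
def dcg_at_1(rel):
    return 2 ** rel - 1

def ndcg_at_1(rel, best_rel):
    ideal = dcg_at_1(best_rel)
    return dcg_at_1(rel) / ideal if ideal else 0.0

# Topic A.12: relevance 2, best relevance 3 -> DCG'_1 = 3, IDCG'_1 = 7, nDCG'_1 = 0.43
print(dcg_at_1(2), dcg_at_1(3), round(ndcg_at_1(2, 3), 2))
```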