<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Dowsing for Answers to Math Questions: Ongoing Viability of Traditional MathIR</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yin</forename><forename type="middle">Ki</forename><surname>Ng</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">David R. Cheriton School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<postCode>N2L 3G1</postCode>
									<settlement>Waterloo</settlement>
									<region>ON</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dallas</forename><forename type="middle">J</forename><surname>Fraser</surname></persName>
							<email>dallas.fraser.waterloo@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Knowledgehook Inc</orgName>
								<address>
									<addrLine>151 Charles St W</addrLine>
									<postCode>N2G 1H6</postCode>
									<settlement>Kitchener</settlement>
									<region>ON</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Besat</forename><surname>Kassaie</surname></persName>
							<email>bkassie@uwaterloo.ca</email>
							<affiliation key="aff0">
								<orgName type="department">David R. Cheriton School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<postCode>N2L 3G1</postCode>
									<settlement>Waterloo</settlement>
									<region>ON</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Frank</forename><forename type="middle">Wm</forename><surname>Tompa</surname></persName>
							<email>fwtompa@uwaterloo.ca</email>
							<affiliation key="aff0">
								<orgName type="department">David R. Cheriton School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<postCode>N2L 3G1</postCode>
									<settlement>Waterloo</settlement>
									<region>ON</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Dowsing for Answers to Math Questions: Ongoing Viability of Traditional MathIR</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">89A8CD91BE8313B12FE0875268F20FE2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:45+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Community Question Answering (CQA)</term>
					<term>Mathematical Information Retrieval (MathIR)</term>
					<term>Symbol Layout Tree (SLT)</term>
					<term>Mathematics Stack Exchange (MSE)</term>
					<term>ARQMath Lab</term>
					<term>Tangent-L</term>
					<term>formula matching</term>
					<term>proximity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present our application of the math-aware search engine Tangent-L to the 2021 ARQMath Lab. This is a continuation of our MathDowsers submissions to last year's Lab, where we produced the best Task 1 participant run. Since then, we have improved the search engine's formula retrieval power by considering additional math features in the ranking function. This year, we also explore two approaches to incorporating proximity in evaluating the suitability of a document to be considered a match to a query.</p><p>For the 2021 ARQMath Lab, our primary run in Task 1 produces an nDCG ′ value of 0.434, which is nearly five points higher than that produced by the second-best participant run. An unsubmitted run, which corrects the setup of the primary run and preserves duplicate keyword terms during query term extraction, produces an even higher nDCG ′ of 0.462. Meanwhile, our primary run in Task 2 produces an nDCG ′ value of 0.552, which is the best automatic run and is comparable to the best participant run, a manual run from the Approach0 team.</p><p>The success of our runs continues to demonstrate that a traditional math information retrieval system remains a viable option for Community Question Answering specialized in the mathematical domain and for in-context formula retrieval.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The growing popularity of Community Question Answering (CQA) sites such as Math Stack Exchange 1 (MSE) and Math Overflow 2 demonstrates the need to find answers to mathematical questions, especially for questions posed in mathematical natural language. An effective question answering system capable of handling mathematical formulas and terminology would be of great interest to help serve this need.</p><p>The ARQMath Lab at CLEF 2021 <ref type="bibr" target="#b0">[1]</ref>, hereafter referenced as ARQMath-2, continues the previous year's Lab <ref type="bibr" target="#b1">[2]</ref> (ARQMath-1) by sponsoring an evaluation exercise centered on a CQA Task with questions involving math data. The Labs use a collection of questions and answers from MSE between 2010 and 2018 consisting of approximately 1.1 million question-posts and 1.4 million answer-posts. In this Lab series, Task 1 is the CQA Task in which participants are asked to return potential answers to unseen mathematical questions from among existing answer-posts in the collection. The closely related Task 2 considers formula retrieval in-context, in which formulas within questions serve as queries for matching relevant formulas from question-posts and answer-posts in the same collection.</p><p>In ARQMath-1, the Waterloo team of MathDowsers (Figure <ref type="figure" target="#fig_0">1</ref>) participated in Task 1, and our best run achieved an nDCG ′ value of 0.345 <ref type="bibr" target="#b2">[3]</ref>, which outperformed other participating systems <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. 
Our approach was a three-stage Mathematics Information Retrieval (MathIR) system centered around the use of a math-aware search engine, Tangent-L <ref type="bibr" target="#b7">[8]</ref>: first, topics of mathematical questions were automatically transformed into formal queries consisting of keywords and formulas; then the formal queries were executed against a corpus of MSE question-answer pairs by Tangent-L; finally, results were re-ranked based on a linear regression model trained on CQA metadata using mock relevance assessments. Submissions were made based on different configurations in each stage of the system, and the best run was produced without re-ranking, demonstrating the success of a traditional math-aware query system in addressing a CQA task specialized in the mathematical domain.</p><p>For ARQMath-2, we participate again as the MathDowsers team for Task 1 and (for the first time) Task 2, with the goal of continuing to explore the potential of a traditional math-aware query system in tackling both tasks. In particular, we are interested in further developing the formula matching capability of our core math-aware search engine Tangent-L, given the satisfactory performance observed over formula-dependent questions in ARQMath-1 <ref type="bibr" target="#b8">[9]</ref>. With the enhanced Tangent-L, we then refine our system for Task 1 and develop two baseline approaches for Task 2.</p><p>Our refinement is successful, with our primary run for Task 1 continuing to be the best participant run with respect to the primary measure nDCG ′3 . Our primary run for Task 2 turns out to be the most effective automatic run, essentially indistinguishable from the best participant run, a manual run from the Approach0 team. In this paper, we present:</p><p>• an updated Tangent-L with several avenues that improve its formula matching capability, 3 Normalized Discounted Cumulative Gain (nDCG) with unjudged documents removed </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Generated repetition tokens for the formula in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>• a refinement of our system for mathematical answer retrieval with respect to query conversion and searching with Tangent-L, • two related approaches that are motivated by proximity, • two simple baselines for in-context formula retrieval, based on our developed system, and • performance results for both Task 1 and Task 2 in ARQMath-2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Improving Formula Matching with Tangent-L</head><p>Tangent-L is the cornerstone of our system for the tasks. It is a traditional math-aware query system built on the popular Lucene text search platform <ref type="bibr" target="#b9">[10]</ref>. During both index time and search time, it converts a formula into a bag of math tokens that each capture local characteristics of the Symbol Layout Tree (SLT) representation of a formula <ref type="bibr" target="#b10">[11]</ref>, so that mathematical documents can be matched against a query through text tokens and converted math tokens using a weighted BM25 + ranking <ref type="bibr" target="#b11">[12]</ref>.</p><p>The basic math tokens used by Tangent-L and the approach to weighting text against math tokens are described elsewhere <ref type="bibr" target="#b8">[9]</ref>. In this section, we describe improvements tested in this year's Lab.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Repeated Symbols</head><p>Repetitions of symbols are commonplace in a formula; for instance, 𝑥 repeats in the formula 𝑥 2 + 3 𝑥 + 𝑥, as does the operator +. Ideally, a search for either 𝑦 𝑥 − 𝑥 or 6𝑥 3 − 𝑦 + 𝑥 could match that formula because of the pattern of repetitions for 𝑥, and a search for 2𝑦 3 + 𝑦 + 5 could also match because of the repeated symbol +.</p><p>With this motivation, a new type of token-repetition tokens-is introduced into Tangent-L's formula representation to capture this characteristic. Repetition tokens are generated based on the relative positions of the repeated symbols in the formula's SLT representation. For every pair of repeated symbols:</p><p>1. if the pair of repeated symbols reside on the same root-to-leaf path of the SLT (that is, one is an ancestor of the other), then a repetition token {symbol, 𝑝} is generated, where 𝑝 represents the path between the repeated symbols; 2. otherwise, a repetition token {symbol, 𝑝 1 , 𝑝 2 } is generated, where 𝑝 1 and 𝑝 2 represent the paths from the closest common ancestor in the SLT to each repeated symbol.</p><p>If a symbol repeats 𝑘 times where 𝑘 &gt; 1, 𝐶(𝑘, 2) = 𝑘(𝑘 − 1)/2 repetition tokens are generated for that symbol following the above procedure. For each of these tokens, an additional "location" token is generated with the augmentation of the path traversing from the root to the closest common ancestor of the pair. As such, a total of 2 • 𝐶(𝑘, 2) repetition tokens are generated and indexed. Table <ref type="table" target="#tab_1">1</ref> shows the repetition tokens that would be indexed for the formula 𝑥 2 + 3 𝑥 + 𝑥 in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Revised Ranking Formula</head><p>With the introduction of repetition tokens, Tangent-L now generates three token types: text tokens, regular math tokens, and repetition tokens from documents or queries containing mathematical expressions. During a search, Tangent-L applies BM25 + ranking to the query terms and the document terms, using custom weights for each class of token as described here.</p><p>Let 𝑞 𝑡 be the set of text tokens, 𝑞 𝑚 be the set of regular math tokens, and 𝑞 𝑟 be the set of repetition tokens generated for the query terms. Let 𝑑 be a document represented by the set of all its indexed tokens. Then the revised ranking formula with the repetition tokens is:</p><formula xml:id="formula_0">BM25 + w (𝑞 𝑡 ∪ 𝑞 𝑚 ∪ 𝑞 𝑟 , 𝑑) = 𝛼 • 𝛾 • BM25 + (𝑞 𝑟 , 𝑑) + (1 − 𝛾) • BM25 + (𝑞 𝑚 , 𝑑) max(𝛾, 1 − 𝛾) + (1 − 𝛼) • BM25 + (𝑞 𝑡 , 𝑑)<label>(1)</label></formula><p>where 𝛼 and 𝛾 are parameters ranging from 0 to 1. The value of 𝛼 balances the weight of math features against keyword features, while the value of 𝛾 balances the weight of repetitions within math formulas against other math features. Both parameters can be tuned based on the target dataset.</p></div>
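Equation 1 can be read as a two-level convex combination. A minimal sketch, assuming the three per-token-class BM25+ scores are already computed (the function and parameter names are illustrative, not Tangent-L's API):

```python
def weighted_score(bm25_text, bm25_math, bm25_rep, alpha=0.25, gamma=0.1):
    """Combine per-token-class BM25+ scores as in Equation 1.

    bm25_text, bm25_math, bm25_rep: precomputed BM25+ scores of the
    document against the text, regular-math, and repetition query tokens.
    alpha balances math against keyword features; gamma balances
    repetition tokens against other math features. The defaults are the
    values the paper reports using for its primary run.
    """
    math_part = (gamma * bm25_rep + (1 - gamma) * bm25_math) / max(gamma, 1 - gamma)
    return alpha * math_part + (1 - alpha) * bm25_text

# with gamma = 0 the math part reduces to the regular math score alone
assert weighted_score(0.0, 1.0, 7.0, alpha=1.0, gamma=0.0) == 1.0
```

The division by max(𝛾, 1 − 𝛾) rescales the inner combination so that the dominant math component keeps full weight regardless of how 𝛾 splits it.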
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Formula Normalization</head><p>Mathematical expressions can be rewritten in numerous ways without altering their meaning. For example, 𝐴 + 𝐵 matches 𝐵 + 𝐴 semantically because of the commutative law. To accommodate such variability and increase recall, we equip Tangent-L with the ability to generate similar math features for two formulas with the same semantics. We consider the following five classes of semantic matches: Commutativity, Symmetry, Alternative Notation, Operator Unification, and Inequality Equivalence. The adjustments to handle the first two classes, Commutativity and Symmetry, are similar. Recall that originally Tangent-L generates a math token for each pair of adjacent symbols with their orders preserved. For example, two math tokens (𝐴, +, →) and (+, 𝐵, →) are generated for the expression 𝐴 + 𝐵, and two different math tokens (𝐵, +, →) and (+, 𝐴, →) are generated for the expression 𝐵 + 𝐴. In order for an exact match to take place for the two expressions, a simple adjustment to the math tokens is to ignore the order of a pair of adjacent symbols whenever commutative operators or symmetric relations are involved. With this approach, both expressions 𝐴 + 𝐵 and 𝐵 + 𝐴 generate the same pair of math tokens, (+, 𝐴, →) and (+, 𝐵, →), so that an exact match is made possible. <ref type="foot" target="#foot_0">4</ref>The next two classes, Alternative Notation and Operator Unification, can be easily accommodated by choosing a canonical symbol for each equivalence class of operators and consistently using only the canonical symbols in any math tokens generated as features.</p><p>The final class, Inequality Equivalence, can be handled by choosing a canonical symbol (for instance, choosing the symbol "≤" in preference to "≥") and then reversing the operands whenever necessary during math token generation. 
<ref type="foot" target="#foot_1">5</ref>For each of these five classes of semantic matches, Tangent-L provides a separate flag to control whether or not the class is to be supported, so that only those deemed to be advantageous are applied when math tokens are generated.</p></div>
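The canonicalization idea behind these classes can be sketched as follows. The symbol tables and token shape are illustrative assumptions, not Tangent-L's actual tables; the point is only that equivalent writings collapse to identical tokens.

```python
COMMUTATIVE = {'+', '·'}      # commutative operators (illustrative subset)
SYMMETRIC = {'=', '≠'}        # symmetric relations (illustrative subset)
# one canonical symbol per equivalence class of operators
CANON = {'\\leq': '≤', '\\le': '≤', '*': '·', '\\cdot': '·'}

def adjacency_token(a, b):
    """Token for an adjacent symbol pair (a, b) on an SLT 'next' edge.

    For commutative operators and symmetric relations the operator is
    always written first, so A+B and B+A produce the same pair of
    tokens: (+, A, ->) and (+, B, ->).
    """
    a, b = CANON.get(a, a), CANON.get(b, b)
    if b in COMMUTATIVE | SYMMETRIC:
        a, b = b, a            # ignore the order of the adjacent pair
    return (a, b, '->')

# A+B and B+A generate the same set of tokens
assert {adjacency_token('A', '+'), adjacency_token('+', 'B')} == \
       {adjacency_token('B', '+'), adjacency_token('+', 'A')}
```

Inequality Equivalence would additionally reverse the operand order when rewriting "≥" as the canonical "≤", which this sketch omits.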
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Data Cleansing</head><p>For the ARQMath dataset, the original L A T E X formulas from the Math Stack Exchange collections are wrapped within an identifiable block (a span tag with class="math-container" and an id identifier), and the corresponding Presentation MathML representations are provided as separate files. Since the input to Tangent-L includes formulas encoded in Presentation MathML, its formula matching ability will be hindered when the quality of the MathML representation is poor or conversions from L A T E X are missing.</p><p>Thanks to the effort from the Lab organizers, coverage of the Presentation MathML for detected formulas has been increased from 92% for ARQMath-1 to over 99% for ARQMath-2 <ref type="bibr" target="#b12">[13]</ref>. However, further cleansing is still beneficial; we improve the data in preparation for search as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Correcting Conversion Errors:</head><p>The provided Presentation MathML, generated from the L A T E X representation using LaTeXML<ref type="foot" target="#foot_2">6</ref> , contains conversion errors for formulas including either less-than "&lt;" or greater-than "&gt;" operators. In particular, when a L A T E X formula contains the operator "&lt;", it is first encoded as "&amp;lt;", but then erroneously escaped again to form "&amp;amp;lt;". This results in an erroneous encoding in Presentation MathML, as shown in Table <ref type="table" target="#tab_2">2</ref>.</p><p>As part of our data preparation, Presentation MathML encodings with doubly-escaped representations for "&lt;" and "&gt;" are recognized with regular expression matching and replaced by our own converted representations, improving 869,074 (∼ 3%) formulas.</p><p>Providing Missing Formula Identifiers: Approximately 10% of the annotated formulas in the postings are not correctly and completely captured, many missing their unique formula identifiers, as shown in Figure <ref type="figure" target="#fig_2">3</ref>. In this case, our program is unable to locate their Presentation MathML representations in the file provided by the Lab organizers.</p><p>Formulas such as those in Figure <ref type="figure" target="#fig_2">3</ref> are recognized as much as possible through regular expression matching for text within $ and $$ blocks. These are then checked against the formula file provided by the lab organizers to reverse-trace their formula-ids. As a result, our program is able to capture over 99% of the formulas, including the 10% that are improperly represented in math-container blocks without ids.</p></div>
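The double-escaping repair amounts to undoing one level of entity escaping. A minimal sketch of the idea (the pattern and function are illustrative, not the exact pipeline used for the runs):

```python
import re

# A doubly-escaped "<" appears in the provided MathML as "&amp;lt;"
# (and ">" as "&amp;gt;"); undoing one escaping level restores the
# intended single-escaped entity.
DOUBLE_ESCAPED = re.compile(r'&amp;(lt|gt);')

def fix_double_escaping(mathml: str) -> str:
    return DOUBLE_ESCAPED.sub(r'&\1;', mathml)

assert fix_double_escaping('<mo>&amp;lt;</mo>') == '<mo>&lt;</mo>'
```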
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Task 1: Finding Answers to Math Questions</head><p>In Task 1, participants are given mathematical questions selected from MSE posts from either year 2019 (for ARQMath-1) or year 2020 (for ARQMath-2). Each question is formatted as a topic that contains a unique identifier, the title, the question body text, and the tags. Participant systems are asked to return the top-1000 potential answer-posts for each of the topics from the MSE collection. For ARQMath-2, we continue to use the three-stage system adopted for ARQMath-1 <ref type="bibr" target="#b8">[9]</ref>:</p><p>Stage 1 Conversion: Transform the input (a mathematical question posed on MSE) into a well-formulated query consisting of a bag of formulas and keywords.</p><p>Stage 2 Searching: Use Tangent-L, the math-aware search engine, to execute the formal query to find the best matches against an indexed document corpus created from the collection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Stage 3</head><p>Re-ranking: Re-order the best matches with a run-specific re-ranking model.</p><p>In this section, we describe various modifications we wished to explore. We first validate the benefits of each modification using the ARQMath-1 benchmark, and then we test them using the ARQMath-2 benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Conversion: Fine-tuning Keyword Extraction from Formulas</head><p>For ARQMath-1, our designed automated mechanism used to extract query keywords and formulas from the task topics was shown to be competitive with the human ability to select search terms <ref type="bibr" target="#b8">[9]</ref>, as it produces a result that is comparable to the manual set of query terms selected by the Lab organizers. For ARQMath-2, we fine-tune this automated mechanism using the ARQMath-1 benchmark for validation as follows:</p><p>1. Keywords within a formula representation are intentionally<ref type="foot" target="#foot_3">7</ref> retained and extracted, as a drop in nDCG ′ occurs if they are removed. For example, "mod" is a crucial keyword for topic 𝐴.7-Finding out the remainder of (11¹⁰ − 1)/100 using modulus-but this word is present within a formula representation only and not anywhere else in the text. Similarly, "sin", "cos", "tan" can be extracted from \sin, \cos, \tan in formula representations after punctuation is removed. 2. Every term extracted by the automated mechanism should become part of the query, with its weight boosted naturally if it repeats. <ref type="foot" target="#foot_4">8</ref> On the other hand, restricting the number of keywords and formulas extracted by the mechanism (as we had hypothesized to be a possible improvement last year) does not show an improved result.</p><p>After fine-tuning the automated mechanism, results obtained for the ARQMath-1 benchmark consistently outperform those obtained with the manual set of query terms, validating the potential of this mechanism. For ARQMath-2, we continue to use question-answer pairs as indexing units for the document corpus, as performance on the ARQMath-1 benchmark worsens if the content of the associated question is dropped and only text from each answer is indexed. 
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Searching: Enriching the Document Corpus</head><p>In addition to the fields included for ARQMath-1, comments<ref type="foot" target="#foot_5">9</ref> associated with answers are also included. As a result, more formulas and more text words are available for matching. Figure <ref type="figure" target="#fig_3">4</ref> illustrates the fields indexed as part of each question-answer pair.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Various proximity measures <ref type="bibr" target="#b13">[14]</ref>, each of which can also be normalized by document length.</p><p>Span: length of the shortest document segment that covers all query term occurrences in a document, including repeated occurrences. Normalized-Span: Span divided by the number of matched instances. Min-Span: length of the shortest document segment that covers each matched query term at least once in a document. Normalized-Min-Span: Min-Span divided by the number of matched query terms. Min-Distance: smallest distance value of all pairs of unique matched query terms. Ave-Distance: average distance value of all pairs of unique matched query terms. Max-Distance: largest distance value of all pairs of unique matched query terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparison of proximity measures on the ARQMath-1 benchmark for highly relevant (HR), relevant (R), partially relevant (PR), and non-relevant (NR) math answers, where Δ(𝑎, 𝑏) = (prox(𝑎) − prox(𝑏)) / (0.5 • (prox(𝑎) + prox(𝑏))).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Re-ranking: Proximity</head><p>Whereas in ARQMath-1 we attempted re-ranking the retrieved answers from Tangent-L based on CQA metadata, for ARQMath-2 we investigate the possibility of re-ranking based on proximity. Proximity is a measure of distance between matched query terms as detailed in Table <ref type="table">3</ref>, which can be a strong signal for document relevancy. Following the experimental design used by Tao and Zhai <ref type="bibr" target="#b13">[14]</ref>, we measure the average proximity of search terms for highly relevant, relevant, partially relevant, and non-relevant documents in the ARQMath-1 benchmark. The experimental result is shown in Table <ref type="table">4</ref>. We observe strong signals from several measures that distinguish relevance with the correct order (marked in gradient orange), particularly for normalized-span which correctly orders all four levels of relevancy (a smaller normalized-span indicating a higher level of relevancy) without the need to be normalized by document length. Motivated by this finding, for ARQMath-2 we attempt re-ranking of the retrieved answers by Tangent-L in increasing order of normalized-span, breaking ties by a decreasing BM25 + score returned from Tangent-L.</p></div>
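The normalized-span measure used for re-ranking can be sketched over a tokenized document. This is an illustrative sketch of the definitions in Table 3 (after Tao and Zhai), not the code used for the runs:

```python
def span_and_normalized_span(doc_tokens, query_terms):
    """Span and Normalized-Span proximity measures.

    Span: length of the shortest segment covering every occurrence of a
    matched query term, i.e. from the first to the last matched position.
    Normalized-Span: Span divided by the number of matched instances.
    Returns (None, None) when no query term occurs in the document.
    """
    positions = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    if not positions:
        return None, None
    span = positions[-1] - positions[0] + 1
    return span, span / len(positions)

doc = 'we prove x squared plus x is convex for x'.split()
span, nspan = span_and_normalized_span(doc, {'x', 'convex'})
assert span == 8 and nspan == 2.0   # matches at positions 2, 5, 7, 9
```

Re-ranking then sorts the retrieved answers by increasing normalized-span, breaking ties by decreasing BM25+ score, as described above.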
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Matching Formulas Holistically</head><p>Formula matching within Tangent-L is based on comparing a set of math tokens from the query to those from each document (Equation <ref type="formula" target="#formula_0">1</ref>). If we index a document that has multiple formulas, math tokens generated from all the formulas within the document are considered as a single unordered bag of terms. However, given the strong signal of proximity playing a role in document relevancy (Table <ref type="table">4</ref>), we hypothesize that matching each formula as a whole within a document, instead of matching math tokens irrespective of formulas that might scatter across a document, could produce a better result. <ref type="foot" target="#foot_6">10</ref> As such, as a post-experiment we design a holistic formula search as follows:</p><p>At preparation time, we first pre-build a formula corpus for Tangent-L that indexes all visually distinct formulas in the MSE dataset, each as a separate document with the formula's visual-id serving as a key. We define the formula similarity between two formulas to be the normalized BM25 + score for one formula when the other formula acts as a query. When indexing the question-answer corpus, rather than replacing each formula within the document by the set of math tokens generated for that formula, we represent each formula by a single holistic formula token that contains the formula's visual-id (that is, its key from the formula corpus). At query time, we first search for each query formula in the formula corpus and then replace the formula text in the query by the keys of the top-𝑘 most similar formulas, thus changing the query to search for those visual-ids (as well as whatever keywords are also part of the query, of course). 
Finally, the ranking formula for documents is revised to weight each match of a formula id by its formula similarity with respect to the original query formula.</p><p>In the following subsections, we describe these ideas in greater detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Formula Corpus</head><p>The formula corpus is built by extracting all visually distinct formulas from the document corpus described in Section 3.2-including formulas found within questions, answers, and comment posts. Each formula in this corpus is associated with the formula's visual-id, which serves as a key. The resulting corpus contains 8,595,899 out of 9,329,274 (∼ 92%) visually distinct formulas and is indexed by Tangent-L under the setup described in Section 2, each formula being considered as a document.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Normalized Formula Similarity</head><p>We define "formula similarity" as follows: Let 𝑓 𝑞 be an arbitrary formula used as a query, 𝐹 be the set of formulas in the formula corpus, and 𝑓 ∈ 𝐹 . Let RawScore(𝑓 𝑞 , 𝑓 ) represent the score obtained for formula 𝑓 when the query is 𝑓 𝑞 , using the following definition:</p><formula xml:id="formula_1">RawScore(𝑓 𝑞 , 𝑓 ) = (1 − 𝛾) • BM25 + (𝑞 𝑚 , 𝑓 ) + 𝛾 • BM25 + (𝑞 𝑟 , 𝑓 )<label>(2)</label></formula><p>where 𝑞 𝑚 is the set of regular math tokens and 𝑞 𝑟 is the set of repetition tokens in a query formula 𝑓 𝑞 . As in Equation 1, 0 ≤ 𝛾 ≤ 1 balances the weight of repetition tokens against regular math tokens.</p><p>The Normalized Formula Similarity of 𝑓 with respect to 𝑓 𝑞 is:</p><formula xml:id="formula_2">𝑁 (𝑓, 𝑓 𝑞 ) = RawScore(𝑓 𝑞 , 𝑓 ) / max 𝜙∈𝐹 RawScore(𝑓 𝑞 , 𝜙)<label>(3)</label></formula><p>The value of 𝑁 (𝑓, 𝑓 𝑞 ) is in the range [0,1] and represents how well the query formula 𝑓 𝑞 is matched by 𝑓 relative to other formulas within the formula corpus.</p></div>
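Equation 3 is a max-normalization over the formula corpus. A minimal sketch, assuming the raw scores have already been computed as a mapping from a formula's visual-id to its RawScore (the names are illustrative):

```python
def normalized_similarity(raw_scores):
    """Equation 3 as a dictionary operation: divide each formula's
    RawScore by the corpus-wide maximum, mapping scores into [0, 1].

    raw_scores: dict mapping a formula's visual-id to RawScore(f_q, f)
    for a fixed query formula f_q.
    """
    top = max(raw_scores.values())
    return {fid: s / top for fid, s in raw_scores.items()}

sims = normalized_similarity({'f1': 2.0, 'f2': 4.0, 'f3': 1.0})
assert sims == {'f1': 0.5, 'f2': 1.0, 'f3': 0.25}
```

The best-matching formula thus always receives similarity 1, which makes scores comparable across query formulas of different sizes.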
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.3.">Holistic formula token</head><p>A holistic formula token is a placeholder token that incorporates the formula's visual-id. Formulas in a question-answer document are replaced by their holistic formula tokens only, so that when searching the question-answer corpus, formulas can only be matched as a whole.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.4.">Ranking for Holistic Search</head><p>Let 𝑞 𝑡 be the set of keyword tokens and 𝑞 𝑓 be the set of query formulas. Let 𝑓 𝑞 ∈ 𝑞 𝑓 be a query formula and let 𝑆 𝜅 (𝑓 𝑞 ) be the set of keys of the top-𝜅 most similar formulas with respect to 𝑓 𝑞 , determined by Normalized Formula Similarity. Let 𝑑 be a document represented by the set of all its indexed tokens. When searching the document corpus, we adopt the following variant of BM25 + :</p><formula xml:id="formula_3">BM25 + w (𝑞 𝑡 ∪ 𝑞 𝑓 , 𝑑) = (1 − 𝛼) • BM25 + (𝑞 𝑡 , 𝑑) + 𝛼 • BM25 + (𝑞 𝑓 , 𝑑)<label>(4)</label></formula><p>and</p><formula xml:id="formula_4">BM25 + (𝑞 𝑓 , 𝑑) = ∑︁ 𝑓 𝑞 ∈ 𝑞 𝑓 ∑︁ 𝑓 ∈ (𝑑 ∩ 𝑆 𝜅 (𝑓 𝑞 )) 𝑁 (𝑓, 𝑓 𝑞 ) • ( ((𝑘 + 1) • tf 𝑓 ) / (𝑘 • (1 − 𝑏 + 𝑏 • |𝑑|/𝑑) + tf 𝑓 ) + 𝛿 ) • log ((|𝐷| + 1) / |𝐷 𝑓 |)<label>(5)</label></formula><p>where, as in Equation <ref type="formula" target="#formula_0">1</ref>, 0 ≤ 𝛼 ≤ 1 balances math features against keyword features. 11</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Task 1: Runs and Result</head><p>Parameter settings are chosen based on testing with the ARQMath-1 benchmark. For ARQMath-2, we prepared four automatic runs: 11 As usual for BM25 + <ref type="bibr" target="#b14">[15]</ref>, 𝑘, 𝑏, and 𝛿 are constants (following common practice, chosen to be 1.2, 0.75, and 1, respectively); tf 𝑓 is the number of occurrences of formula 𝑓 in 𝑑; |𝑑| is the total number of terms in 𝑑; 𝑑 = (∑ 𝑑∈𝐷 |𝑑|) / |𝐷| is the average document length; and |𝐷 𝑓 | is the number of documents in 𝐷 containing formula 𝑓 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>The setup for the primary run for ARQMath-2.</p><p>Repeated Symbols (Sect. 2.1): Repetition tokens are adopted.</p><p>Revised Ranking Formula (Sect. 2.2): In Equation <ref type="formula" target="#formula_0">1</ref>, 𝛼 = 0.25 and 𝛾 = 0.1.</p><p>Formula Normalization (Sect. 2.3)</p><p>Query terms are (unintentionally) de-duplicated.</p><p>primary: A submitted run with most of the presumably best setup, based on tests on the ARQMath-1 benchmark, as described in Table <ref type="table">5</ref>.</p><p>proximityReRank: A submitted run based on Section 3.3. This uses the same setup as the primary run, but the top-1000 matches are subsequently re-ranked by proximity, using normalized span as the proximity measure.</p><p>holisticSearch: A post-experiment run that matches formulas holistically based on Section 3.4. When searching in the formula corpus, 𝛾 is set to 0.1 in Equation <ref type="formula" target="#formula_1">2</ref>, and when searching in the document corpus, 𝛼 is set to 0.5 in Equation <ref type="formula" target="#formula_3">4</ref> and 𝜅 is set to 300.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>duplicateTerms:</head><p>A post-experiment run sharing the same setup as the primary run, except that duplicate query terms are preserved as described in Section 3.1.</p><p>The results of these runs for ARQMath-2 are shown in Table <ref type="table">6</ref>, together with the baseline runs and our submissions from last year over the ARQMath-1 benchmark. In general, after parameter selection based on the ARQMath-1 benchmark, our updated system produces results that are a significant improvement over those from last year's system on the ARQMath-1 topics. For instance, our primary setup evaluated over the ARQMath-1 benchmark achieves an nDCG ′ score of 0.433, nearly a 10-point gain over the nDCG ′ score of 0.345 produced by our best participant run (alpha05-noR) last year.</p><p>This parameter selection based on the ARQMath-1 benchmark helps our updated system achieve equally good results for the new set of math topics in ARQMath-2. Our primary run produces an nDCG ′ of 0.434, which remains the best run among all participants <ref type="bibr" target="#b12">[13]</ref>. The unsubmitted run duplicateTerms, which corrects an oversight in the primary run and therefore reflects our intended "best" setup, scores even higher, with an nDCG ′ of 0.462.</p><p>The duplicateTerms run also has the highest values for the ARQMath-2 benchmark in all other evaluation measures, with the exception of P ′ @10 for the baseline run Linked MSE posts (which uses human-built links that were not available to participating teams <ref type="bibr" target="#b12">[13]</ref>). With a closer look at the effectiveness breakdown by topic category in Table <ref type="table" target="#tab_5">7</ref>, we observe that this run has a strong performance for Formula-dependent topics, Proof-like topics, and topics of Low-level difficulty. 
In spite of a different set of math topics being evaluated, these observed strengths are similar to those of our best participant run last year <ref type="bibr" target="#b2">[3]</ref>.</p><p>On the other hand, our submitted alternative run proximityReRank, which re-ranks the results using the proximity signal Normalized-Span, does not perform well. For the ARQMath-1 benchmark, this run shows a 6-point loss compared to the primary run (0.373 vs 0.433), and the loss widens to nearly 10 points in ARQMath-2 (0.335 vs 0.434), indicating an unsatisfactory re-ranking. It seems that even for a measure that shows a strong proximity signal in Table <ref type="table">4</ref>, the separation among documents based on proximity might be inadequate to reflect relevance. Finally, our unsubmitted run holisticSearch, an approach also motivated by proximity, performs fairly well. Compared to the primary run, its nDCG ′ score shows a 3-point loss over the ARQMath-1 benchmark (0.405 vs 0.433) and similarly a 2-point loss in ARQMath-2 (0.414 vs 0.434). Notably, this run outperforms all other runs submitted by participants in ARQMath-2 and outperforms our primary run on the 𝑃 ′ @10 and bpref measures. However, it is outperformed by the unsubmitted duplicateTerms run on all evaluation measures (nearly a 5-point loss, 0.414 vs 0.462, for nDCG ′ ), suggesting room for improvement for this approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Task 1: Evaluation of the MathDowsers runs and the baseline runs in ARQMath-2 (71 topics), compared with evaluation over the ARQMath-1 benchmark (77 topics), reporting nDCG ′ , MAP ′ , P ′ @10, and bpref. Parentheses indicate a result from an approach using privately held data not available to participants.</p></div>
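<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the proximity measure used for re-ranking can be sketched as follows. This is an illustrative assumption modeled on the normalized-span measure of Tao and Zhai <ref type="bibr" target="#b13">[14]</ref>, not necessarily the exact definition used in Section 3.3: the length of the smallest window covering all occurrences of the matched query terms, divided by the number of occurrences.</p><p>
```python
def normalized_span(positions_by_term):
    """Proximity of matched query terms within a document (a sketch).

    positions_by_term maps each matched query term to the sorted list
    of positions where it occurs in the document. Smaller values mean
    tighter grouping, i.e. stronger proximity.
    """
    positions = [p for ps in positions_by_term.values() for p in ps]
    if not positions:
        return float("inf")  # no matched terms: worst possible proximity
    span = max(positions) - min(positions) + 1  # smallest covering window
    return span / len(positions)
```
</p><p>Re-ranking then sorts the top-1000 matches in ascending order of this value.</p></div>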
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Task 2: In-context Formula Retrieval</head><p>For Task 2, participants are asked to retrieve the top matching formulas, together with their associated posts, for each topic formula chosen from the set of topics used for Task 1. Relevance of a retrieved formula is evaluated in context: both the associated post of a retrieved formula and the associated topic content of the topic formula are presented to the assessors for evaluation. Assessments are then aggregated so that each visually distinct formula is judged to be relevant if any of its formula occurrences are deemed to be relevant. A system is then evaluated by its performance with respect to visually distinct formulas only. For ARQMath-2, we propose two simple approaches that re-use two major components created for Task 1:</p><p>1. the Formula Corpus of all visually distinct formulas, as described in Section 3.4.1; 2. the results from Task 1 Answer-Ranking of the top 10,000 answer-posts for each topic, run with the primary setup as detailed in Table <ref type="table">5</ref>.</p><p>The rest of this section describes our two approaches built on these components.</p></div>
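<div xmlns="http://www.tei-c.org/ns/1.0"><p>The aggregation of assessments described above can be sketched as follows; this is a minimal illustration of the rule (a visually distinct formula receives the maximum relevance over its occurrences), with hypothetical names rather than the lab's actual evaluation scripts.</p><p>
```python
def aggregate_visual_relevance(occurrence_judgments):
    """Aggregate occurrence-level judgments to visually distinct formulas.

    occurrence_judgments: iterable of (visual_id, relevance) pairs, one
    per judged formula occurrence. Each visually distinct formula gets
    the maximum relevance over its occurrences, so it is relevant as
    soon as any one occurrence is deemed relevant.
    """
    by_visual = {}
    for visual_id, rel in occurrence_judgments:
        by_visual[visual_id] = max(by_visual.get(visual_id, 0), rel)
    return by_visual
```
</p></div>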
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Formula-centric: Selecting Visually Matching Formulas</head><p>The first straightforward approach is formula-centric, relying on Tangent-L's internal formula matching capability to find the matching formulas. To create a list of matching formulas for a topic, we first search for matches to the topic formula in the formula corpus of all visually distinct formulas. This gives us a ranking 𝑅 of visually distinct formulas. We then expand each element of 𝑅 with its set of formula occurrences: formulas that have the same visual-id but appear in different posts. <ref type="foot" target="#foot_7">12</ref> We refer to a set of formula occurrences having the same visual-id as a visual group. The selection of formula occurrences to return is then governed by the rank of their associated posts in the answer retrieval task. In particular, 1. Formulas within the same visual group are ranked in the same order as their associated posts in the Task 1 ranking for the corresponding topic. If the associated posts are question-posts that are not associated with any answer from Task 1, the formulas are assigned the lowest ranking. Finally, the lexical order of formula-ids is used to break ties. 2. For each of the top-20 visually distinct formulas in 𝑅, we select the top five formulas from its visual group (or all formulas in the visual group if there are fewer than five); for the remainder, we select the top formula only (if any have associated question or answer posts). 3. Sequentially considering the formulas in 𝑅 in order, selected formula occurrences from each visual group are appended to the final list of matching formulas until 1000 formula occurrences are selected in total.</p></div>
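<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three selection steps above can be sketched in code as follows. This is a minimal illustration under assumed data structures, not our actual implementation: the names visual_groups and task1_rank are hypothetical, and formulas from comment-posts are assumed to have been filtered out beforehand.</p><p>
```python
def select_formula_occurrences(ranked_visual_ids, visual_groups, task1_rank,
                               top_groups=20, per_group=5, limit=1000):
    """Formula-centric selection (sketch).

    ranked_visual_ids: visual-ids in Tangent-L's ranked order R.
    visual_groups: visual-id to a list of (formula_id, post_id)
        occurrence pairs sharing that visual-id.
    task1_rank: post_id to the rank of the associated post in the
        Task 1 answer-ranking; posts absent from Task 1 get the
        lowest priority.
    """
    lowest = float("inf")
    results = []
    for i, vid in enumerate(ranked_visual_ids):
        # Step 1: order occurrences by the Task 1 rank of their post,
        # breaking ties by the lexical order of formula-ids.
        group = sorted(visual_groups.get(vid, []),
                       key=lambda occ: (task1_rank.get(occ[1], lowest),
                                        str(occ[0])))
        # Step 2: keep up to five occurrences for the top-20 visual
        # groups, and only the top occurrence for the remainder.
        keep = 1 if i >= top_groups else per_group
        # Step 3: append selections until the limit of 1000 is reached.
        for occ in group[:keep]:
            if len(results) == limit:
                return results
            results.append(occ)
    return results
```
</p></div>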
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Document-centric: Screening Formulas from Matched Documents</head><p>The second straightforward approach is document-centric, relying more on the results from the answer retrieval task. Based on the answer-ranking from Task 1, the final list of matching formula occurrences is selected from the answers as follows:</p><p>1. For each matched answer-post for the corresponding topic in Task 1, we retrieve its question-answer document from the document corpus. If the document contains only one formula, that formula is selected. Otherwise, each formula from the document is mapped to its visual group, and its Normalized Similarity Score (Equation <ref type="formula" target="#formula_2">3</ref>) with respect to the topic formula is computed using 𝛾 = 0.1 in Equation 2 (but see below). Formulas having a score less than a threshold of 0.8 are screened out, and the rest are preserved and ranked accordingly. 2. Following the original answer-ranking, preserved formulas from each question-answer pair are appended to the final list until 1000 formulas are selected in total.</p><p>Formulas in an answer-post might correspond to visually distinct formulas anywhere in the formula corpus, but it is highly inefficient to compute the Normalized Similarity Score for every formula in the formula corpus, which would require retrieving over 8.5 million RawScores using Tangent-L. Therefore, for each topic, formulas in answer-posts that are not within the top 10,000 most similar formulas to the query formula are assigned a score of 0 and are therefore screened out.</p></div>
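<div xmlns="http://www.tei-c.org/ns/1.0"><p>The screening procedure above can be sketched as follows. Again this is an illustration under assumed data structures, not our actual implementation: doc_formulas and norm_sim are hypothetical names, and norm_sim is assumed to already return 0 for formulas outside the top 10,000 most similar.</p><p>
```python
def screen_formulas(answer_ranking, doc_formulas, norm_sim,
                    threshold=0.8, limit=1000):
    """Document-centric selection (sketch).

    answer_ranking: answer post-ids in Task 1 ranked order.
    doc_formulas: post_id to the formulas appearing in its
        question-answer document.
    norm_sim: formula to its Normalized Similarity Score against the
        topic formula (0 for unscored formulas).
    """
    results = []
    for post_id in answer_ranking:
        formulas = doc_formulas.get(post_id, [])
        if len(formulas) == 1:
            kept = formulas  # a single formula is selected outright
        else:
            # Screen out formulas scoring below the threshold and rank
            # the remainder by descending similarity.
            kept = sorted((f for f in formulas if norm_sim(f) >= threshold),
                          key=norm_sim, reverse=True)
        # Following the answer-ranking, append until the limit is reached.
        for f in kept:
            if len(results) == limit:
                return results
            results.append(f)
    return results
```
</p></div>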
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Task 2: Runs and Results</head><p>For ARQMath-2, we include two automatic runs: formulaBase: A submitted run selecting visually matching formulas as in Section 4.1; docBase: A submitted run selecting formulas from matched documents as in Section 4.2. The results of both runs in ARQMath-2 are shown in Table <ref type="table" target="#tab_6">8</ref>, together with the baseline run and the best participant runs for the ARQMath-1 and ARQMath-2 benchmarks. Our primary run formulaBase, with parameter selection based on the ARQMath-1 benchmark, performs very close to the best participant run Tangent-CFTED produced by the DPRL team last year (0.562 vs 0.563). However, on the ARQMath-1 benchmark, it does not perform as well as the ltrall run submitted this year by the DPRL team, showing a 17-point loss on nDCG ′ over the same set of math topics (0.562 vs 0.735).</p><p>On the ARQMath-2 benchmark, however, with a new set of math topics, our primary run formulaBase performs approximately as well as before, with an nDCG ′ score of 0.552. This score is the best among all automatic runs, and it is almost indistinguishable from the best participant run P300 from the Approach0 team, which is a manual run. Notably, on the ARQMath-2 benchmark, it outperforms the ltrall run from the DPRL team by over 10 points (0.552 vs 0.445).</p><p>On the other hand, our alternative run docBase does not perform as well as expected. In terms of nDCG ′ , this run shows nearly a 16-point loss with respect to our primary run for the ARQMath-1 benchmark (0.404 vs 0.562) and nearly a 12-point loss for ARQMath-2 (0.433 vs 0.552). It also achieves lower scores on all other evaluation measures, suggesting that simply selecting formulas from matching documents does not work well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Efficiency</head><p>The machines used for our experiments have the following specifications: Note that data and index sizes show the values reported by the du command on Linux, which measures disk space usage based on blocks; thus the many small documents in the formula corpus require much more disk space than might be expected. (In fact, the total size of the data in the formula corpus is only 9.2 GB.)</p><p>Runs for ARQMath-2 were executed on Machine B, with the average, minimum, and maximum query times per topic as follows: The proximityReRank run uses Machine A to re-rank the output from the primary run, thus requiring first the time shown for the primary run on Machine B and then an additional 8 hours to re-rank all topics on Machine A.</p></div>
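<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate the block-size effect mentioned above, the following sketch approximates block-based disk usage under an assumed 4 KiB block size (the actual block size depends on the filesystem):</p><p>
```python
def du_style_usage(file_sizes, block=4096):
    """Approximate block-based disk usage, as reported by du.

    Each file occupies a whole number of filesystem blocks, so a corpus
    of many tiny files consumes far more disk space than the sum of the
    files' actual (apparent) sizes.
    """
    on_disk = sum(-(-size // block) * block for size in file_sizes)  # ceil to blocks
    apparent = sum(file_sizes)
    return on_disk, apparent
```
</p><p>For example, a million 100-byte documents occupy roughly 4 GB of blocks while holding only 100 MB of data, which is why the formula corpus appears far larger on disk than its 9.2 GB of content.</p></div>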
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Further Work</head><p>We conclude that a traditional math-aware search system continues to be an efficient and effective approach to tackle the CQA task, as demonstrated by our producing the best participant run in Task 1 again this year. In particular, a significant boost in effectiveness for Task 1 can be observed on both years' math topics after parameter selection based on tests on the ARQMath-1 benchmark. The best result is achieved through several improvements to the formula matching capability of Tangent-L, demonstrating the competitiveness of this math-aware search engine in handling text and mathematical notation together.</p><p>We also develop a simple but strong baseline for the in-context formula retrieval task. Being the best automatic run and competitive with the best participant run, our formula-centric run demonstrates again the strong formula matching ability of Tangent-L.</p><p>Nevertheless, several aspects of our runs turn out to be somewhat disappointing again. In the CQA task, we explore the incorporation of proximity in two approaches, and neither improves effectiveness over using a bag-of-terms approach:</p><p>Proximity Re-Ranking: Re-ranking based on proximity is unsatisfactory, despite some proximity difference being observed based on the relevancy of judged documents. Perhaps proximity is a more important measure when the BM25 + score is low, and therefore it needs to be incorporated into the initial retrieval <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b15">16]</ref> rather than used for re-ranking. Alternatively, despite the percentage differences observed, the actual differences might be too small to serve as a reliable signal of relevance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Matching Formulas Holistically:</head><p>The proposed method to match formulas holistically shows some promise but does not perform as well as matching based on math tokens. Perhaps Equation 5 can be improved to make better use of the formula similarity scores returned from the formula corpus. Improvements here might also provide insights into further improving our formula-centric approach in Task 2.</p><p>Additionally, our proposed document-centric baseline for the in-context formula retrieval task, which selects formulas from top matching math answers, does not perform as well as expected given our strong result in the answer retrieval task. Investigating the distribution of matching formulas among the top relevant answers might help in further exploring this simple tactic for the task.</p><p>All in all, while our updated system with Tangent-L continues to excel in both tasks, there is still considerable room for improvement in how we might use the document relevancy signals observed from the ARQMath-1 benchmark to propose new approaches that might further improve effectiveness. In retrospect, the approaches that we attempted through re-ranking did not benefit sufficiently from the raw signals obtained from the ARQMath-1 benchmark. With the additional new evaluation data available from the ARQMath-2 benchmark, we expect to gain better insights, and we are excited to continue exploring question answering for the mathematical domain.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Researcher dowsing for answers to math queries.</figDesc><graphic coords="2,213.06,84.19,166.68,117.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Symbol Layout Tree for 𝑥 2 + 3 𝑥 + 𝑥 with repetitions highlighted.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Partial text from an answer post (post-id 2653) including "math-container" blocks without "id" attributes, even though the corresponding formulas are included in the formula file with formula-ids from 2285 to 2296.</figDesc><graphic coords="7,89.29,84.19,416.71,80.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: An illustration of the revised indexing unit to create the document corpus. Each document is an HTML file containing a question-answer pair and its associated information.</figDesc><graphic coords="8,223.47,316.80,145.84,175.97" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>2.3)</head><label>23</label><figDesc>Only semantic matches of commutativity are supported. Data Cleansing (Sect. 2.4) Recognition of Presentation MathML is improved. Document Corpus (Sect. 3.2) Comments from answers are added to the indexing unit. Query Keyword Extraction (Sect. 3.1) Keywords within a formula representation are retained.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>1 .</head><label>1</label><figDesc>Commutativity: 𝐴 + 𝐵 should match 𝐵 + 𝐴 2. Symmetry: 𝐴 = 𝐵 should match 𝐵 = 𝐴 3. Alternative Notation: 𝐴 × 𝐵 should match 𝐴 𝐵, and 𝐴 ≯ 𝐵 should match 𝐴 ≤ 𝐵 4. Operator Unification: 𝐴 ≺ 𝐵 should match 𝐴 &lt; 𝐵 5. Inequality Equivalence: 𝐴 ≥ 𝐵 should match 𝐵 ≤ 𝐴 and simple adjustments are applied to Tangent-L's regular math tokens to support these semantic matches.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Erroneous Presentation MathML for the formula "0.999... &lt; 1" (formula id 382).</figDesc><table><row><cell>Expected Presentation MathML</cell><cell>Erroneous Presentation MathML Provided</cell></row><row><cell>&lt;mrow&gt;</cell><cell>&lt;mrow&gt;</cell></row><row><cell>&lt;mrow&gt;</cell><cell>&lt;mrow&gt;</cell></row><row><cell>&lt;mn&gt;0.9999&lt;/mn&gt;</cell><cell>&lt;mn&gt;0.9999&lt;/mn&gt;</cell></row><row><cell>&lt;mi mathvariant="normal"&gt;...&lt;/mi&gt;</cell><cell>&lt;mo&gt;&lt;/mo&gt;</cell></row><row><cell>&lt;mo&gt;&amp;lt;&lt;/mo&gt;</cell><cell>&lt;mi mathvariant="normal"&gt;...&lt;/mi&gt;</cell></row><row><cell>&lt;mn&gt;1&lt;/mn&gt;</cell><cell>&lt;mo&gt;&lt;/mo&gt;</cell></row><row><cell>&lt;/mrow&gt;</cell><cell>&lt;mi mathvariant="normal"&gt;&amp;amp;&lt;/mi&gt;</cell></row><row><cell>&lt;mrow&gt;</cell><cell>&lt;mo&gt;&lt;/mo&gt;</cell></row><row><cell></cell><cell>&lt;mi&gt;l&lt;/mi&gt;</cell></row><row><cell></cell><cell>&lt;mo&gt;&lt;/mo&gt;</cell></row><row><cell></cell><cell>&lt;mi&gt;t&lt;/mi&gt;</cell></row><row><cell></cell><cell>&lt;/mrow&gt;</cell></row><row><cell></cell><cell>&lt;mo&gt;;&lt;/mo&gt;</cell></row><row><cell></cell><cell>&lt;mn&gt;1&lt;/mn&gt;</cell></row><row><cell></cell><cell>&lt;/mrow&gt;</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7</head><label>7</label><figDesc>Category performance of the duplicateTerms run in ARQMath-2. The better performance measure for each sub-category and each evaluation measure is highlighted in bold.</figDesc><table><row><cell></cell><cell>Topic</cell><cell>duplicateTerms</cell></row><row><cell></cell><cell cols="2">Count nDCG ′ MAP ′ P ′ @10 bpref</cell></row><row><cell>Overall</cell><cell>71</cell><cell>0.462 0.187 0.241 0.163</cell></row><row><cell>Dependency</cell><cell></cell><cell></cell></row><row><cell>Text</cell><cell>10</cell><cell>0.423 0.158 0.260 0.142</cell></row><row><cell>Formula</cell><cell>21</cell><cell>0.516 0.235 0.319 0.204</cell></row><row><cell>Both</cell><cell>40</cell><cell>0.443 0.169 0.195 0.146</cell></row><row><cell>Topic Type</cell><cell></cell><cell></cell></row><row><cell>Calculation</cell><cell>25</cell><cell>0.455 0.189 0.200 0.165</cell></row><row><cell>Concept</cell><cell>19</cell><cell>0.429 0.160 0.232 0.137</cell></row><row><cell>Proof</cell><cell>27</cell><cell>0.492 0.204 0.285 0.178</cell></row><row><cell>Difficulty</cell><cell></cell><cell></cell></row><row><cell>Low</cell><cell>32</cell><cell>0.509 0.216 0.300 0.199</cell></row><row><cell>Medium</cell><cell>20</cell><cell>0.383 0.116 0.150 0.098</cell></row><row><cell>Hard</cell><cell>19</cell><cell>0.466 0.213 0.237 0.169</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8</head><label>8</label><figDesc>Task 2: Evaluation of MathDowsers runs, the best participant runs, and baseline runs in ARQMath-2. MAP ′ † P ′ @10 † bpref † nDCG ′ MAP ′ † P ′ @10 † bpref †</figDesc><table><row><cell></cell><cell></cell><cell>ARQMath-1</cell><cell></cell><cell cols="2">ARQMath-2</cell></row><row><cell cols="3">nDCG ′ Baselines</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Tangent-S</cell><cell></cell><cell>0.691 0.446 0.453 0.412</cell><cell cols="4">0.492 0.272 0.419 0.290</cell></row><row><cell>MathDowsers</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>formulaBase</cell><cell cols="2">¶ 0.562 0.370 0.447 0.374</cell><cell cols="4">0.552 0.333 0.450 0.348</cell></row><row><cell>docBase</cell><cell>*</cell><cell>0.404 0.251 0.386 0.275</cell><cell cols="4">0.433 0.257 0.359 0.291</cell></row><row><cell>Best Participant Run</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Approach0-P300</cell><cell cols="2">*M 0.507 0.342 0.441 0.343</cell><cell cols="4">0.555 0.361 0.488 0.362</cell></row><row><cell>DPRL-ltrall</cell><cell cols="2">¶ 0.738 0.525 0.542 0.495</cell><cell cols="4">0.445 0.216 0.333 0.228</cell></row><row><cell cols="3">Best Participant Run (year 2020)</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">DPRL-Tangent-CFTED *</cell><cell>0.563 0.388 0.436 0.372</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table><note>¶ submitted primary run * submitted alternate run M manual run † using H+M binarization</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">Our simple implementation suffers from the fact that math tokens handle only a pair of adjacent symbols at a time. For a longer expression, such as 𝐴 + 𝐵 × 5, the overly simplistic approach generates the same set of math tokens as the expression 𝐵 + 𝐴 × 5, failing to consider the priority of operators. Nevertheless, we have chosen to take this approach because correct treatment requires that the math formulas be parsed properly, which is difficult to achieve when the input of Tangent-L (Presentation MathML) captures layout only.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">Similar to commutative operations and symmetric relations, the reversion of operands is implemented simplistically over a pair of adjacent symbols at a time. Thus the generated set of math tokens might equally well represent a semantically distinct formula.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">https://dlmf.nist.gov/LaTeXML</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">Keywords were not intended to be extracted from within formula representations in the original design for ARQMath-1, but turned out to be a valuable "mistake" that helped boost performance.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">In the submission for ARQMath-1, duplicate terms were extracted, but their weights were not boosted accordingly because of an oversight in our implementation.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">When extracting the comments, the file Comments.V.1.0.xml is used instead of the more recently released Comments.V.1.2.xml because the former contains approximately three times as many comments as the latter. Note, however, that the former file contains more "noise" that requires cleansing as discussed in Section 2.4.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6">Note, however, that this ignores proximity among keywords and between keywords and formulas.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_7">Only question-posts and answer-posts are of concern in the task, so any returned formulas from commentposts are ignored.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been funded by the Waterloo-Huawei Joint Innovation Lab and NSERC, the Natural Science and Engineering Research Council of Canada. The NTCIR Math-IR dataset used for earlier benchmarks and as a source of relevant keywords was made available through an agreement with the National Institute of Informatics.</p></div>
			</div>


			<div type="funding">
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of ARQMath-2</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Second CLEF lab on answer retrieval for questions on math</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="volume">12880</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2021</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">): CLEF lab on answer retrieval for questions on math</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Dowsing for math answers with Tangent-L</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kassaie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Labahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Marzouk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">W</forename><surname>Tompa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">DPRL Systems in the CLEF 2020 ARQMath Lab</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Three is Better than One Ensembling Math Information Retrieval Systems</title>
		<author>
			<persName><forename type="first">V</forename><surname>Novotný</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Štefánik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lupták</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">PSU at CLEF-2020 ARQMath Track: Unsupervised Re-ranking using Pretraining</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rohatgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ARQMath Lab: An Incubator for Semantic Formula Search in zbMATH Open?</title>
		<author>
			<persName><forename type="first">P</forename><surname>Scharpf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schubotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Greiner-Petter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ostendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Teschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gipp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CLEF 2020</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Choosing math features for BM25 ranking with Tangent-L</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">W</forename><surname>Tompa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">DocEng</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page">10</biblScope>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Dowsing for math answers</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kassaie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">W</forename><surname>Tompa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2021</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">12880</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Białecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Muir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ingersoll</surname></persName>
		</author>
		<title level="m">SIGIR 2012 Workshop on Open Source Information Retrieval</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="17" to="24" />
		</imprint>
	</monogr>
	<note>Apache Lucene 4</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Recognition and retrieval of mathematical expressions</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blostein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Document Anal. Recognit</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="331" to="357" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Lower-bounding term frequency normalization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM&apos;11</title>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="7" to="16" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Advancing math-aware search: The ARQMath-2 lab at CLEF</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECIR 2021</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">12657</biblScope>
			<biblScope unit="page" from="631" to="638" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">An exploration of proximity measures in information retrieval</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR 2007</title>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="295" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The probabilistic relevance framework: BM25 and beyond</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="333" to="389" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Term proximity scoring for keyword-based retrieval systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Rasolofo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Savoy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th European Conference on IR Research (ECIR 2003)</title>
				<meeting>the 25th European Conference on IR Research (ECIR 2003)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">2633</biblScope>
			<biblScope unit="page" from="207" to="218" />
		</imprint>
	</monogr>
	<note>Advances in Information Retrieval</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
