Dowsing for Answers to Math Questions: Doing Better with Less

Andrew Kane, Yin Ki Ng and Frank Wm. Tompa
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, N2L 3G1
(arkane@uwaterloo.ca, kiking0501@gmail.com, fwtompa@uwaterloo.ca)
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract
We present our application of a new math-aware search engine to the 2022 ARQMath Lab. This extends the annual submissions from the University of Waterloo's "MathDowsers," for which we twice produced the best Task 1 participant run. This year the major improvements result from greatly reducing the number of math tuples used to represent a math formula, which not only saves considerable space in the index and reduces query terms, but also improves retrieval effectiveness as measured by normalized discounted cumulative gain with unjudged documents removed (nDCG'). In addition, we have re-implemented math tuple generation in a standalone code base and replaced Tangent-L, the previous Lucene-based search engine, with a new engine that simplifies experimentation. Experimental results show that the new system has substantially better performance than its predecessor in terms of nDCG', MAP', and P'@10, both when trying to find answers to math questions (Task 1) and when trying to find similar formulas (Task 2). From this we conclude that traditional text retrieval systems can easily be enhanced to become effective math-aware search engines using the tools we have developed.

Keywords
Community Question Answering (CQA), Mathematical Information Retrieval (MathIR), Symbol Layout Tree (SLT), math tuples, Mathematics Stack Exchange (MSE), ARQMath Lab, formula matching

1. Introduction
The ARQMath Lab at CLEF 2022 [1] continues previous years' Labs [2, 3] to examine how best to search a community question answering (CQA) repository to find answers to questions involving math data. The Labs use a collection of questions and answers from Math Stack Exchange (MSE, https://math.stackexchange.com) posted between 2010 and 2018, consisting of approximately 1.1 million question-posts and 1.4 million answer-posts. Task 1, the CQA task, asks participants to return potential answers to unseen mathematical questions from among existing answer-posts in the collection. Task 2 is the formula retrieval task, where formulas in the context of specific unseen questions serve as queries for matching relevant formulas from question-posts and answer-posts in the same collection. Task 3, added this year, asks participants to formulate an answer to an unseen mathematical question using any resources and techniques available to them.
In the first year, the Waterloo team of MathDowsers participated in Task 1 only. We demonstrated how tagged mathematical questions can be automatically transformed into formal queries consisting of keywords and formulas and how the resulting formal queries can be effectively executed against a corpus [4].
In particular, the corpus of MSE question-answer pairs was indexed by a math-aware search engine, namely Tangent-L [5], built on top of Lucene using a traditional text-based ranking algorithm (BM25+ [6]) with simple adaptations to handle math formulas within the engine. A key component of this approach is to represent formulas as a set of math tuples that reflect certain features found in the formulas and to treat the tuples as words in the engine.
In the second year, we participated again as the MathDowsers team for Task 1 and for Task 2, with the goal of continuing to explore the potential of a traditional math-aware query system in tackling both tasks [7]. We demonstrated, among other things, that augmenting the set of math tuples with tuples that capture repeated symbols slightly increases retrieval effectiveness, that reducing noise in the data makes a substantial impact, and that searching a corpus of visually distinct formulas is effective for Task 2.
The success of Tangent-L in both ARQMath-1 and ARQMath-2 could result from several factors, but chief among them are (1) the use of a carefully constructed corpus of documents to be searched and (2) the extraction of features that serve well in representing math formulas. It is clear that anyone could adopt the former, and this year we show how anyone can incorporate the latter into their approach. To this end, in this paper we present our re-implementation of the math feature extraction as a standalone code base. We also show that it can be applied universally by using that sub-system with a different, more conventional, text-based search engine. As it turns out, the first author had been developing such an engine in order to explore various aspects of search technology, and the ARQMath Lab provides an opportunity to test the new engine against a realistic application. We hope that with this presentation of our tools, others can use our constructed documents and math tuple generator as a basis on which to apply even smarter processing, regardless of the search engine they choose.
In addition to separating the math extraction code, we re-examine the mapping from formulas to sets of math tuples and show that capturing certain features only, and ignoring other features, improves effectiveness for both Task 1 and Task 2. We also show that including some extracted text from the LaTeX representation of formulas improves formula matching in Task 2. (We did not participate in Task 3.)
This paper first describes the three large processing steps used in our system:
1. document and query construction, described in Section 2;
2. math tuple generation, described in Section 3; and
3. indexing and querying with the core engine, described in Section 4.
Next, this year's experimental results are detailed in Section 5, and the conclusions follow thereafter.

2. Document and Query Construction

2.1. Document Corpus for Task 1
For ARQMath-3, we continue to use enhanced question-answer pairs as indexing units for the document corpus, because worse performance can be observed for the ARQMath-1 and ARQMath-2 benchmarks if the content of the associated question is dropped and only text from each answer is indexed [8].
As before, for every answer-post in the MSE database, we use a series of preprocessing steps (code at https://github.com/kiking0501/MathDowsers-ARQMath) to create a document consisting of an "enhanced question" (including the Question Title, Question Body Text, Question Tags, Question Comments, Duplicate Post Titles, and Related Post Titles) together with an "enhanced answer" (including the Answer Body Text and the Answer Comments). Figure 1 illustrates the fields indexed as part of each question-answer pair.

Figure 1: Structure of a corpus document containing a question-answer pair and its associated data.

Note that to be consistent with other search systems and datasets, we use a simple TREC encoding format for documents: each document starts and ends with the tags <DOC> and </DOC>, and the first element inside a document is the document identifier tagged with <DOCNO>...</DOCNO> [9, p. 24]. In the Task 1 corpus, the DOCNO value encodes the year, threadid (the id for the question-post), and postid of the answer-post. By using this convention, many documents can be stored in a single (compressed) file and still be appropriately identified, which saves space and avoids extensive file openings and closings.

2.2. Document Corpus for Task 2
Last year we determined that searching a corpus of visually distinct formulas outperforms picking the best formula from the Task 1 results [7]. Therefore, for ARQMath-3, we again create a corpus of visually distinct formulas appearing in either question or answer posts and represented in Presentation MathML. Searching this corpus gives us a ranking 𝑅 of visually distinct formulas, from which any occurrence in a question or answer post can be determined arbitrarily. (In this corpus, the DOCNO value for each document encodes the formula-id, the post-id containing the formula, and the visual-id representing the visually distinct form of the formula.)
Because some formulas failed to be converted from LaTeX into MathML and because there is potentially useful text buried in math tuples (particularly within tuples generated from text nodes in the symbol layout tree), we build a variant corpus in which each document includes the math tuples for a visually distinct formula together with words selected from the LaTeX representation of that same formula (Figure 2). More specifically, given a LaTeX formula, the text extractor treats all non-alphanumeric characters as white space and removes all resulting tokens that are shorter than three characters in length. Thus, for example, the expression $x\in \mathbb{Q} \land x=\frac pq \text{ in lowest terms}$ (encoding "𝑥 ∈ ℚ ∧ 𝑥 = 𝑝/𝑞 in lowest terms") produces "mathbb land frac text lowest terms."

Figure 2: Structure of a corpus document containing the MathML representation for a visually distinct formula and text extracted from its LaTeX representation.

2.3. Query Formation
As in previous years, we represent queries using a simple XML encoding format as shown in Figure 3, and we convert the given MSE questions ("topics") to this form using the code originally developed for ARQMath-1 [4] (https://github.com/kiking0501/MathDowsers-ARQMath).

Figure 3: Structure of a query stream.

Figure 4: Symbol Layout Tree for $x^2 = 3^x + 2^x$ with repetitions highlighted.

Our Task 2 queries do not use keywords, so they are not generated during conversion. In addition, to search the variant formula corpus, a second processing step augments each query with text extracted from the LaTeX corresponding to the given formula-id.
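For illustration, the extraction rule just described can be stated as a short function. The following sketch is ours (the function name and regular expression are not part of the MathDowsers code base) and reproduces the worked example from Section 2.2.

```python
import re

def latex_text_terms(latex: str) -> str:
    """Treat every non-alphanumeric character in a LaTeX formula as white
    space and drop any resulting token shorter than three characters."""
    tokens = re.split(r"[^0-9A-Za-z]+", latex)
    return " ".join(t for t in tokens if len(t) >= 3)

# Reproduces the example from Section 2.2:
print(latex_text_terms(r"$x\in \mathbb{Q} \land x=\frac pq \text{ in lowest terms}$"))
# mathbb land frac text lowest terms
```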
3. Representing Formulas by Math Tuples

3.1. Math Tuples in Tangent-L
The ability to match formulas is what makes a search engine "math-aware," though clearly there are many approaches to implementing this. The mechanism piloted by Tangent [10] and further developed for Tangent-L [5, 7] is to convert a symbol layout tree (expressed in Presentation MathML) into a set of features called math tuples. Since last year's Lab, the set of features supported by Tangent-L includes symbol pairs, terminal symbols, compound symbols, duplicate symbols, and augmented locations, as shown in Table 1 for the symbol layout tree depicted in Figure 4. This year we use the same set of features, but restrict the number of tuples as explained in the remainder of this section.

Table 1: Math tuples for the formula in Figure 4.
Symbol pairs (pairs of adjacent symbols with connecting edge):
  (𝑥, 2, ↗)  (𝑥, =, →)  (=, 3, →)  (3, 𝑥, ↗)  (3, +, →)  (+, 2, →)  {2, 𝑥, →}
Terminal symbols (symbols with no out-edges):
  (2, 𝜑)  (𝑥, 𝜑)  (𝑥, 𝜑)
Compound symbols (symbols with multiple out-edges):
  (𝑥, →↗)  (3, →↗)
Duplicate symbols (repeated symbols on the same path or on two paths from a common ancestor; previously called repeated symbols [7]):
  {𝑥, →→↗}  {𝑥, →→→→→}  {𝑥, →→→, ↗}  {2, →→→→, ↗}
Augmented locations (every tuple also represented with its path from the root):
  (𝑥, 2, ↗, -)  (𝑥, =, →, -)  (=, 3, →, →)  (3, 𝑥, ↗, →→)  (3, +, →, →→)  (+, 2, →, →→→)  (2, 𝑥, →, →→→→)  (2, 𝜑, ↗)  (𝑥, 𝜑, →→→→→)  (𝑥, 𝜑, →→↗)  (𝑥, ↗→, -)  (3, ↗→, →→)  {𝑥, →→↗, -}  {𝑥, →→→→→, -}  {𝑥, →→→, ↗, →→}  {2, →→→→, ↗, -}

3.2. Revisiting Tuple Generation for Duplicated Symbols
If a symbol repeats 𝑘 times where 𝑘 > 1, then $\binom{k}{2}$ duplication tuples (math tuples representing duplicated symbols) are generated for that symbol. As such, 𝑂(𝑛²) math tuples are generated and indexed for a formula containing 𝑛 symbols. This becomes problematic for large polynomials, matrices with repeated symbols in the entries, and many other formulas that include one or more specific symbols many times. In fact, some documents in the ARQMath collection produce huge numbers of repetitions. As a relatively small example, question post #2274526 includes the matrix

      X |  0      1      2
  Y     |
  0     | 𝑝0,0   𝑝0,1   𝑝0,2
  1     | 𝑝1,0   𝑝1,1   𝑝1,2
  2     | 𝑝2,0   𝑝2,1   𝑝2,2

which contains 8 repetitions of 0, 8 repetitions of 1, 8 repetitions of 2, and 9 repetitions of 𝑝, resulting in 240 duplication tuples and another 240 corresponding location tuples. Including all these tuples deteriorates ranking performance and bloats the index. (We hypothesize that this blowup is the reason that, based on the ARQMath-2 Lab, we recommended weighting duplication tuples so much lower than other math tuples when scoring how well a query matches a document [7].)
Two alternatives for limiting the number of tuples generated when a symbol repeats many times are to limit the number of repetitions to represent (e.g., only extract tuples for all pairs of the first few occurrences of a symbol in a formula), or to choose a subset of duplication tuples such that the cardinality is linear in the size of the formula rather than preserving all pairs of duplicated symbols. For either approach, it is important that small changes to a formula do not cause radically different tuples to be generated, so that query formulas that are close to a corpus formula can be matched.
After some experimentation with the ARQMath-1 and ARQMath-2 benchmarks, we recommend the latter option: selecting a linear number of duplication tuples.
More specifically, we perform a recursive traversal of the symbol layout tree (being sure to traverse the out-edges from a node in consistent order) and create a duplication tuple for each adjacent pair of repeated symbols in post-order. Thus, for the tree in Figure 4, the tuples {𝑥, →→→, ↗} and {𝑥, →→↗} would be produced, but {𝑥, →→→→→} would not be produced because it would reflect a repetition between the first and third 𝑥 encountered in a post-order traversal. For the example matrix shown above, this results in 29 duplication tuples plus another 29 corresponding location tuples, almost a 90% reduction. With this change, duplication tuples can be given the same weight as the other math tuples generated, removing the need for a tuning parameter to choose a relative weighting.
A second characteristic of duplication tuples is that they include the repeated symbol in the first field. However, in many situations when a duplicated variable is matched in a formula, the specific value for that variable is not important; rather it is the fact that some variable is repeated with this particular pattern of relative paths. For the example in Figure 4, replacing the variable 𝑥 by 𝑦 should still be considered a good match. Thus, instead of storing the specific variable name, it may be better to record that some variable is repeated with a specific relative path. Similarly, it may be more important to record that some operator repeats or some number repeats than to record which specific number or operator is duplicated. To accommodate this, a second modification to duplication tuples is to augment (or replace) the tuple with one that includes a wildcard that reflects the type of the symbol that is duplicated. For example, the wildcard equivalents of the selected duplicated tuples in Table 1 are {?𝑉, →→↗}, {?𝑉, ↗, →→→}, and {?𝑁, ↗, →→→→}, as well as these tuples augmented with location.
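As an illustration of the linear selection, the following sketch (ours, not the mathtuples implementation; the function name and input representation are assumptions) pairs each occurrence of a symbol only with the next occurrence of the same symbol in a post-order listing of the SLT nodes, so that a symbol occurring k times contributes k − 1 duplication tuples instead of one per pair of occurrences.

```python
from typing import Hashable, List, Tuple

def linear_duplication_pairs(
    postorder: List[Tuple[str, str, Hashable]]
) -> List[Tuple[Hashable, Hashable]]:
    """postorder lists the SLT nodes as (symbol, type_code, node_id) in
    post-order (out-edges traversed in a consistent order).  Each occurrence
    of a symbol is paired only with the next occurrence of that symbol."""
    last_seen = {}   # symbol -> node_id of its most recent occurrence
    pairs = []
    for symbol, _type_code, node_id in postorder:
        if symbol in last_seen:
            pairs.append((last_seen[symbol], node_id))   # adjacent pair in post-order
        last_seen[symbol] = node_id
    return pairs
```

For each selected pair, the generator would then emit the duplication tuple containing the repeated symbol (or its typed wildcard, such as ?𝑉 or ?𝑁) together with the relative paths of the two occurrences, as in Table 1. For the matrix example above this yields 7 + 7 + 7 + 8 = 29 pairs, matching the count reported earlier.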
3.3. Revisiting Tuple Generation for Augmented Locations
The bag-of-words model implemented by BM25 scoring means that the order of math tuples is immaterial. As a result, only the tuples that include locations (the path from the root to the feature being represented) anchor the feature with respect to the formula. Using the NTCIR benchmarks, we found that including location tuples improves ranking performance [5], although it doubles the index space for math tuples and increases the length of all documents that include formulas. However, at that time we did not explore whether all tuples should be so augmented or whether the benefit arises from having only some tuples augmented by location. In fact, the original description states that in addition to indexing math tuples for symbol pairs, "we recommend including features to reflect terminal symbols, compound symbols, and locations of symbol pairs" [5], which would generate only the first seven math tuples for augmented locations in Table 1. However, the implementing code referenced by that paper augments tuples representing terminal symbols and compound symbols as well! Thus the first variant is to capture this distinction by allowing a user to specify which subset of the three types of tuple (symbol pairs, terminal symbols, or compound symbols) should be augmented with location tuples.
A second observation is that as features appear farther from the root of a query formula, they are increasingly less likely to be located in exactly the same locations in a closely matching document formula. Furthermore, those features are increasingly less likely to appear in any formula, and so when there happens to be a match, the inverse document frequency might be overwhelmingly high. Thus long location paths might be problematic in two ways: they might not match when they should, and they might strongly match when they shouldn't. A solution is to generate augmented location tuples only for features whose locations do not meet or exceed some maximum path length. If the cutoff for location paths is set at 4, for example, then location paths must have fewer than four nodes (that is, at most three nodes and two edges). With this setting, four of the location tuples shown in Table 1 would not be generated.
A related observation is that math tuples need not be anchored with respect to the root of the formula only: other locations within the formula might well also serve as anchors for determining paths. One such set of potential anchors is the set of relational operators (i.e., =, <, ≠, ≡, etc.). If relational operators can also serve as anchors, then symbols on the left side of the equation in Figure 4 would be anchored with respect to the root of the formula, but those on the right side of the equation would be anchored with respect to the location of the equality operator. Table 2 contrasts the augmented location tuples for root-only anchors vs. anchoring at relational symbols as well, assuming that the cutoff location length is set at 4 in both cases. Notice that all features on the left of the equals sign remain the same, but those on the right have (in this case) one fewer node preceding them.

Table 2: Location tuples for the formula in Figure 4.
Anchored at root only:
  (𝑥, 2, ↗, -)  (𝑥, =, →, -)  (=, 3, →, →)  (3, 𝑥, ↗, →→)  (3, +, →, →→)  (2, 𝜑, ↗)  (𝑥, ↗→, -)  (3, ↗→, →→)  {𝑥, →→↗, -}  {𝑥, →→→, ↗, →→}  {2, →→→→, ↗, -}
Anchored at root or = operator:
  (𝑥, 2, ↗, -)  (𝑥, =, →, -)  (=, 3, →, -)  (3, 𝑥, ↗, →)  (3, +, →, →)  (+, 2, →, →→)  (2, 𝜑, ↗)  (𝑥, 𝜑, →↗)  (𝑥, ↗→, -)  (3, ↗→, →)  {𝑥, →↗, -}  {𝑥, →→→, ↗, →}  {2, →→→, ↗, -}

One side-effect of enabling additional anchors is that the "relative positions" stored in duplication tuples are now modified to reflect the fact that there are multiple anchor positions and each symbol is measured with respect to its nearest ancestor anchor. For example, because its path is measured from the equals sign, the superscript 𝑥 on the right side is treated as if it were located one step closer to the 𝑥 on the left side. A second side-effect is that more of the location tuples fall within the maximum path length threshold.
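The anchoring and path-length cutoff just described can be summarized in two small helper functions. This is our own sketch of one plausible reading of the scheme (the names and representations are ours), not the mathtuples code itself.

```python
from typing import List

def relative_location(edges_from_root: List[str], anchor_depths: List[int]) -> str:
    """Express a node's location relative to its nearest ancestor anchor.
    edges_from_root is the list of edge labels (e.g. ["→", "→", "↗"]) from the
    formula root to the node; anchor_depths gives the depths (edge counts) of
    ancestors acting as anchors, such as an "=" node when relational operators
    are enabled as anchors.  The root is always an anchor at depth 0."""
    nearest = max([0] + [d for d in anchor_depths if d <= len(edges_from_root)])
    suffix = edges_from_root[nearest:]
    return "".join(suffix) if suffix else "-"   # "-" marks the anchor itself

def augment_with_location(location: str, loc_lim: int) -> bool:
    """A location with e edges spans e + 1 nodes; generate the location tuple
    only when that node count is below the cutoff (99 means unlimited)."""
    nodes = 1 if location == "-" else len(location) + 1
    return loc_lim >= 99 or nodes < loc_lim
```

For the formula in Figure 4 with "=" enabled as an anchor, the "+" node has edges ["→", "→", "→"] from the root and an anchor at depth 1, giving the location "→→" shown in the right column of Table 2.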
3.4. Converting With the Tuple-Generator
To support this year's experiments, the mathtuples program released last year was updated to support the extensions described in the previous two subsections of this paper. The revised program (Version 2) is now available at https://github.com/fwtompa/mathtuples, and its interface is depicted in Table 3.
For each type of math tuple (e.g., symbol pair, terminal symbol), the user can specify whether or not that tuple should be included and whether or not it should be augmented by a corresponding location tuple. Thus, for example, -T 0 indicates that no terminal symbols should be included; since all paths from the root include at least one node, -T 1 indicates that terminal symbols should be included but not augmented with location tuples; because we reserve the value 99 to represent "unlimited," -T 99 indicates that terminal symbols should be included and all should be augmented by location tuples; and (the default) -T 8 indicates that terminal symbols should be included, and for each terminal symbol tuple, if the path from the root to the corresponding symbol includes fewer than 8 nodes, it should be augmented by a location tuple.

Table 3: The math tuple generator.

Optional Argument   Explanation                                              Default
-W size             Size of the window for symbol pairs (99 ⇒ unlimited)     1
-S loc_lim          Include Symbol pairs*                                    8
-R loc_lim          Include Relationship edge pairs*                         0
-T loc_lim          Include Terminal symbols*                                8
-E loc_lim          Include End-of-line symbols*                             0
-C loc_lim          Include Compound symbols*                                8
-L loc_lim          Include Long pairs without relationships*                0
-A loc_lim          Include Abbreviated relationships for long pairs*        0
-D loc_lim          Include Duplicate symbols*                               8
-docid tag          String preceding each document identifier                ""
-a e|d              Enable/disable "equality" operators as anchors           e
-c                  Return the math tuples in context                        tuples only
-d dups             Include duplication tuples for subset of 'VNOMFRTW'**    all
-s                  Expand all tuples to include wildcard synonyms           no expansion
-w wild_dups        Include wild dupl'n tuples for subset of 'VNOMFRTW'**    all

*Tuple inclusion: (loc_lim ≤ 0) ⇒ include no tuples of this type; (0 < loc_lim < 99) ⇒ augment with location tuples whenever the path has fewer than loc_lim nodes; (loc_lim ≥ 99) ⇒ augment with all location tuples.
**Node types: V(ariables), N(umbers), O(perations), M(atrices and parenthetical expressions), F(ractions), R(adicals), T(ext), W(ildcard of unknown type).

Specifying which tuples to include for duplicate symbols is more complex because several options are involved. If -D is specified to be greater than 0, then the argument for -d signals which tuples should be created for duplication nodes. For example, -d VN specifies that duplication tuples for Variables and Numbers should be generated with the specific variable or number that is repeated being included in the first field. Similarly, the argument for -w signals which tuples with wildcards in the first field should be created for duplication nodes. For example, -w VOT specifies that duplication tuples for Variables, Operators, and Text should be generated with the corresponding typed wildcard in the first field. Finally, the argument for -D indicates whether or not the corresponding location tuples should also be generated.
Other arguments for the tuple generator indicate what window size to use for symbol pairs (-W) [5], whether (and where) document identifiers are included in the input (-docid), whether or not relational operators (equality et al.) are to serve as location anchors (-a), whether all the text outside each MathML expression is to be preserved (-c) or only math tuples should be returned without the text within which the math expressions appear, and whether or not wildcard equivalents for every tuple should be generated (during indexing) so that wildcards in a query can be matched (-s) [5].
With these changes, every run in this year's experiments includes the math tuple generator in its pipeline, using arguments that match the experimental conditions being investigated.
4. Search System Architecture
The goal of the 𝜇TextSearch engine (https://github.com/andrewrkane/mtextsearch) is to produce state-of-the-art research-level performance (both efficiency and effectiveness) with a very small amount of code. However, the variant used here only partially realizes this goal. In particular, query execution efficiency and extreme index compression have not yet been addressed.
The value in such a search engine is more than just demonstrating that complicated search engines might not add much value. Having an engine implemented in a small amount of code also makes it easier to understand the entire engine and, it is hoped, to implement changes. As such, this engine should make it easier to try new research experiments, as it did for us during the work presented in this paper.

4.1. Core Search Engine Components
The core 𝜇TextSearch engine is very simple and is constructed as a series of components that can be implemented (and understood) separately.
The mstrip component removes data that should not be indexed, such as tags and comments. A command line parameter allows interpretation of queries, converting each to a single line starting with the query identifier followed by a semicolon and then a list of tokens.
The minvert component splits the input data into documents and then into a stream of tokens of two types: text (stemmed using the Porter stemmer [11]) and math (each surrounded by # symbols). Document names and sizes are stored in a dictionary. The tokens are added to their respective in-memory postings lists stored in another dictionary. To reduce memory fragmentation, the dictionaries store a copy of the token keys flattened into a set of large sequential byte arrays. Each in-memory postings list is a byte array that grows by doubling its size and encodes a header followed by a variable-byte (vbyte [12]) compressed list of {delta-id, frequency} pairs, each indicating the increment for the next document id and the number of occurrences of the given term in that document. When the input data ends, the document names and sizes are output, followed by all pairings of tokens with their corresponding postings lists. Since each postings list is already stored compressed in a byte array, the relevant portion of that array can be written directly to the output stream. The format of the resulting mindex allows for the simple implementation of an mmerge component that combines indexes produced from separate data partitions.
The mencode component reads in an mindex file and outputs an mindex.meta file containing the total token sizes and a dictionary mapping tokens to the location of the corresponding postings list within the mindex file. The mindex.meta file allows the search engine to quickly start processing queries instead of having to read the entire mindex file at startup.
Note that by splitting the indexing into two steps, both the inversion and merge components are simplified. In addition, multiple variants of the encoding step can be implemented without changing the previous steps. For example, the vbyte encoding of the mindex file is compact enough for most purposes, but the encoding step could output the index to a separate file using a different compression format if needed. For faster query execution, skips over the postings lists could be added to the mindex or mindex.meta files.
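For concreteness, the following sketch (ours, not the 𝜇TextSearch code) appends one {delta-id, frequency} pair to such a byte array using one common variable-byte convention [12]: seven payload bits per byte, with the high bit set on the final byte of each value. The header and the array-doubling growth strategy mentioned above are omitted.

```python
def vbyte_encode(value: int, out: bytearray) -> None:
    """Append a non-negative integer: 7 payload bits per byte,
    high bit set on the last byte (one common vbyte convention)."""
    while value >= 0x80:
        out.append(value & 0x7F)
        value >>= 7
    out.append(value | 0x80)

def append_posting(postings: bytearray, last_docid: int, docid: int, freq: int) -> int:
    """Append a {delta-id, frequency} pair and return the new last document id."""
    vbyte_encode(docid - last_docid, postings)   # gap to this document id
    vbyte_encode(freq, postings)                 # occurrences of the term in it
    return docid

# Example: a term occurring twice in document 3 and once in document 10.
plist = bytearray()
last = append_posting(plist, 0, 3, 2)
last = append_posting(plist, last, 10, 1)
```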
Figure 5: Indexing and querying pipelines (the database and queries flow through construct, extract, mstrip, minvert, mmerge, and mencode to produce .mindex and .mindex.meta files, which msearch uses to produce query results).

The msearch component takes the mindex and mindex.meta files on the command line and loads the meta information into memory. Queries are read from stdin, one per line; then the postings lists' disk locations are found in the meta dictionary and read into memory. Next, the query is executed as an exhaustive-OR query (that is, scoring all documents found on all inverted lists associated with query terms), and the top-1000 results are written to stdout.
In order to make the core engine "math-aware," the tokenizer code (shared between the minvert and msearch components) must recognize the encoded math tuples. Additionally, the search ranking code must be extended to support the 𝛼 tuning parameter described in Section 4.3. Finally, to participate in the Lab, the query results must be converted to the specified formats.

4.2. Indexing and Querying Pipelines
Our system comprises several series of processing steps that can be piped together, connecting stdout from one step to stdin for the next. As a result, the output of any step can also be easily saved as a file for later reuse. The intermediate files passed from step to step can also be compressed using gzip and piped to the next step using gunzip -c.
The separation of our pipelines into steps allows some parallel processing, since each step is run in a separate operating system process. In addition, if the data is extracted in separate partitions, for example by year, then those partitions can be processed in parallel for many steps, then joined back together later in the pipeline.
Recall from Section 2 that documents are enclosed by TREC-like tags. With this convention in mind, the processing steps for indexing, as depicted in Figure 5, are as follows:
1. construct documents: ARQMath data files → trec(html+mathml)
2. extract math features: trec(html+mathml) → trec(html+mathtuples-mathml)
3. core engine indexing
   3.1. mstrip unused (tags, comments): trec(html+mathtuples) → trec(text+mathtuples)
   3.2. minvert and compress: trec(text+mathtuples) → .mindex file
   3.3. mmerge: (.mindex file)+ → .mindex file
   3.4. mencode metadata: .mindex file → .mindex.meta file
Our query pipeline follows similar steps to indexing. As depicted in Figure 5, the processing steps for searching are as follows:
1. construct queries: ARQMath query files → xml(qid+text+mathml)
2. extract math features: xml(qid+text+mathml) → xml(qid+text+mathtuples-mathml)
3. core engine searching
   3.1. mstrip unused (tags, comments): xml(qid+text+mathtuples) → (qid+text+mathtuples)*
   3.2. msearch: (qid+text+mathtuples)* + .mindex + .mindex.meta → query results

4.3. Ranking
Unlike Tangent-L, which uses BM25+ [6], the msearch engine ranks documents using standard BM25 [13]. It has been shown elsewhere that this difference is not likely to lead to significant differences in ranking effectiveness [14].
Given a collection 𝐷 containing 𝑁 documents and a query 𝑞 consisting of a set of query terms, the BM25 score for a document 𝑑 ∈ 𝐷 is defined as the sum of scores for each query term as follows:

$$\mathrm{BM25}(q,d) = \sum_{t\in q} \ln\!\left(\frac{N - df_t + 0.5}{df_t + 0.5} + 1\right) \cdot \frac{tf_{td}\,(k_1+1)}{tf_{td} + k_1\left(1 - b + b\,\frac{L_d}{L_{avg}}\right)} \qquad (1)$$

where the logarithmic factor is the inverse-document-frequency component, with df_t being the document frequency for 𝑡 (the number of documents in 𝐷 containing term 𝑡); the other factor is the term-frequency component, with tf_td being the term frequency of 𝑡 in document 𝑑, L_d being the document length, L_avg being the average document length, and 𝑘1 and 𝑏 being constants chosen to be 1.2 and 0.75, respectively, in the absence of specific data training.
Because the scoring function is a sum over the query terms, it can be split into two sums, one for math terms and the other for text terms, and a weighting parameter 𝛼 can be used to determine how heavily to rely on matching math terms:

$$\mathrm{BM25_w}(q_m \cup q_t, d) = \alpha \cdot \mathrm{BM25}(q_m, d) + (1-\alpha) \cdot \mathrm{BM25}(q_t, d) \qquad (2)$$

where q_m is the set of math terms in a query 𝑞, and q_t is the set of text terms in 𝑞. It turns out that setting an appropriate value for 𝛼 is critical to retrieval effectiveness. Note that the 𝛼 used here covers all math tuples, whereas our work from the previous year split some math tuples off under a separate tuning parameter.

4.4. Calculating Document Lengths
The calculation of BM25 depends on relative document length in the denominator of the term-frequency factor. The score should reflect the unexpectedness of seeing tf_td instances of term 𝑡 in document 𝑑, and the inclusion of L_d/L_avg is intended to reflect the fact that a longer document can be expected to have a higher term frequency. In standard search engines, the length of a document 𝑑 is measured by the number of terms in 𝑑. However, when there are two types of terms (math tuples and text tokens) and the formula is split into two sums as in Equation 2, should the number of text terms influence the likelihood of seeing tf_td instances of a math tuple, and should the number of math tuples influence the likelihood of seeing tf_td instances of a text term? Better effectiveness might result from using the number of math tuples in 𝑑 for the document length (and average document length) when calculating BM25 for q_m, and from using the number of text terms in 𝑑 when calculating BM25 for q_t. Disappointingly, this change does not seem to give improved ranking performance with either the ARQMath-1 or the ARQMath-2 benchmarks.
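To make the scoring concrete, the following sketch (ours; the data structures holding the collection statistics are assumptions) computes Equations (1) and (2) for a single document. The per-type document-length variant discussed in Section 4.4 would simply pass different length values to the two calls.

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_len, N, df, k1=1.2, b=0.75):
    """Equation (1): doc_tf maps term -> frequency in the document,
    df maps term -> document frequency in the collection of N documents."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
    return score

def bm25_weighted(q_math, q_text, doc_tf, doc_len, avg_len, N, df, alpha):
    """Equation (2): weight the math-tuple and text-term sums by alpha."""
    return (alpha * bm25(q_math, doc_tf, doc_len, avg_len, N, df)
            + (1.0 - alpha) * bm25(q_text, doc_tf, doc_len, avg_len, N, df))
```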
5. Experimental Results

5.1. Conventions for Naming MathDowser Runs in ARQMath-3
We refer to experimental runs using the following naming convention that reflects the parameter settings used.
L𝑖: All default values for the math tuple generator are used (Table 3), except that location tuples are created for symbol pairs, terminal symbols, compound symbols, and duplicate symbols for paths having fewer than 𝑖 nodes. Thus "L1" has no location tuples, "L8" uses all default values, and "L99" includes augmented location tuples for all symbol pairs, terminal symbols, compound symbols, and duplicate symbols.
L𝑖(𝑋1=𝑗1,...,𝑋𝑛=𝑗𝑛): Settings are like L𝑖, but with argument 𝑋𝑘 specified with value 𝑗𝑘. For example, "L8(D=0)" is like "L8" but with no duplicate symbols (i.e., -D 0), and "L99(a=d,s)" is like "L99" but with anchoring on equality operators disabled and all wildcard synonyms included (i.e., -a d -s).
For some experiments, all queries are executed against an index that was created with different parameters. In this case, the name of the run specifies the query parameters, followed by "on" and the parameters used for indexing. For example, "L1on8" indicates a run in which queries use parameters for "L1" against an index that was built using parameters for "L8," and "L99(a=d)on99(a=d,s)" indicates a query with parameters for "L99(a=d)" run against an index built with parameters "L99(a=d,s)."
Note that experiments run in the ARQMath-2 Lab with Tangent-L [7] use a configuration that is close to, but not identical to, "L99(a=d,w=)on99(a=d,s)": all four types of tuples are augmented with arbitrarily long locations, locations are anchored at the root only, duplicate tuples include the symbol that is duplicated but not the wildcard equivalents, and the index additionally includes all available wildcard variants (even though these are never used in ARQMath queries). Tangent-L also includes all possible pairings when generating tuples for duplicate symbols and creates tuples that attempt to account for commutative operations and symmetric relations, but these options have not been made available through the math tuple generator. (The superficial approach for handling commutative operations and symmetric relations considered last year has negligible effect on performance and is therefore deemed not worth re-implementing.)
In addition to setting parameters for the math tuple generator, a run must set the value of 𝛼 to specify the relative weight of math tuples vs. text terms (Equation 2). This is indicated in the run's name by appending "_a0𝑖𝑗" for 𝛼 = 0.𝑖𝑗. For example, "L8_a018" indicates a run using "L8" for generating math tuples and 𝛼 = 0.18.

5.2. Task 1: Finding Answers to Math Questions
For the main task, participants are given mathematical questions selected from MSE posts from year 2019 (for ARQMath-1), year 2020 (for ARQMath-2), or year 2021 (for ARQMath-3). Each question is formatted as a topic that contains a unique identifier, the title, the question body text, and the tags. Participant systems should return the top-1000 potential answer-posts from the MSE collection for each of the topics.
Experiments with the ARQMath-1 and ARQMath-2 benchmarks show that (1) when using a linear subset of tuples for duplicated nodes, storing both the form with the symbol that is duplicated and the form with the wildcard equivalent for that symbol gives the best performance; and (2) limiting the augmented locations for symbol pairs, terminal symbols, compound symbols, and duplicate symbols to having fewer than 8 nodes on the locating path outperforms other settings for locations. Figures 6 and 7 show the results for four of the experiments. (For comparison, the nDCG' value of the best performing configuration from 2021 is also included.) It is apparent from these figures that the value of 𝛼 has a substantial effect on the average nDCG' (and, although not illustrated here, on the average MAP' and P'@10 measures as well).

Figure 6: Effect of 𝛼 on performance for the ARQMath-1 Task 1 benchmark (nDCG' vs. 𝛼 for L8, L8on99(a=d,s), L99(D=0)on99(a=d,s), and L1on8, with the point (0.14, 0.513) marked; the previous year's best nDCG' = 0.457 is shown for comparison).

Figure 7: Effect of 𝛼 on performance for the ARQMath-2 Task 1 benchmark (nDCG' vs. 𝛼 for the same four configurations, with the points (0.18, 0.510) and (0.30, 0.507) marked; the previous year's best nDCG' = 0.462 is shown for comparison).
Examination of the effect of 𝛼 on individual queries reveals that larger values are better for some queries and smaller values are better for others. However, as in the past [8, 15], we could find no correlation between the best value for 𝛼 and the number of formulas in a query, the relative number of text terms in a query, the sizes of query formulas, the frequency of occurrence of math tokens or their inverse document frequencies, or any other query characteristics. For now, we have no better approach than choosing a value for 𝛼 heuristically, based on observed average performance on some training data.
Because re-indexing the corpus is quite time-consuming, our initial experiments varied the location length cutoff at query time, but ran against an index storing all locations. When we found that using a cutoff value of 8 improves performance, we created an index with that setting and found that running queries with cutoff 8 against an index with cutoff 8 (L8) outperformed running queries with cutoff 8 against an index with all locations stored (L8on99). However, we found that running queries with cutoff 1 against an index with cutoff 1 (L1) performed worse than running those queries against an index with cutoff 8 (L1on8)! We are surprised that using an index containing tuples that are never queried can substantially alter ranking performance. (This may be a similar effect to changes in ranking performance observed by others after text cleaning, i.e., removing various forms of non-content-related markup [16].) Looking at the BM25 ranking formula, the only explanation can be that the somewhat increased document sizes and average document size in the larger index have a noticeable effect on ranking scores. This portion of the ranking algorithm is also affected by the 𝑘1 and 𝑏 constants (Equation 1), where 𝑏 balances the document length normalization. Scoring matches instead with Anserini's default values (namely, 𝑘1 = 0.9, 𝑏 = 0.4, which are also the values used to compare BM25 implementations [14]) [17] produces similar performance for ARQMath-1, slightly better performance for ARQMath-2, and slightly worse performance for ARQMath-3. It would seem that, in spite of the disappointing results reported in Section 4.4, more experimentation regarding how to calculate document length and the BM25 constants is justified.
Figure 6 indicates that using L8 with 𝛼 ∈ [0.13, 0.20] might be suitable for ARQMath-3: on the ARQMath-1 benchmark, these values achieve improvements of at least 0.05 over last year's best nDCG' value. Figure 7 indicates that "L8_a018" may be a good configuration, with an improvement near 0.05 over last year's best nDCG' value for the ARQMath-2 benchmark. To vary our submitted runs, we also include "L8_a014" to cover the range indicated by the ARQMath-1 benchmark. Surprisingly, "L1on8" performs almost as well on the ARQMath-2 benchmark, even though it does not perform very well on the ARQMath-1 benchmark. In case the ARQMath-3 benchmark is more similar to ARQMath-2 than to ARQMath-1, we include "L1on8_a030" as another alternative.
In summary, for ARQMath-3, we prepared three automatic runs:
L8_a018: The primary submitted run, with all default values for the math tuple generator (including augmented location tuples for paths having fewer than 8 nodes) and 𝛼 = 0.18;
L8_a014: An alternate run that uses the same setup as the primary run, but with 𝛼 = 0.14; and
L1on8_a030: An alternate run that indexes the corpus using the default values (including location tuples for paths having fewer than 8 nodes), but searches it with no location tuples generated for the queries (i.e., "L1" for querying but "L8" for the index). For this configuration, 𝛼 is set to 0.30.
The results of these runs for all three years are shown in Table 4. In general, after parameter selection based on the ARQMath-1 and ARQMath-2 benchmarks, our updated system produces results that have a statistically significant (p<.001) improvement in nDCG', and substantially improved results for MAP' and P'@10, compared with those from last year's system over the previous two years' topics. (Throughout this section, statistical significance is determined by a paired t-test comparing the nDCG' scores for individual queries, assuming those scores are normally distributed. The p value indicates the likelihood that there is no difference in the average value over all possible queries.)

Table 4: Task 1: Evaluation of the MathDowsers runs in three years' ARQMath Labs and the best runs for other participants in ARQMath-3. Italics indicate MathDowsers runs for which parameters were tuned on the same data. (¶ = submitted primary run; * = submitted alternate run; M = manual run; † MAP' and P'@10 use H+M binarization.)

                                ARQMath-1 (77 Topics)      ARQMath-2 (71 Topics)      ARQMath-3 (78 Topics)
                                nDCG'   MAP'†   P'@10†     nDCG'   MAP'†   P'@10†     nDCG'   MAP'†   P'@10†
MathDowsers
  ¶ L8_a018                     0.511   0.261   0.307      0.510   0.223   0.265      0.474   0.164   0.247
  * L8_a014                     0.513   0.257   0.313      0.504   0.220   0.265      0.468   0.155   0.237
  * L1on8_a030                  0.482   0.241   0.281      0.507   0.224   0.282      0.467   0.159   0.236
MathDowsers (year 2021)
    duplicateTerms              0.457   0.207   0.267      0.462   0.187   0.241      0.447   0.159   0.236
  ¶ primary                     0.433   0.191   0.249      0.434   0.169   0.211      -       -       -
    holisticSearch              0.405   0.192   0.266      0.414   0.167   0.225      -       -       -
  * proximityReRank             0.373   0.117   0.131      0.335   0.081   0.049      -       -       -
MathDowsers (year 2020)
  * alpha05-noR                 0.345   0.139   0.162      -       -       -          -       -       -
  * alpha02                     0.301   0.069   0.075      -       -       -          -       -       -
  * alpha05-trans M             0.298   0.074   0.079      -       -       -          -       -       -
  ¶ alpha05                     0.278   0.063   0.073      -       -       -          -       -       -
  * alpha10                     0.267   0.063   0.079      -       -       -          -       -       -
Best runs for each other participating team (year 2022)
  ¶ Approach0 M                 0.462   0.244   0.321      0.460   0.226   0.296      0.514   0.219   0.349
  ¶ MSM                         0.422   0.172   0.197      0.381   0.119   0.152      0.504   0.157   0.241
  ¶ MIRMU                       0.466   0.246   0.339      0.487   0.233   0.316      0.498   0.184   0.267
  ¶ TU-DBS                      0.446   0.268   0.392      0.454   0.228   0.321      0.436   0.158   0.263
  ¶ DPRL                        0.508   0.467   0.604      0.533   0.460   0.596      0.283   0.067   0.101
  * DPRL                        0.587   0.519   0.625      0.582   0.490   0.618      0.278   0.055   0.022
  ¶ SCM                         0.254   0.102   0.182      0.197   0.059   0.149      0.257   0.060   0.119
Hybrids of primary runs
    Approach0_400+MSM M         -       -       -          -       -       -          0.594   0.234   0.345
    MIRMU_500+L8_a018           -       -       -          -       -       -          0.554   0.201   0.267

Although the performance of the system is nearly identical for ARQMath-1 and ARQMath-2, it apparently deteriorates slightly for all three configurations and with respect to all three effectiveness measures when used with the ARQMath-3 benchmark. Nevertheless, all three runs significantly outperform a run using last year's Tangent-L configuration (duplicateTerms, but run on this year's corpus), with p=.002, p=.018, and p=.016, respectively. The changes introduced this year are indeed effective in improving ranking performance.
The apparent deterioration in performance might be caused by the parameter values being overfit to the training data. However, an examination of the setting of 𝛼 for L8 and L1on8 shows performance curves shaped essentially like those in Figure 7, but with uniformly lower nDCG' values.
We conclude that the choice of 𝛼 is appropriate (at or near the peak performance), and so not overfit to the training data.
Comparing with other systems, we find that Approach0, which outperformed our system on Task 2 last year, also outperforms our system on Task 1 this year, as do runs from MSM and MIRMU. The differences in nDCG' scores (as compared to L8_a018) on the ARQMath-3 benchmark are statistically significant for Approach0 (p=.029) and for MSM (p=.037), but not for MIRMU (p=.103). Similarly, the difference in performance with respect to TU-DBS is statistically significant (p=.031).
The nDCG' scores vary substantially from topic to topic (with L8_a018 having a low of 0.028 for Topic A.327, a high of 0.781 for Topic A.325, a mean of 0.474, and a standard deviation of 0.165). With a closer look at the effectiveness breakdown by topic category in Table 5, we observe that, in spite of a different set of math topics being evaluated, the observed strengths and weaknesses are not much different from those of our best participant runs from previous years [4, 7], where we found that Text, Both, Concept, and Medium queries lagged the average nDCG' performance. Nevertheless, it appears that the system's performance substantially improves for topics that depend primarily on text and for topics concerned with mathematics of medium difficulty, and that it performs very well on topics fitting the new category Proof and Concept. A possible hypothesis that the decrease in performance is due to the increase in topics of medium difficulty (now 43 of 78 judged queries) is thus unjustified. There is a moderate correlation (with a Pearson coefficient of 0.573), however, between the number of judged answers among the top-1000 results from L8_a018 and nDCG'; perhaps the scores would be higher if more of our submitted answers had been judged.
Table 4 shows that on the ARQMath-2 benchmark, some systems outperform others with respect to P'@10 yet have poorer nDCG' ratings. In light of the fact that ensemble runs often perform better than their constituent systems [18], this leads us to propose an alternative ensemble scheme: combine two runs by taking the top-𝑘 results from the first run and the remainder from the second run (with duplicates removed and results limited to 1000). When exploring this simple approach on the primary runs of the different teams, we note that the nDCG' values improve substantially with 𝑘 values as high as 500. Two example hybrid runs are presented at the bottom of Table 4, giving the hybrid configurations with the best nDCG' scores from our exploration containing either a manual run or two automatic runs. We acknowledge that these particular hybrid configurations may be overfit to the ARQMath-3 benchmark, but we note that there are improvements from all the inter-team combinations we explored. There seems to be no substantial gain from combining two of our own runs, so the inter-team nDCG' improvement implies that the various teams' runs include many non-overlapping relevant results. Continued exploration of ensemble methods is certainly warranted.
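The hybrid scheme is simple to state precisely. The following sketch (ours) combines two ranked result lists for a single topic; for example, the Approach0_400+MSM row in Table 4 corresponds to k = 400 with Approach0 as the first run.

```python
def hybrid_run(run_a, run_b, k, limit=1000):
    """Take the top-k documents of run_a, then fill with the documents of
    run_b in rank order, skipping duplicates and truncating to `limit`."""
    combined = list(run_a[:k])
    seen = set(combined)
    for doc in run_b:
        if len(combined) >= limit:
            break
        if doc not in seen:
            combined.append(doc)
            seen.add(doc)
    return combined[:limit]
```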
Table 5: Category performance of the L8_a018 run in ARQMath-3. The best performance for each sub-category and each evaluation measure is highlighted in bold in the original table.

Topic category          Count   nDCG'   MAP'    P'@10
Overall                   78    0.474   0.164   0.247
Dependency
  Text                    10    0.544   0.144   0.190
  Formula                 22    0.516   0.231   0.354
  Both                    46    0.439   0.137   0.209
Topic Type
  Computation             21    0.481   0.202   0.286
  Concept                 16    0.491   0.151   0.244
  Proof                   36    0.455   0.147   0.222
  Proof and Concept        5    0.536   0.175   0.280
Difficulty
  Low                     17    0.489   0.226   0.312
  Medium                  43    0.472   0.147   0.233
  Hard                    18    0.467   0.149   0.222

5.3. Task 2: In-context Formula Retrieval
For Task 2, participants are asked to retrieve the top matching formulas, together with their associated posts, for each formula query chosen from the set of topics used for Task 1. As in 2021, the relevance of a retrieved formula is evaluated in context: both the associated post of a retrieved formula and the associated topic content of the formula query are presented to the assessors for evaluation. Assessments are then aggregated so that each visually distinct formula is assigned the maximum of the relevance scores of the instances for that formula in context. Thus, if any instance is judged to be relevant, the formula is deemed to be relevant: the performance of a system is determined by its performance with respect to visually distinct formulas only.
For ARQMath-3, we include three automatic runs:
L8: The primary submitted run over a corpus of visually distinct formulas only (thus 𝛼 is not needed, as there are no text terms), with the default settings for the math tuple generator; for each matching visually distinct formula, we simply return the first formula_id and post_id we encounter for that formula during our extraction;
L8_a035: An alternate run with the same settings as the primary run, but running over a corpus of visually distinct formulas augmented with text extracted from the LaTeX encoding, and with 𝛼 = 0.35, the value that maximized nDCG' for ARQMath-1; and
L8_a040: An alternate run with the same setup as above, but with 𝛼 = 0.40, the value that maximized nDCG' for ARQMath-2.

Table 6: Task 2: Evaluation of MathDowsers runs, the best participant runs, and baseline runs. Italics indicate MathDowsers runs for which parameters were tuned on the same data. (¶ = submitted primary run; * = submitted alternate run; M = manual run; † MAP' and P'@10 use H+M binarization.)
                                ARQMath-1                  ARQMath-2                  ARQMath-3
                                nDCG'   MAP'†   P'@10†     nDCG'   MAP'†   P'@10†     nDCG'   MAP'†   P'@10†
Baseline
    Tangent-S                   0.691   0.446   0.453      0.492   0.272   0.419      0.540   0.336   0.511
MathDowsers
    L8_a040 (cleaned) M         -       -       -          -       -       -          0.682   0.485   0.601
    L8_a035 (cleaned) M         -       -       -          -       -       -          0.681   0.485   0.601
    L8 (cleaned) M              -       -       -          -       -       -          0.672   0.477   0.601
  * L8_a040                     0.657   0.460   0.516      0.624   0.412   0.524      0.640   0.451   0.549
  * L8_a035                     0.659   0.461   0.516      0.619   0.410   0.522      0.640   0.450   0.549
  ¶ L8                          0.646   0.454   0.509      0.617   0.409   0.510      0.633   0.445   0.549
MathDowsers (year 2021)
  ¶ formulaBase                 0.562   0.370   0.447      0.552   0.333   0.450      -       -       -
  * docBase                     0.404   0.251   0.386      0.433   0.257   0.359      -       -       -
Best runs for each other participating team (year 2022)
  ¶ Approach0 M                 0.647   0.507   0.529      0.652   0.471   0.612      0.720   0.568   0.688
  ¶ DPRL                        0.648   0.480   0.502      0.569   0.368   0.541      0.694   0.480   0.611
  * DPRL                        0.733   0.532   0.518      0.550   0.333   0.491      0.575   0.377   0.566
  * XYPhoc                      0.492   0.316   0.433      0.448   0.250   0.435      0.472   0.309   0.563
  ¶ JU_NITS                     0.238   0.151   0.208      0.178   0.078   0.221      0.161   0.059   0.125
Best participant run (year 2021)
  * Approach0 M                 0.507   0.342   0.441      0.555   0.361   0.488      -       -       -
  * DPRL                        0.738   0.525   0.542      0.445   0.216   0.333      -       -       -
Best participant run (year 2020)
  * DPRL                        0.563   0.388   0.436      -       -       -          -       -       -

The results of all three runs over the three years' benchmarks are shown in Table 6, together with the baseline run, the best participant runs from the previous years' Labs, and the best run from each team for ARQMath-3. The differences in the nDCG' values among the best runs from each participating team are all statistically significant (p<.05), except that the difference between the best run from Approach0 and that from DPRL is not statistically significant (p=.106).
The three unsubmitted MathDowsers runs, annotated as "cleaned," manually correct an oversight in processing the ARQMath-3 queries, where several of the SLT formulas include invalid MathML (namely, instances of &, <, and > symbols that are not escaped). For all three configurations, performance is improved after cleaning: not enough to change the rankings, but the differences in performance of L8_a040 (cleaned) with respect to DPRL and Approach0 are no longer statistically significant (p=.243 and p=.053, respectively).
The improvements to the selection of math tuples this year result in L8 achieving better scores on the ARQMath-1 and ARQMath-2 benchmarks than any of last year's systems, including Tangent-L's formulaBase approach, which similarly searches a corpus of visually distinct formulas. Comparing the three MathDowsers runs, however, shows that including text from the LaTeX encoding of the formulas further improves performance, with statistically significant differences (p=.009 for latex_L8_a040 and p=.026 for latex_L8_a035). The success of the Approach0 team, which augments query formulas with manually chosen text terms, supports the hypothesis that augmenting math tuples with suitably chosen text is beneficial. We recommend exploring the inclusion of text terms that appear within formulas for Task 1, where some benefit might also accrue from cross-matching extracted text with terms found elsewhere in the documents.
It is possible that the approach of arbitrarily choosing a single instance of each formula to be assessed is overly naive.
Perhaps choosing candidate instances for assessment based on matching text, or even just presenting additional candidate instances chosen at random, would have resulted in more favourable assessments for the formulas matched by our system and thus higher scores. Unfortunately, this hypothesis can only be tested by performing additional assessments.

5.4. Efficiency
All experiments were run on a Mac OSX 10.11.6 laptop with an Intel Core i7-4770HQ Processor (4 cores, 8 threads, 2.2GHz up to 3.4GHz) and 16GB RAM with a 256GB flash disk.
Preparing the data for Task 1 using the L8 configuration is quite time-consuming, but the indexing itself is relatively efficient, as shown in the following table:

Step   Processing                     See Section   Time (sec)
1.     Construct - generate corpus    2.1           46581
2.     Extract math tuples            3.4           15272
3a.    mstrip                         4.1            2376
3b.    minvert                        4.1            4176
3c.    mencode                        4.1             154

Clearly, many steps in our indexing pipeline can take a long time to run. However, the data can be partitioned so that most individual steps can be split and run in parallel. In addition, multiple steps can run in parallel, each as a separate process. For the experiments presented in this paper, we reused the same constructed output stored in compressed files, so step 1 needs to be re-run only when the data itself changes.
The corpus construction steps for Task 1 output 15.9GB of data, which compresses down to 1.6GB using gzip. The resultant index size includes 1.9GB output from minvert, plus an additional 174MB of metadata output from mencode. This encoding step allows the msearch process to load the index information in less than one second. Currently the msearch exhaustive-OR query execution is slow: running all 298 queries from the three ARQMath years on this L8 index requires 3326 seconds, an average of 11 seconds per query.

6. Conclusions and Further Work
Replacing Tangent-L by a new search engine proved to be fairly straightforward, and the particular search engine we used proved itself to be effective. The indexing and query pipelines that we describe in this paper can be used with any search engine to make it "math-aware" and suitable for addressing both Task 1 and Task 2. Furthermore, using compressed data between each step of the indexing pipeline is a straightforward technique that greatly improves the efficiency of experiments when intermediate results are reused. There are also several opportunities to improve the engine's efficiency, including more compact representations for postings lists, smarter algorithms for combining postings lists, and incorporation of parallel and distributed engine technology.
Comparing this year's results to experiments from previous years shows that trying to match too many features for each formula can be detrimental to effectiveness, as well as wasting space in the index and processing time to make those matches. Furthermore, we have shown that the choice of the tuning parameter 𝛼 to balance the weight of math tuples against conventional text terms is crucial to effectiveness, and the optimal value depends on the parameter settings used for generating tuples. Unfortunately, determining which characteristics of the data and the queries affect the optimal value of 𝛼 remains somewhat of a mystery, and thus further work is required to understand the dependency. We believe that we have now pushed the traditional IR approach as far as we can, and that substantial gains in effectiveness will require new techniques rather than merely better tuning of the parameters we have introduced.
Because 𝛼 is tuned based on experiments using earlier benchmarks, choosing an appropriate value tailored to each query may be a perfect opportunity to apply machine learning to improve math-aware search systems. Ensemble methods to combine the results from several diverse approaches also hold much promise.

Acknowledgments
The NTCIR Math-IR dataset used as a source of relevant keywords when translating topics to queries for Task 1 [4] was made available through an agreement with the National Institute of Informatics. We thank the anonymous ARQMath participant reviewers for their constructive comments, which helped us to improve the explanations in this paper.

References
[1] B. Mansouri, V. Novotný, A. Agarwal, D. W. Oard, R. Zanibbi, Overview of ARQMath-3 (2022): Third CLEF lab on Answer Retrieval for Questions on Math, in: CLEF 2022, volume 13390 of LNCS, Springer, 2022.
[2] R. Zanibbi, D. W. Oard, A. Agarwal, B. Mansouri, Overview of ARQMath 2020 (updated working notes version): CLEF lab on answer retrieval for questions on math, in: CLEF 2020, volume 2696 of CEUR Workshop Proceedings, 2020.
[3] B. Mansouri, R. Zanibbi, D. W. Oard, A. Agarwal, Overview of ARQMath-2 (2021): Second CLEF lab on answer retrieval for questions on math, in: CLEF 2021, volume 12880 of LNCS, Springer, 2021, pp. 215–238.
[4] Y. K. Ng, D. J. Fraser, B. Kassaie, F. W. Tompa, Dowsing for math answers, in: CLEF 2021, volume 12880 of LNCS, 2021.
[5] D. J. Fraser, A. Kane, F. W. Tompa, Choosing math features for BM25 ranking with Tangent-L, in: DocEng 2018, 2018, pp. 17:1–17:10.
[6] Y. Lv, C. Zhai, Lower-bounding term frequency normalization, in: CIKM '11, 2011, pp. 7–16.
[7] Y. K. Ng, D. J. Fraser, B. Kassaie, F. W. Tompa, Dowsing for math answers to math questions: Ongoing viability of traditional MathIR, in: CLEF 2021, volume 2936 of CEUR Workshop Proceedings, 2021.
[8] Y. K. Ng, Dowsing for Math Answers: Exploring MathCQA with a Math-aware Search Engine, Master's thesis, University of Waterloo, Cheriton School of Computer Science, 2021.
[9] S. Büttcher, C. L. A. Clarke, G. V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
[10] D. Stalnaker, Math expression retrieval using symbol pairs in layout trees, Master's thesis, Rochester Institute of Technology, Golisano College of Computing and Information Sciences, 2013.
[11] M. F. Porter, An algorithm for suffix stripping, Program: Electron. Lib. and Infor. Sys. 14 (1980) 130–137.
[12] H. E. Williams, J. Zobel, Compressing integers for fast file access, Comput. J. 42 (1999) 193–201.
[13] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333–389.
[14] C. Kamphuis, A. P. de Vries, L. Boytsov, J. Lin, Which BM25 do you mean? A large-scale reproducibility study of scoring variants, in: ECIR 2020, Part II, volume 12036 of Lecture Notes in Computer Science, Springer, 2020, pp. 28–34.
[15] D. J. Fraser, Math Information Retrieval using a Text Search Engine, Master's thesis, University of Waterloo, Cheriton School of Computer Science, 2018.
[16] D. Roy, M. Mitra, D. Ganguly, To clean or not to clean: Document preprocessing and reproducibility, ACM J. Data Inf. Qual. 10 (2018) 18:1–18:25.
[17] P. Yang, H. Fang, J. Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: SIGIR 2017, ACM, 2017, pp. 1253–1256.
[18] V. Novotný, P. Sojka, M. Štefánik, D.
Lupták, Three is Better than One: Ensembling Math Information Retrieval Systems, in: CLEF 2020, volume 2696 of CEUR Workshop Proceedings, 2020.