=Paper=
{{Paper
|id=Vol-2696/paper_235
|storemode=property
|title=Three is Better than One: Ensembling Math Information Retrieval Systems
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_235.pdf
|volume=Vol-2696
|authors=Vít Novotný,Petr Sojka,Michal Štefánik,Dávid Lupták
|dblpUrl=https://dblp.org/rec/conf/clef/NovotnySSL20
}}
==Three is Better than One: Ensembling Math Information Retrieval Systems==
Three is Better than One Ensembling Math Information Retrieval Systems Vít Novotný, Petr Sojka, Michal Štefánik, and Dávid Lupták Math Information Retrieval Research Group https://mir.fi.muni.cz/ Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic sojka@fi.muni.cz, {witiko,stefanik.m,dluptak}@mail.muni.cz Abstract. We report on the systems that the Math Information Retrieval group at Masaryk University (MIRMU) prepared for tasks 1 (find answers) and 2 (formula search) of the ARQMath lab at the CLEF conference. We prototyped three primary MIR systems, proposed several math represen- tations to tackle the lab tasks, and evaluated the proposed systems and representations. We developed a novel algorithm for ensembling infor- mation retrieval systems that outperformed all our systems on task 1 and placed ninth out of the 23 competing submissions. Out-of-competition en- sembles of all non-baseline primary submissions in the competition made available by the participants placed first on task 1 and third on task 2. Our prototypes will help to understand the challenging problems of an- swer and formula retrieval in the STEM domain and bring the possibility of accurate math information retrieval one step closer. Keywords: Math information retrieval · question answering · math rep- resentations · word embeddings · ensembling “I do not demand that you make me happy; my happiness does not lie in you.” Anthony de Mello 1 Introduction The Math Information Retrieval (MIR) group at Masaryk University (MIRMU) is interested in challenges of MIR already for more than a decade. The challenges are numerous: 1. How to represent and index the meaning of formulae? 2. How to collate the representation of text and formulae together to grab and disambiguate the meaning of symbols in the formulae? Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22–25 September 2020, Thessaloniki, Greece. 3. How to pick the canonical representation of formulae in context for formu- lae retrieval? 4. How to integrate the inference into the information retrieval or question answering system? In the ARQMath competition, we tackled both task 1 (find answers) and task 2 (formula search). We researched and evaluated several ways of tackling the challenges [41] to push the boundaries of knowledge in MIR. We prototyped and submitted results for three primary systems, MIaS, SCM, and C OMPU BERT, and a secondary system, Formula2Vec.1 Our main objective was to gain insight into the problems above and we submitted five result lists for both tasks. We think that the key in solving the challenges lies in the accurate representation of meaning of texts and formulae in context—thus we opted to prototype systems that primarily tackle the math representation problem. We preprocessed the data and represented both given data and other STEM corpora as arXiv to learn the joint representations of text and math. We have compared several math formulae representations and used unsupervised ap- proaches for representation learning. Finally, we ensembled our primary sub- missions into a committee of MIR systems that was able to achieve better result on task 1 than any of the ensembled method alone. In this report we report in detail about our achievements: Section 2 describes in detail the resources we used and our math representations. Section 3 reports about using our venerable old MIaS system for ARQMath tasks. 
In sections 4 and 5, systems based on the embeddings of shallow multilayer perceptrons are evaluated. To push the envelope and catch up with recent trends in NLP, we experiment with using the Transformer architecture to produce math representations with our CompuBERT system in Section 6. The description of our ensembling algorithm can be found in Section 7. Our results, the insights we gained, and future directions are thoroughly discussed in sections 8 and 9.

“Don’t ask the world to change… you change first.” Anthony de Mello

2 Methods
In this section, we will describe the math representations ingested by our information retrieval systems, the corpora used for training the models that power our systems, and the relevance judgements we used for parameter optimization, model selection, and performance estimation.

2.1 Math Representations
Our math information retrieval systems ingest math formulae in a variety of formats: LaTeX, Presentation MathML, Content MathML, Symbol Layout Tree (SLT), Operator Tree (OPT), prefix, and infix. Figure 1 shows how the individual math representations are derived from LaTeX.
1 Formula2Vec is considered secondary, because it is a direct application of the Doc2Vec DBOW model [15] without any adaptation to the tasks of the ARQMath 2020 competition. It serves as a baseline for the architecturally-similar primary system, the SCM.

Fig. 1. Dependencies between math representations ingested by our systems.

LaTeX As the most direct math representation, we used LaTeX, which is the standard authoring format for math. Although LaTeX is easy to type, it encodes mostly just the presentation aspects of a math formula, not its content. LaTeX is also a Turing-complete language and therefore impossible to statically parse in the general case. As a result, each formula is represented as a single token in the LaTeX representation.
LaTeX is useful as a baseline math representation and as a basis for deriving the more fine-grained math representations described in the following sections. Although having each formula represented as a single token may not seem useful, two of our four systems (the SCM and CompuBERT) model subwords, which allows them to extract symbols out of the formulae.
To give an example, the formula x!! − y² = 0 would be represented as a single token $x!! - y^2 = 0$ in LaTeX.

MathML Using LaTeXML v0.8.4 [43] and MathML Canonicalizer, [8] we converted math formulae from LaTeX to MathML 3.0, [1] which contains two markups: Presentation MathML (PMML) and Content MathML (CMML).
2 https://dlmf.nist.gov/LaTeXML
3 https://github.com/MIR-MU/MathMLCan
Like LaTeX, PMML encodes mostly just the presentation aspects of a math formula. Unlike LaTeX, PMML is tokenized to individual math symbols. Unlike LaTeX and like PMML, CMML is tokenized. Unlike PMML, CMML is independent of the presentation aspects of a formula and encodes visually distinct but semantically equivalent formulae the same.
To give an example, the formula x!! − y² = 0 would be represented as the PMML and CMML documents in Figure 2.

Fig. 2. The PMML (left) and CMML (right) representations of the math formula x!! − y² = 0. Notice how the double factorial operator is recognized and how the numerical constants (cn) are differentiated from variable names (ci) in CMML.
Fig. 3. The SLT representation of the math formula x!! − y² = 0, where V! prefixes variable names, N! prefixes numerical constants, n stands for next, and a stands for above.
Fig. 4. The OPT representation of the math formula x!! − y² = 0, where U! prefixes commutative operators, O! prefixes non-commutative operators, V! prefixes variable names, N! prefixes numerical constants, 0 stands for the first operand, and 1 stands for the second operand.

M-Terms Using MIaSMath,4 we converted math formulae from MathML to mathematical tokens (M-Terms) by ordering, tokenization, and unification. [42] Although M-Terms can be extracted from both PMML and CMML, the CMML representation has produced the best results in previous MIR competitions. [30]
4 https://github.com/MIR-MU/MIaSMath
To give an example, the formula x!! − y² = 0 would be represented as the following comma-separated bag of M-Terms: x!! − y² = 0, x!! − y², 0, =, x!!, y², −, x, !!, y, 2, id1!! − id2² = 0, id1!! − id2², id1!!, id2², id1, id2, x!! − y^const = const, x!! − y^const, y^const, id1!! − id2^const = const, id1!! − id2^const, id2^const.

Symbol Layout Tree and Operator Tree Using our fork of Tangent-CFT5 [19] that recognizes all math operators in our corpora (see Section 2.2), we converted math formulae from PMML and CMML to the Symbol Layout Tree (SLT) and Operator Tree (OPT) math representations, respectively.
5 https://github.com/MIR-MU/TangentCFT
SLT corresponds to PMML with positional relations between math symbols. In our systems, we tokenize the SLT into paths in depth-first-search order. A path in the SLT is represented as a 4-tuple (initial node of the path, terminal node of the path, spatial relations between symbols on the path, spatial relations between symbols on the path from the SLT root to the initial node).
OPT corresponds to CMML with additional information about the types of math symbols. In our systems, we tokenize the OPT into paths in depth-first-search order. A path in the OPT is represented as a 4-tuple (initial node of the path, terminal node of the path, edge numbers on the path, edge numbers on the path from the OPT root to the initial node).
To give an example, the formula x!! − y² = 0 would be represented as the SLT tree in Figure 3 and the OPT tree in Figure 4.

Prefix and Infix To linearize the OPT, we converted math formulae from the OPT into the prefix and infix notations. The prefix notation corresponds to the list of visited nodes in the OPT in depth-first-search order, i.e. the topological sorting of the OPT. By parenthesizing operands and by recognizing infix operators, we also produce the infix notation.
Like LaTeX, the prefix and infix notations are easy to type. Unlike LaTeX, the prefix and infix notations are tokenized into math symbols and independent of the presentation aspects of a formula. In TF-IDF-based systems, the prefix and infix notations are virtually equivalent.
To give an example, the formula x!! − y² = 0 would be represented as the following space-separated list of tokens in the prefix notation: U!eq O!minus O!double-factorial V!x O!SUP V!y N!2 N!0 and as the following space-separated list of tokens in the infix notation: ( ( O!double-factorial ( V!x ) O!minus O!SUP ( V!y , N!2 ) ) U!eq N!0 ).
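The linearization itself is simple enough to sketch in a few lines of Python. The sketch below is our own illustration, not the Tangent-CFT fork used in our pipeline; the nested-tuple encoding of the OPT and the set of operators rendered in infix position are assumptions made for the sake of the example. It reproduces the prefix and infix token lists above:

```python
# A node is (label, children); leaves have an empty list of children.
OPT = ("U!eq",
       [("O!minus",
         [("O!double-factorial", [("V!x", [])]),
          ("O!SUP", [("V!y", []), ("N!2", [])])]),
        ("N!0", [])])

INFIX_OPERATORS = {"U!eq", "O!minus"}  # hypothetical set of recognized infix operators

def prefix(node):
    """Visit the OPT nodes in depth-first-search order."""
    label, children = node
    tokens = [label]
    for child in children:
        tokens += prefix(child)
    return tokens

def infix(node):
    """Parenthesize operands and place recognized infix operators between them."""
    label, children = node
    if not children:
        return [label]
    if label in INFIX_OPERATORS and len(children) == 2:
        return ["("] + infix(children[0]) + [label] + infix(children[1]) + [")"]
    tokens = [label, "("]
    for i, child in enumerate(children):
        if i:
            tokens.append(",")
        tokens += infix(child)
    return tokens + [")"]

print(" ".join(prefix(OPT)))
# U!eq O!minus O!double-factorial V!x O!SUP V!y N!2 N!0
print(" ".join(infix(OPT)))
# ( ( O!double-factorial ( V!x ) O!minus O!SUP ( V!y , N!2 ) ) U!eq N!0 )
```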
2.2 Corpora
For training our models, we used the arXMLiv and Math StackExchange corpora.
Our data preprocessing code is available online.6 ArXMLiv The arXMLiv 08.2019 corpus [9] contains 1,374,539 articles from the arXiv.org open-access archive converted from LATEX to HTML5 and MathML. The corpus is split into four subsets: no_problem (150,701 articles), warning_1 (500,000 articles), warning_2 (328,127 articles), and error (395,711 articles), ac- cording to the severity of errors encountered when converting LATEX to HTML5. For training our models, we only used the no_problem, warning_1, and warn- ing_2 subsets of the corpus. Math StackExchange The Math StackExchange corpus V1.2 provided by the organizers of the ARQMath 2020 competition contains 2,466,080 posts (ques- tions and answers) from the Math StackExchange question answering website in HTML5 and LATEX. In our conversion from LATEX to MathML, we successfully converted 28,320,920 math formulae from LATEX to 26,705,527 math formulae in CMML (94.3% success rate) and to 27,232,230 math formulae in PMML (96.16% success rate). Posts in the Math StackExchange corpus are layered and contain not only the body text, but also the title, tags, comments, up- and downvotes, view count, authorship information, etc. Each of our systems used different parts of the corpus, which will be specified in detail in the corresponding sections. 2.3 Competition Tasks In the ARQMath 2020 competition, we competed in tasks 1 and 2. Task 1: Find Answers Given a posted question as a query, task 1 is to search all answer posts in the Math StackExchange corpus and return relevant answers. There exist 1.4 million answers in Math StackExchange. For task 1, we submitted three primary submissions, which correspond to the three primary systems that we developed (MIaS, SCM, and C OMPU BERT), and two alternative submissions, which correspond to the Formula2Vec sec- ondary system and an ensemble of the three primary submissions. 6 https://github.com/MIR-MU/ARQMath-data-preprocessing Task 2: Formula Search Given a formula from a question as a query, search all question and answer posts in the Math StackExchange corpus for relevant formulae. There exist 28.3 million math formulae in Math StackExchange. For task 2, we submitted two primary submissions, which corresponded to the two systems designed with task 2 in mind (SCM and Formula2Vec), and three alternative submissions, which correspond to different configurations of SCM and Formula2Vec, and an ensemble of the two primary submissions. Both SCM and Formula2Vec were straightforwardly adapted from task 1 to task 2 by changing the indexing unit from an answer to a formula. Neither search engine takes the textual context of the indexed formulae into account. 2.4 Relevance Judgements In our experiments, we used two sets of relevance judgements, automatic for parameter estimation and model selection, and human-annotated for perfor- mance estimation. Automatic Since there were no relevance judgements for task 1 available dur- ing the development of our systems, we produced our own from the Math StackExchange corpus.7 The judgements served as a proxy for task 1 during parameter estimation and model selection in our systems. Each question in the Math StackExchange corpus is a topic, and its judged documents are all its answers with max(0, the number of upvotes minus the number of downvotes) as the gain. Answers to different questions are not judged. 
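A minimal sketch of this construction follows; it is our own illustration with hypothetical question and answer records, not the exact script behind the published votes-qrels files:

```python
def automatic_qrels(questions):
    """One topic per question; its own answers are the judged documents,
    with gain = max(0, upvotes - downvotes)."""
    qrels = {}
    for question in questions:                       # hypothetical post records
        qrels[question["id"]] = {
            answer["id"]: max(0, answer["upvotes"] - answer["downvotes"])
            for answer in question["answers"]
        }
    return qrels
```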
Out of all 797,652 questions in the Math StackExchange corpus with answers, we used 628,442 (78.79%) for training, 85,840 (10.76%) for parameter optimization, and 83,370 (10.45%) for model selection, i.e. for deciding which systems would get submitted to the competition.

Human-Annotated After the submission of our systems, official human-annotated task 1 and 2 relevance judgements produced by eight annotators with fair agreement (κ = 0.34) [13] were released.8 [47, Section 4.4] We used the relevance judgements for performance estimation of system configurations that we did not submit to the competition.
7 https://github.com/MIR-MU/ARQMath-eval, files votes-qrels-*.V1.0.tsv
8 https://github.com/MIR-MU/ARQMath-eval, files qrel_task{1,2}.tsv

Task 1 For each of the 77 questions excluded from the Math StackExchange corpus, the judged documents are on average 460 answers from Math StackExchange that were evaluated by human annotators with a range from 0 (not relevant) to 3 (highly relevant) as the gain.
The relevance judgements are highly imbalanced in favor of non-relevant answers: Out of all 39,124 judgements, there exist as many as 35,051 judgements (89.59%) with gain 0, 2,269 judgements (5.8%) with gain 1, 1,071 judgements (2.74%) with gain 2, and only 733 judgements (1.85%) with gain 3. The small number of relevant answers impedes the usefulness of the human-annotated relevance judgements for training supervised models for use in future systems.
Out of all 39,124 judgements, we used 31,541 (80.62%) for training out-of-competition system configurations, 3,560 (9.10%) for parameter optimization, and 4,023 (10.28%) for performance estimation.9
9 https://github.com/MIR-MU/ARQMath-eval, files qrel_task1-*.tsv

Task 2 For each of the 45 math formulae from questions excluded from the Math StackExchange corpus, the judged documents are on average 256 formulae from questions and answers in Math StackExchange that were evaluated by human annotators with a range from 0 (not relevant) to 3 (highly relevant) as the gain.
The relevance judgements are slightly imbalanced in favor of non-relevant formulae: Out of all 12,116 judgements, there exist 7,891 judgements (65.13%) with gain 0, 718 judgements (5.93%) with gain 1, 553 judgements (4.56%) with gain 2, and 2,954 judgements (24.38%) with gain 3.

2.5 Evaluation Measures
To measure information retrieval accuracy for parameter optimization, model selection, and performance estimation, we used the Normalized Discounted Cumulative Gain Prime (nDCG’) and Spearman’s ρ.

Normalized Discounted Cumulative Gain Prime The nDCG’ [32] is an evaluation measure specifically designed for information retrieval with incomplete relevance judgements. nDCG’ is defined as follows:

nDCG’ = avg_{t∈T} DCG’_t / IDCG_t,  IDCG_t = ∑_{i=1}^{|REL_t|} gain_t(REL_{t,i}) / log₂(i + 1),  DCG’_t = ∑_{i=1}^{|RES’_t|} gain_t(RES’_{t,i}) / log₂(i + 1),  (1)

where T are the topics for a task, REL_t is the list of judged documents for topic t in the descending order of their gain up to position 1,000, RES_t is the list of results produced for topic t by our system up to position 1,000, RES’_t = REL_t ∩ RES_t, and gain_t(R) is the gain of result R for topic t as specified by the relevance judgements.
To see if two nDCG’ values are significantly different, we construct two 95% Student’s t [44] confidence intervals from the values of (DCG’_t / IDCG_t)_{t∈T}. If the intervals do not overlap, we say that the higher nDCG’ value is significantly better.
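The computation can be illustrated with a short Python sketch. It is our own simplified illustration, not the official trec_eval implementation; qrels maps each topic to a dict from judged document to gain, and results maps each topic to its ranked result list:

```python
from math import log2

def ndcg_prime(qrels, results, cutoff=1000):
    """nDCG' averaged over topics; unjudged results are removed before scoring."""
    scores = []
    for topic, judgements in qrels.items():
        # Ideal DCG from the judged gains sorted in descending order
        ideal = sorted(judgements.values(), reverse=True)[:cutoff]
        idcg = sum(gain / log2(rank + 2) for rank, gain in enumerate(ideal))
        # RES': keep only judged results, in their original order
        judged = [doc for doc in results.get(topic, [])[:cutoff] if doc in judgements]
        dcg = sum(judgements[doc] / log2(rank + 2) for rank, doc in enumerate(judged))
        if idcg > 0:
            scores.append(dcg / idcg)
    return sum(scores) / len(scores)
```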
Spearman’s ρ Spearman’s ρ is a general non-parametric measure of rank correlation. Spearman’s ρ between random variables X and Y corresponds to Pearson’s r between the rank values rg_X and rg_Y of the random variables:

ρ = cov(rg_X, rg_Y) / (σ_{rg_X} · σ_{rg_Y})  and  cov(rg_X, rg_Y) = E[(rg_X − E[rg_X]) · (rg_Y − E[rg_Y])].  (2)

Unlike nDCG’, Spearman’s ρ cannot handle incomplete relevance judgements. Therefore, we only used Spearman’s ρ to evaluate our systems with the automatic relevance judgements, which are always complete for all answers of a question.
Throughout the paper, we refer to both nDCG’ and Spearman’s ρ as measurements of our systems’ accuracy. This should not be confused with the binary classification accuracy measure, which is computed as the proportion of correct predictions to all predictions and which we do not use.

“When you get rid of your fear of failure, your tensions about succeeding… you can be yourself. Relaxed. You’ll no longer be driving with your brakes on.” Anthony de Mello

3 Math Indexer and Searcher
In 2011, we developed and open-sourced the Math Indexer and Searcher (MIaS) system [40] based on Apache Lucene, [2] a high-performance full-text search engine library. We have deployed MIaS in the European Digital Mathematical Library,10 making it historically the first MIR system deployed in a digital mathematical library. [46] The architecture of MIaS is shown in Figure 5. In this section, we will describe MIaS and its results on task 1 of the ARQMath 2020 competition. Our experimental code is available online.11
10 https://eudml.org/search
11 https://github.com/MIR-MU/SCM-at-ARQMath, file MIaS-at-ARQMath.ipynb

Fig. 5. Top-level architecture of our Math Indexer and Searcher (MIaS) system.

3.1 Configuration
We represented answers from the Math StackExchange corpus as the text and math in their body text and comments, as produced by the HtmlGenerator.generate_answer method of the code provided by the ARQMath 2020 organizers.12 We represented math in answers using M-Terms extracted from CMML with structural unification disabled, since experimental results have shown that it is not suitable for general use. [28, 31]
Since the TF-IDF scoring function of MIaS [40, Section 3.1] is best suited for short and precise queries due to a lack of logarithmic TF factoring, [38, Section 2] we represented topics as the text and math in their title, ignoring the body text and tags. As in the indexing of answers, we represented math in topics using M-Terms extracted from CMML.
To improve the recall of MIaS, we use the Leave Rightmost Out (LRO) querying strategy [30, 16, 28] that expands the original query and produces a set of subqueries by leaving out text keywords and math subformulae. After expansion, we submit the subqueries to MIaS and interleave the partial results into a final result list.
We submitted the described configuration of MIaS as a primary submission.

3.2 Results
In this section, we will discuss the accuracy of MIaS using nDCG’ on the official human-annotated relevance judgements for task 1, and the speed of MIaS.
Accuracy In the described configuration, the MIaS system achieved the nDCG’ 0.155 on task 1, which is far from the best competing system, but close to the Tangent-S system that served as a baseline. [47, Table A1] Speed For task 1, the average run time of MIaS over all topics was 1.24 sec- onds, with the minimum of 0.1 seconds for topic A.55, and the maximum of 7.27 seconds for topic A.80. MIaS ran on a machine with 2 TiB free disk space, eight Intel Xeon™ X7560 2.26 GHz processors with a total of 32 CPU cores, 252 GiB RAM and no GPUs. “You have to understand, my dears, that the shortest distance between truth and a human being is a story.” Anthony de Mello 4 Soft Cosine Measure Since the seminal work of Mikolov et al., [21] unsupervised word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from TF - IDF and un- supervised word embeddings, such as the Soft Cosine Measure (SCM), [36] are 12 https://github.com/ARQMath/ARQMathCode, file generate_html_file.py fast [22] and achieve strong performance on semantic text similarity, [6] infor- mation retrieval, [10] entrance exam question answering, [36] and text classifi- cation [23] tasks, among others. In this section, we will describe the Soft Cosine Measure (SCM) system, which combines TF - IDF with unsupervised word embeddings to produce in- terpretable representations of math documents and math formulae that enable fast and interpretable information retrieval. We will also report the results of the SCM on tasks 1 and 2 of the ARQMath 2020 competition. Our experimental code is available online.13 4.1 Orthogonalized Joint Word Embeddings In mathematical discourse, math formulae are often more important than words for understanding. [14] Using a single representation for both text and math makes it possible to exploit zettabytes of text for training high-quality unsuper- vised word embeddings that can capture the relations between text with math. Drawing inspiration from the existing work about unsupervised word em- beddings for math, [12, 19] we produce unsupervised joint word embeddings for text and math tokens by training a fastText skipgram model [3] on text and math in a corpus. For the Math StackExchange corpus, we use the body text of posts. For the arXMLiv corpus, we use the full texts. We experiment with different representations of math: LATEX, SLT, OPT, prefix, and infix. Unlike the embeddings produced by Krstowski and Blei [12] and Man- souri et al., [19] our fastText embeddings are share weights between tokens that have subwords in common. This speeds up training convergence [3] and enables inference of embeddings for text and math tokens not seen during train- ing. To differentiate the token type (text or math) of subwords, we use a crude heuristic that lower-cases text tokens and upper-cases math tokens. Following the work of Novotný et al., [23, Section 4.2] we orthogonalize the word embeddings in order to improve retrieval accuracy and speed. 4.2 Document Similarity Measure The TF - IDF vector space model (VSM) [33] is a distributional semantics model that is fundamental to information retrieval. The VSM represents documents as coordinate vectors relative to an orthogonal basis, where the similarity between documents or formulae in tasks 1 or 2, respectively, is the cosine similarity be- tween their coordinate vectors. 
Although the VSM is fast and interpretable, it is highly susceptible to polysemy, since distinct text and math terms correspond to mutually orthogonal basis vectors. Therefore, documents that use different terminology will always be regarded as dissimilar.
To better model the polysemy, we use the TF-IDF soft vector space model (soft VSM). [36, 22] Unlike the VSM, the soft VSM assumes that documents are represented in a non-orthogonal basis, i.e. similar text and math terms correspond to basis vectors that make acute angles, see Figure 6.

Fig. 6. The representation of two documents, “Hi, world” and “Hello, world”, in the TF-IDF vector space model (VSM, left) and in the TF-IDF soft vector space model (soft VSM, right). In the VSM, different terms correspond to orthogonal axes, making the document representations distant despite their semantic equivalence. In the soft VSM, different terms correspond to non-orthogonal axes, where the angle between the axes is proportional to the similarity of terms in a word embedding space (middle).

Following the work of Novotný et al., [23, Section 3.2] we use the orthogonalized joint word embeddings as the source of similarity between terms. We use the implementation of fastText and the soft VSM in Gensim.14 [25]
13 https://github.com/MIR-MU/SCM-at-ARQMath, file json_to_fasttext_and_scm.py
14 https://radimrehurek.com/gensim/models/fasttext.html and https://radimrehurek.com/gensim/similarities/termsim.html
For task 1, we represent answers from the Math StackExchange corpus as the text and math in their body text. For math, we experiment with different representations: LaTeX, SLT, OPT, prefix, and infix. For task 2, we represent formulae from questions and answers in the Math StackExchange corpus as the math in the optimal representation selected for task 1. For both tasks 1 and 2, we represent topics as the text and math in their title, body text, and tags.

4.3 Parameter Optimization
To select the optimal parameters for the SCM on task 1, we used the nDCG’ on our relevance judgements (see sections 2.4 and 2.5) as the objective. Since we lacked relevance judgements for task 2, we used the optimal model for task 1 in both tasks.

Initialization To produce the word embeddings, we reproduced the experimental setup of Bojanowski et al.: [3, Section 4] hash table bucket size 2 · 10⁶, 300 vector dimensions, negative sampling loss with five negative samples, initial learning rate 0.05 with a linear decay to zero, sampling threshold 10⁻⁴, a window size of five, a minimum of five occurrences of a term, and character n-grams with n in {3, …, 6}. We did not model phrases15 [20, Section 4] and trained on the text and the no_problem subset of arXMLiv in the prefix notation for five epochs.
15 https://radimrehurek.com/gensim/models/phrases.html
To produce the orthogonalized word embeddings and the term similarity matrix, we used the optimal parameters suggested by Novotný et al.: [23, Table 2] a symmetric (Sym = ✓) and strictly diagonally dominant (Dom = ✓) term similarity matrix with a maximum of C = 100 non-zero elements in a row/column, and term similarity with threshold t = −1 and exponent o = 4. For the TF-IDF soft VSM, we used the dtn.dtn SMART weighting scheme.

Tree Search To optimize the parameters, we performed tree search: assuming the parameters were mutually independent, we varied only one of the parameters at each step of the search, and we selected the best value of the parameter using the nDCG’ on our relevance judgements.
In each step, we used the optimal values for parameters from previous steps and the default values (see Section 4.3) for parameters from future steps. For each step, we report the best parameter value together with the corresponding nDCG’ score in italics. None of the differences in nDCG’ were statistically significant.
In the first step, we selected the optimal representation of math from removing math tokens (0.7600), LaTeX (0.7602), OPT (0.7606), SLT (0.7607), and prefix/infix (0.7612). We were delighted to learn that whereas removing math tokens and just using the LaTeX representation produced the worst results, the novel prefix and infix notations led to improvements in the accuracy.
In the second step, we selected the optimal maximum length of phrases from one (0.7612), two (0.7613), three (0.7614), four (0.7612), five (0.7610), and concatenating all consecutive math tokens into a single math token (0.7603). The optimal value of three reproduces the result of Mikolov et al., [20, Section 4] who suggest using 2–4 iterations of bigram merging when modeling phrases. Notice also the similar accuracy received when concatenating all math tokens in the prefix/infix notation (0.7603) and when using LaTeX (0.7602).
In the third step, we optimized the hash bucket size from 1 · 10⁶ (0.7612), 2 · 10⁶ (0.7614), 4 · 10⁶ (0.7612), and 8 · 10⁶ (0.7611). Despite our worries, the default bucket size of 2 · 10⁶ appears to be sufficient to store the character n-grams for both text and math tokens.
In the fourth step, we performed an exhaustive grid search of the Sym ∈ {✓, ✗}, Dom ∈ {✓, ✗}, and C ∈ {0, 50, 100, 200, 400, 800, 1600} parameters, see Table 1. To break the tie, we selected the default parameters: Sym = ✓, Dom = ✓, and C = 100 (0.7614).

4.4 Evaluation
We trained the tiny, small, medium, and large models, and used them to produce the competition submissions.

Tiny model For the task 2 alternative submission, we used the model produced via parameter selection.

Table 1. The results of a grid search of the Sym ∈ {✓, ✗}, Dom ∈ {✓, ✗}, and C ∈ {0, 50, 100, 200, 400, 800, 1600} parameters using the nDCG’ on our relevance judgements as the objective. Optimal parameter values are bold. However, only the values that are italic were available before the competition due to time constraints.

   C     Sym=✓, Dom=✓   Sym=✗, Dom=✓   Sym=✓, Dom=✗   Sym=✗, Dom=✗
   0     0.7613         0.7613         0.7613         0.7613
  50     0.7614         0.7613         0.7613         0.7612
 100     0.7614         0.7613         0.7610         0.7610
 200     0.7614         0.7613         0.7611         0.7610
 400     0.7612         0.7613         0.7613         0.7609
 800     0.7614         0.7613         0.7609         0.7609
1600     0.7612         0.7613         0.8104         0.7609

Small model For the task 1 and 2 primary submissions, we trained fastText for ten epochs on both the Math StackExchange corpus and the no_problem subset of arXMLiv, and we used character n-grams with n in {4, 5}, shown to be optimal for English by Bojanowski et al. [3, Table 4] We also used the dtb.nnn SMART weighting scheme with slope s = 0.2 for the TF-IDF soft VSM instead of dtn.dtn, as suggested by Singhal, [39, Table 1] and we replaced the cosine normalization in the soft VSM document similarity function with a normalization function that maintains the ℓ2 norm of a document vector during a change of basis,16 which we theorize is a better fit for the pivoted normalization [37] in the dtb.nnn SMART weighting scheme.
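To make the pipeline of sections 4.1–4.3 concrete, here is a condensed Gensim sketch (Gensim 4.x assumed). The corpus_tokens, answer_tokens, and topic_tokens variables are hypothetical placeholders for the tokenized lower-cased text and upper-cased math; the dtb.nnn weighting, the changed normalization, and the post-hoc orthogonalization of the embeddings are omitted for brevity, so the sketch is closer to the dtn.dtn configuration than to the submitted small model:

```python
from gensim.corpora import Dictionary
from gensim.models import FastText, TfidfModel
from gensim.similarities import (SoftCosineSimilarity, SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

# Joint text and math embeddings (skipgram fastText, subword n-grams of length 4-5)
embeddings = FastText(corpus_tokens, vector_size=300, sg=1, negative=5, window=5,
                      min_count=5, min_n=4, max_n=5, bucket=2_000_000, epochs=10)

dictionary = Dictionary(answer_tokens)
tfidf = TfidfModel(dictionary=dictionary, smartirs="dtn")

# Term similarities derived from the word embeddings (threshold t, exponent o, Sym, Dom, C)
term_index = WordEmbeddingSimilarityIndex(embeddings.wv, threshold=-1.0, exponent=4.0)
similarity_matrix = SparseTermSimilarityMatrix(term_index, dictionary, tfidf=tfidf,
                                               symmetric=True, dominant=True,
                                               nonzero_limit=100)

# Soft VSM index over the answers; querying returns the 1,000 most similar answers
index = SoftCosineSimilarity(tfidf[[dictionary.doc2bow(a) for a in answer_tokens]],
                             similarity_matrix, num_best=1000)
top_answers = index[tfidf[dictionary.doc2bow(topic_tokens)]]  # (answer id, similarity) pairs
```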
Medium and large models After the submission deadline, we trained fastText for two epochs on both the Math StackExchange corpus and the no_problem and warning_1 subsets of arXMLiv (medium), and for ten epochs on both the Math StackExchange corpus and the no_problem, warning_1, and warning_2 subsets of arXMLiv (large). The medium and large models were not submitted to the competition and only serve for comparison. For task 1, the results are the closest 1,000 answers in the soft VSM. For task 2, the results are the closest 1,000 formulae in the soft VSM that did not originate from a comment. 4.5 Results In this section, we will discuss the accuracy of the SCM using nDCG’ on the official human-annotated relevance judgements for tasks 1 and 2, and the speed of the SCM. 16 https://github.com/RaRe-Technologies/gensim/pull/2783 Accuracy For task 1, only the small model was submitted out of the tiny (0.178), small (0.224), medium (0.237), and large (0.231). We conjecture that the poorer performance of the large model was due to training on the warning_2 subset of arXMLiv, which contains highly malformed LATEX documents. For task 2, both the tiny and small models were submitted out of the tiny (0.119), small (0.059), medium (0.078), and large (0.075). We conjecture that the poorer performance of all models larger than tiny was either due to the change from cosine normalization to pivoted normalization, or due to the reduced range of modeled character n-gram sizes. Speed For task 1, the average run time of the small, medium, and large models over all topics was 58.46 seconds with the minimum of 30.52 seconds for topic A.88 and the maximum of 502.84 seconds for topic A.35. For task 2, the average run time of the small, medium, and large models over all topics was 108.86 seconds with the minimum of 54.81 seconds for topic B.88 and the maximum of 2720.14 seconds for topic B.25. For the tiny model, the speed may be different due to the different normalization in the soft VSM document similarity function. The SCM ran on a machine with 2 TiB of free disk space, eight Intel Xeon™ X7560 2.26 GHz processors with a total of 32 CPU cores, 252 GiB RAM and no GPU s. For the SCM, all documents were stored in the RAM , and queries were processed using matrix multiplication on one CPU core without any index. In a production system, the SCM can be reduced to the run time of a TF - IDF-based system, [27, 29, 22] see the performance of MIaS in Section 3.2. “Thought can organize the world so well that you are no longer able to see it.” Anthony de Mello 5 Formula2Vec Many machine learning algorithms require the input to be represented as a fixed-length feature vector. Architecturally similar to the fastText model used in Section 4 by the SCM, the Doc2Vec DBOW model [15] is a staple in the area of text classification. The shallow multilayer perceptron can be trained on large volumes of text and infers embeddings for unseen retrieval units. In this section, we will describe the Formula2Vec system that uses Doc2Vec to infer document and formula embeddings for tasks 1 and 2, respectively. We will also report the results of Formula2Vec on tasks 1 and 2 of the ARQMath 2020 competition. Our experimental code is available online.17 5.1 Document Similarity Measure In Formula2Vec, documents and formulae are represented by document and formula embeddings produced by traning the Doc2Vec DBOW model [15] on text and math in a corpus. 
For the Math StackExchange corpus, we use the body 17 https://github.com/MIR-MU/SCM-at-ARQMath, file json_to_doc2vec.py text of posts. For the arXMLiv corpus, we use the full texts. To differentiate the token type (text and math) of n-grams extracted by fastText, we lower-case all text tokens and upper-case all math tokens. For task 1, we represent answers from the Math StackExchange corpus as the text and math in their body text. For task 2, we represent formulae from questions and answers in the Math StackExchange corpus as the math. For both tasks 1 and 2, we represent topics as the text and math in their title, body text, and tags. 5.2 Parameter Optimization Due to the architectural similarities between the SCM and Doc2Vec, we reused the results of the parameter optimization of the SCM, see Section 4.3: 300 vector dimensions and a minimum of five occurences of a term. We modeled phrases with a maximum length of two, and trained on the no_problem subset of arXM- Liv in the prefix notation for five epochs. The remaining parameters were taken from the searchisko information re- trieval system:18 300 vector dimensions, negative sampling loss with twelve negative samples, initial learning rate 0.1 with a linear decay to zero, and a window size of eight. We use the implementation of Doc2Vec in Gensim.19 [25] 5.3 Evaluation We trained the tiny, small, medium, and large models, and used them to pro- duce the competition submissions. Tiny model For the task 2 alternative submission, we used the model produced via parameter selection. Small model For the task 1 alternative submission and the task 2 primary sub- mission, we trained Doc2Vec for ten epochs on both the Math StackExchange corpus and the no_problem subset of arXMLiv. Medium and large models After the submission deadline, we trained Doc2Vec for two epochs on both the Math StackExchange corpus and the no_problem and warning_1 subsets of arXMLiv (medium), and for ten epochs on both the Math StackExchange corpus and the no_problem, warning_1, and warning_2 subsets of arXMLiv (large). The medium and large models were not submitted to the competition and only serve for comparison. For task 1, the results are the closest 1,000 answers in the soft VSM. For task 2, the results are the closest 1,000 formulae in the soft VSM that did not originate from a comment. 18 https://github.com/searchisko/searchisko, file Doc2Vec_wrapper.py 19 https://radimrehurek.com/gensim/models/doc2vec.html 5.4 Results In this section, we will discuss the accuracy of Formula2Vec using nDCG’ on the official human-annotated relevance judgements for tasks 1 and 2, and the speed of Formula2Vec. Accuracy For task 1, only the small model was submitted out of the tiny (0.101), small (0.050), medium (0.054), and large (0.074). We conjecture that the poorer performance of all models larger than tiny was due to training on the arXMLiv corpus, which contains scientific articles that are significantly longer and lexi- cally different compared to the short-form questions and answers in the Math StackExchange corpus. For task 2, both the tiny and small models were submitted out of the tiny (0.077), small (0.108), medium (0.063), and large (0.078). We conjecture that the poorer performance of the tiny and medium models is due to the small number of epochs (five and two, respectively) used for their training. In the Gensim implementation of Doc2Vec, the same number of epochs is also used to infer embeddings of unseen retrieval units. 
By contrast, both the small and large models used ten epochs for both training and inference. We conjecture that the poorer performance of the large model was due to training on the warning_2 subset of arXMLiv, which contains highly malformed LaTeX documents.

Speed For task 1, the average run time of the large model over all topics was 3.23 seconds, with the minimum of 3.14 seconds for topic A.11 and the maximum of 7.87 seconds for topic A.1. For task 2, the average run time of the large model over all topics was 164.5 seconds, with the minimum of 61.61 seconds for topic B.26 and the maximum of 5448.65 seconds for topic B.20. For the other models, the speed may be different due to the different number of epochs used for inference.
Like the SCM, Formula2Vec ran on a machine with 2 TiB of free disk space, eight Intel Xeon™ X7560 2.26 GHz processors with a total of 32 CPU cores, 252 GiB RAM, and no GPUs. As for the SCM, all documents were stored in the RAM, and queries were processed using matrix multiplication on one CPU core without any index. In a production system, Formula2Vec can be reduced to the run time of a TF-IDF-based system, [27] see the performance of MIaS in Section 3.2.

“If what you seek is Truth, there is one thing you must have above all else.” “I know. An overwhelming passion for it.” “No. An unremitting readiness to admit you may be wrong.” Anthony de Mello

6 CompuBERT
Our CompuBERT system aims to utilize the expressive power of pre-trained Transformer models [45] and the results of applying the Transformer architecture to complex math-related tasks, such as computing derivatives and first-order differential equations by representing math formulae in the prefix notation. [17] In this section, we will describe CompuBERT and its results on task 1 of the ARQMath 2020 competition. Our experimental code is available online.20
20 https://github.com/MIR-MU/CompuBERT

Fig. 7. CompuBERT model and Triplet objective, introduced by Reimers and Gurevych: [26] Informally, the system combines the Wordpiece embeddings [7] to create static representations of questions that are similar to the representations of relevant answers, while dissimilar to the representations of irrelevant answers.

6.1 Matching Questions with Answers
In addition to math representation, we have to contend with additional challenges characteristic of information retrieval but alien to Transformers: While the original Transformer architecture [45] builds upon the Wordpiece text segmentation [35] that optimizes the representation of subwords (not unlike fastText in the SCM), we also need to uniformly represent long spans of text.
We address this challenge with an approach introduced by Reimers and Gurevych [26] and shown in Figure 7. The underlying idea of their Sentence Transformers is to adjust the pre-trained language model so that it reflects the pairwise similarity of pieces of text, where this similarity is known.
This approach has been shown to reach state-of-the-art results on several document classification tasks, where it minimizes the pairwise distance of documents of the same class and maximizes the distance between documents in different classes.
Similarly, in order to get a unified representation of the spans of text representing both questions and answers, the Transformer architecture is extended with an additional pooling layer. The training of the system mimics the Siamese network architecture [34] with a Triplet objective to minimize the cosine distance of the similar blocks of text, as shown in Figure 7.
CompuBERT uses the Triplet objective as a proxy for the objective of task 1: to minimize the cosine distance of questions to their high-ranking answers, while maximizing the distance to their low-ranking answers. Specifically, we use the following objective:

minimize ∑_{i=1}^{|Qs|} ∑_{j=1}^{|A_i|} |1 − cos(q_i, a_ij) − dist_exp(q_i, a_ij)|,  (3)

where the expected distance dist_exp of an answer a_ij to the question q_i is the number of upvotes given to a_ij, standardized with respect to all answers to q_i.
Using a similar objective, we can fine-tune an arbitrary embedding model to respect either the inherent ranking provided by user feedback, or an explicit relevance ranking provided by relevance judgements.
Adapting the embeddings of the Transformer architecture to Information Retrieval is an active field of research. [24, 18, 5] However, CompuBERT is, to our knowledge, the first application of pairwise, multimodal (joint text and math) embedding optimization applied in Information Retrieval.

6.2 Model Training
We represent topics and questions and answers from the Math StackExchange corpus as the text and math in their body text. For math, we experiment with different representations: LaTeX, prefix, and infix.
In order to train the system to minimize the distance of relevant answers to a question, we used the Math StackExchange corpus (see Section 2.2) and our relevance judgements (see Section 2.4) with the nDCG’ and Spearman’s ρ as our objectives (see Section 2.5). We represent math in each question and answer using either LaTeX, the prefix notation, or the infix notation (see Section 2.1). Finally, we iterate over the data set 3–4 times until we reach convergence with respect to our objectives.
We performed 35 experiments with different configurations of CompuBERT, and we submitted the configuration of CompuBERT with the highest nDCG’ as a primary submission. See Table 2 for the nDCG’ of selected CompuBERT configurations. See Figure 8 for the learning curves of the submitted CompuBERT configuration.
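For illustration, such a training setup can be assembled with the sentence-transformers library roughly as follows. This is a sketch rather than our exact training script: it uses the library’s CosineSimilarityLoss, which regresses the cosine similarity of each question-answer pair towards a target label, as a stand-in for the objective in (3). The train_samples (question text, answer text, target similarity derived from standardized upvotes) and answers variables are hypothetical:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# BERT encoder with a mean-pooling head, following Reimers and Gurevych [26]
word_embedding_model = models.Transformer("bert-base-cased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# train_samples: hypothetical list of (question_text, answer_text, target_similarity)
train_examples = [InputExample(texts=[question, answer], label=float(target))
                  for question, answer, target in train_samples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=25)

# Pull each question-answer pair's cosine similarity towards its target
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=1000)

# Index all answers once and rank them for a topic by exact cosine similarity
answer_embeddings = model.encode(answers, convert_to_tensor=True)
topic_embedding = model.encode("Can anyone explain ...", convert_to_tensor=True)
similarities = util.cos_sim(topic_embedding, answer_embeddings)
```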
6.3 Results
In this section, we will discuss the accuracy of CompuBERT using nDCG’ on the official human-annotated relevance judgements for task 1, and the speed of CompuBERT.

Fig. 8. Spearman’s ρ and the corresponding loss (averaged over 20 batches) of training the submitted CompuBERT configuration. Based on the monitoring, we chose to use the model weights from after epoch 4, before the performance started to degrade.

Accuracy We expected the nDCG’ and Spearman’s ρ to reflect the same objective, and we observed that to be the case for most configurations: Out of the 18 configurations with both the nDCG’ and Spearman’s ρ measured, the ranking based on Spearman’s ρ diverged from the nDCG’ only for some three systems. However, the automatic and human-annotated relevance judgements for task 1 have proven to be divergent. The best configuration, reaching 0.78 nDCG’ on our relevance judgements, reached only 0.009 nDCG’ on the human-annotated relevance judgements for task 1, which is not significantly better than zero.
After the competition, we experimented with additional configurations:
1. We trained on the human-annotated relevance judgements, using weighted sampling to correct the class imbalance (see Section 2.4).
2. We replaced the standardization of dist_exp with linear scaling to [0; 1].
3. We replaced the Triplet objective with the objective of a related Transformer model that produces embeddings for questions and answers. [11, Section 4.a]
4. We assigned questions and answers different token types.
None of the configurations achieved significantly better nDCG’ than zero on the human-annotated relevance judgements for task 1.

Speed The submitted system was trained in 4 iterations over 1.4 million pairs of question-answer bodies with math represented in LaTeX. Each iteration took 6 hours and 30 minutes to finish, using an NVIDIA GTX 2080 Ti with 11 GiB of VRAM. We were able to fit batches of 25 question-answer pairs in memory in a training step and hence adjusted the weights based on that amount of data.

Table 2. The nDCG’ on our relevance judgements for selected CompuBERT configurations. The bert-base-cased configurations were pre-trained as described by Devlin et al., [7] and the bert-base-wikipedia-sections-mean-tokens and bert-base-nli-mean-tokens configurations were pre-trained using a similar objective to ours, as described by Reimers and Gurevych. In the explicit token_type configuration, we assigned a distinct mask for math tokens on input, as often used to distinguish between different input languages. [7]

#  Pre-trained checkpoint                    Representation  Text preprocessing   Epochs  nDCG’
1  bert-base-cased                           LaTeX           tag removal          4       0.7796
2  bert-base-wikipedia-sections-mean-tokens  LaTeX           tag removal          5       0.7688
3  bert-base-nli-mean-tokens                 LaTeX           tag removal          4       0.7669
4  bert-base-cased                           LaTeX           tag removal          9       0.7659
5  bert-base-cased                           LaTeX           none                 4       0.7653
6  bert-base-cased                           Prefix          tag removal          4       0.7651
7  bert-base-cased                           Prefix          tag removal          8       0.7645
8  bert-base-cased                           Infix           tag removal          4       0.7603
9  bert-base-cased                           LaTeX           explicit token_type  4       0.7598

We fine-tuned the pre-trained bert-base-cased snapshot of the BERT-base architecture. [7] Following the example of Reimers and Gurevych, [26] we chose the top-layer representation to be a 768-float vector.
The trained system was able to index all 1.4 million answers by inferring the static embeddings of their content, preprocessed in the same manner. Indexing using the same GPU takes roughly a quarter of the training time: 1 hour and 44 minutes. Once all the answers are indexed, the system consumes 27 GiB of RAM to keep the embeddings in memory. This can be further optimized using vector databases, if needed.
Retrieval for a single topic requires an inference of the topic embedding and the retrieval of a ranked list of answers based on the similarity of the embeddings. We use exact nearest neighbor search and compute the cosine similarity between the topic embedding and each of the indexed answers. A single embedding inference takes 0.004 seconds on GPU, and an exact-search retrieval of the requested 1,000 ranked answers, just as of all ranked answers, takes 3.426 seconds on average, 3.667 seconds maximum, and 3.198 seconds minimum.
“There are two ways to wash dishes: One is to wash them in order to make them clean; the other is to wash them in order to wash them.” Anthony de Mello

7 Ensemble
Different information retrieval systems can agree on a small portion of the most relevant documents, but each individual system will miss the great majority of relevant documents. With ensembling, we can combine the strengths of different information retrieval systems to produce more accurate results.
In this section, we describe a parameter-free algorithm for ensembling an arbitrary number of result lists into a single result list. The algorithm is agnostic to the scoring functions used by the different systems and only uses the ranks of results. Our experimental code is available online.21
21 https://github.com/MIR-MU/SCM-at-ARQMath, file combine_serps.py

7.1 Algorithm
Definition 1 (Frequency f). Let T be a topic. Let R be a result for T that appears in exactly f result lists. The frequency of R is f.
Definition 2 (Median Inverse Rank M⁻¹). Let T be a topic. Let R be a result for T that appears in any of the N result lists. Let M be the median of rank₁, rank₂, …, rank_N, where rank_i, 0 ≤ rank_i ≤ 1000, is the zero-based rank of R in the i-th result list if R appears in the i-th result list, and 1000 otherwise. The median inverse rank M⁻¹ of R is (1000 − M)/1000.
Definition 3 (Striped Inverse Rank S⁻¹). Let T be a topic. Let R be the k-th result for T (according to some criterion). Let S = rank_i, where i ≡ k (mod f) and rank_i, 0 ≤ rank_i ≤ 1000, is the zero-based rank of R in the i-th result list of the f result lists in which R appears. The striped inverse rank S⁻¹ of R is (1000 − S)/1000.
For each topic T, we combine the result lists of the systems, and we lexicographically order them using the four-tuple (M⁻¹, f, S⁻¹, result name). The top 1,000 ordered results are the ensembled result list for topic T.
The rationale of the algorithm is the following: If multiple results with the same M⁻¹ exist, then f is used to break the ties. If ties still exist, the tied results are interleaved using S⁻¹ to ensure fairness. The remaining ties are broken arbitrarily using the result name.
In our implementation, we only reported M⁻¹ as the score in the ensembled result lists. Since the trec_eval software22 used for evaluation disregards the reported ranks of results and infers them from the reported scores instead, the ensembled result lists were effectively reordered using the two-tuple (M⁻¹, result name) during evaluation. This flaw in trec_eval has only a limited effect on the nDCG’, since the four-tuple (M⁻¹, f, S⁻¹, result name) has already decided which 1,000 results would appear in the result list. See also the discussion of tie-breaking in information retrieval by Cabanac et al. [4]
22 https://github.com/usnistgov/trec_eval
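The algorithm can be sketched in a few lines of Python. This is our own illustration of the definitions above, not the submitted combine_serps.py; in particular, we instantiate the “some criterion” of Definition 3 as a provisional ordering by (M⁻¹, f, result name):

```python
from statistics import median

def ensemble(result_lists, top_k=1000, miss=1000):
    """Combine the ranked result lists of one topic by the four-tuple
    (median inverse rank M^-1, frequency f, striped inverse rank S^-1, result name)."""
    ranks = {}                                   # result -> {list index: zero-based rank}
    for i, results in enumerate(result_lists):
        for rank, result in enumerate(results[:miss]):
            ranks.setdefault(result, {})[i] = rank

    def inverse_median_rank(result):             # M^-1; missing results count as rank 1000
        observed = [ranks[result].get(i, miss) for i in range(len(result_lists))]
        return (miss - median(observed)) / miss

    # Provisional ordering that defines the stripe index k of Definition 3
    candidates = sorted(ranks, key=lambda r: (-inverse_median_rank(r), -len(ranks[r]), r))

    def sort_key(item):
        k, result = item
        frequency = len(ranks[result])
        lists_with_result = sorted(ranks[result])
        stripe_rank = ranks[result][lists_with_result[k % frequency]]   # S of Definition 3
        return (-inverse_median_rank(result), -frequency,
                -(miss - stripe_rank) / miss, result)

    merged = sorted(enumerate(candidates), key=sort_key)
    return [result for _, result in merged[:top_k]]
```

For our task 1 submission, result_lists would hold, for each topic, the ranked answer lists produced by MIaS, the SCM, and CompuBERT.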
7.2 Evaluation
For task 1, we ensembled our three primary submissions: MIaS (0.155, see Section 3), the small model of the SCM (0.224, see Section 4), and CompuBERT (0.009, see Section 6). To assess the capabilities of our ensembling algorithm, we also ensembled all non-baseline primary submissions submitted to task 1 by the ARQMath participants: alpha05 (0.278), PSU 2 (0.228), SCM (0.224), MIaS (0.155), CompuBERT (0.009), zbMATH (0.101), and DPLR 2 (0.051). [47, Table A1]
For task 2, we also ensembled our two primary submissions: the small model of the SCM (0.059, see Section 4), and the small model of Formula2Vec (0.108, see Section 5). To assess the capabilities of our ensembling algorithm, we also ensembled all non-baseline primary submissions submitted to task 2 by the ARQMath participants: TangentCFTED (0.420), Formula2Vec (0.108), and SCM (0.059). [47, Table A2] The primary submission of formulaembedding (0.026) was not included in the ensemble, since it was not shared by the participants.

7.3 Results
In this section, we will discuss the tie-breaking performance of ensembling, and the accuracy of ensembling using nDCG’ on the official human-annotated relevance judgements for task 1. Since the ensembling was applied on the result lists as a post-processing step, speed cannot be meaningfully reported.

Breaking Ties Table 3 shows how successful our algorithm was in breaking ties. 59.29% of results in task 1 and 47.56% of results in task 2 only had their ties broken using their names. A fairer tie-breaking strategy would have been to combine and lexicographically order the results R using the four-tuple (M⁻¹, f, S⁻¹, index of R’s originating result list), so that tied results from different input result lists are interleaved.

Table 3. Percentages of ties in the submitted ensembled result lists when ordering the result lists by just the first few ordering criteria.

          (M⁻¹)     (M⁻¹, f)   (M⁻¹, f, S⁻¹)   (M⁻¹, f, S⁻¹, result name)
Task 1    92.76%    92.75%     59.29%          0.00%
Task 2    48.33%    47.72%     47.56%          0.00%

Accuracy For task 1, the ensemble of our three primary submissions achieved the highest accuracy of all our systems (0.238). The ensemble of all non-baseline primary submissions23 received the highest accuracy in the competition (0.419), significantly better than the alpha05noReRank alternative submission of the MathDowsers team (0.345). [47, Table A1]
For task 2, the ensemble of our two primary submissions achieved a slightly lower accuracy (0.100) than Formula2Vec (0.108). The ensemble of all non-baseline primary submissions24 received the third highest accuracy in the competition (0.327).
23 https://github.com/MIR-MU/SCM-at-ARQMath, file ensemble-task1.tsv
24 https://github.com/MIR-MU/SCM-at-ARQMath, file ensemble-task2.tsv
There was no difference in nDCG’ for either task when using the fairer tie-breaking strategy described in the previous section. This is because of the arbitrary reordering of results by trec_eval discussed in Section 7.1: the fairer tie-breaking did not alter the top 1,000 results present in the ensembled result lists.
Figure 9 shows the percentage of judged results at different ranks for all our submissions25 and offers an alternative interpretation of the surprisingly high accuracy of the ensemble: Due to the pooling of submitted results before annotation, [47, Figure 2] the ensembled result lists have significantly more judged results than any other submission. Since there is no penalty to the nDCG’ for adding non-relevant results (i.e. gain(RES’_i) = 0) at the end of a result list (see Section 2.5), having more judged results is beneficial even if few are relevant. However, the high mean relevance of the first ten results (see Figure 10) and the best P@10 among our systems [47, Tables A1 and A2] indicate that this is not the main cause of the high accuracy of the ensemble.
25 https://github.com/MIR-MU/ARQMath-eval, file mean-relevance-and-percentages-judged.ipynb

“Meaning is only found when you go beyond meaning.” Anthony de Mello

8 Results
In this section, we will describe the accuracy and the speed of all our systems on tasks 1 and 2.
Table 4. The nDCG’ for the primary and secondary submissions of all our systems on tasks 1 (left) and 2 (right), as well as the best nDCG’ among the submitted and out-of-competition system configurations. Best and second best systems in a row are bold and italic, respectively.

Task 1         MIaS    SCM     F2Vec   CBRT    Ens.      Task 2         SCM     F2Vec   Ens.
Best           0.155   0.237   0.101   0.009   0.419     Best           0.119   0.108   0.327
Primary        0.155   0.224   –       0.009   –         Primary        0.059   0.108   –
Alternative    –       –       0.050   –       0.238     Alternative    0.119   0.077   0.100

Table 5. The average, minimum, and maximum run time in seconds for all our systems to produce results for a task 1 (left) and 2 (right) topic. Best and second best results in a row are bold and italic, respectively.

Task 1      MIaS    SCM      F2Vec   CBRT      Task 2      SCM       F2Vec
Minimum     0.1     30.52    3.14    3.2       Minimum     54.81     61.61
Average     1.24    58.64    3.23    3.43      Average     108.86    164.5
Maximum     7.27    502.84   7.87    3.67      Maximum     2720.14   5448.65

Fig. 9. The percentage of judged results at different ranks for all our task 1 (above) and 2 (below) submissions. Whereas other alternative submissions only have the first 20 and ten results judged for tasks 1 and 2, respectively, due to the pooling of submissions, the ensemble has significantly more judged results, which may put it at an unfair advantage.

Fig. 10. The mean relevance (gain) for all our task 1 (above) and 2 (below) submissions. For task 1, the ensemble has the highest mean relevance at ten, except for ranks 1 and 3.

8.1 Accuracy
Table 4 shows the nDCG’ of all our systems on both tasks. The primary and alternative rows only show the results of our submissions, whereas the best row also includes the results of our out-of-competition system configurations.
On task 1, the best configuration of the SCM is significantly better than MIaS, Formula2Vec, and CompuBERT. On both tasks, the best configuration of the ensemble is significantly better than the ensembled systems.

8.2 Speed
Table 5 shows the run times of all our systems on task 1 and 2 topics. In task 1, MIaS was the fastest, closely followed by Formula2Vec and CompuBERT. In task 2, the SCM was faster than Formula2Vec, which is likely due to the larger number of indexing units and the fact that the SCM uses sparse matrix multiplication, which becomes significantly faster than dense matrix multiplication as the matrices grow large.

“Wisdom tends to grow in proportion to one’s awareness of one’s ignorance.” Anthony de Mello

9 Conclusion and Future Work
In our work, we have introduced three significantly different MIR systems. The methods based on the vector space model (SCM and MIaS) have shown useful transferability to other tasks as well as interpretability, since they are grounded in concrete term matching.
“Wisdom tends to grow in proportion to one’s awareness of one’s ignorance.” Anthony de Mello

9 Conclusion and Future Work

In our work, we have introduced three significantly different MIR systems. The methods based on the vector space model (SCM and MIaS) have shown useful transferability to other tasks as well as interpretability, since they are grounded in concrete term matching.

While CompuBERT has demonstrated an outstanding ability to learn the specific task of predicting the number of votes of an answer based on its textual and mathematical content, we have also observed an unexpected sensitivity to the training objective: CompuBERT significantly outperformed the other systems on our relevance judgements, but finished last on task 1 of the ARQMath 2020 competition. The possible reasons include overfitting, underfitting, and divergent objectives.

Thanks to the ARQMath competition, we will be able to identify such failures and further fine-tune our systems against the provided human-annotated relevance judgements, which we believe reflect the quality of a MIR system better than the artificial relevance judgements we had to rely on while developing our systems.

Our most surprising finding is that three is better than one: in terms of accuracy, ensembling the results of our primary submissions outperformed all three of our primary systems on task 1 of the ARQMath 2020 competition. Ensembling all non-baseline primary submissions for task 1 achieved the best score in the competition by a statistically significant margin.

Question answering in a complex domain such as STEM is a very challenging task. Looking at it from diverse viewpoints is a good thing, and so is compounding and merging diverse approaches and their results! Ensembling worked for the ARQMath tasks, and it may work for other complex tasks in general.

Acknowledgements

The first author’s work was graciously funded by the South Moravian Centre for International Mobility as a part of the Brno Ph.D. Talent project. We also thank the three anonymous reviewers for their insightful comments. Last but not least, we extend our gratitude to the ARQMath 2020 organizers for their great professionalism and patience.

References

1. Ausbrooks, R., Buswell, S., Carlisle, D., Chavchanidze, G., Dalmas, S., Devitt, S., Diaz, A., Dooley, S., Hunter, R., Ion, P., Kohlhase, M., Lazrek, A., Libbrecht, P., Miller, B., Miner, R., Rowley, C., Sargent, M., Smith, B., Soiffer, N., Sutor, R., Watt, S.: Mathematical Markup Language (MathML) version 3.0 2nd edition (2014), https://www.w3.org/TR/MathML3/Overview.html
2. Białecki, A., Muir, R., Ingersoll, G., Imagination, L.: Apache Lucene 4. In: SIGIR 2012 Workshop on Open Source Information Retrieval. p. 17 (2012)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
4. Cabanac, G., Hubert, G., Boughanem, M., Chrisment, C.: Tie-breaking bias: effect of an uncontrolled parameter on information retrieval evaluation. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 112–123. Springer (2010)
5. Chang, W.C., Yu, F.X., Chang, Y.W., Yang, Y., Kumar, S.: Pre-training Tasks for Embedding-based Large-scale Retrieval. arXiv e-prints arXiv:2002.03932 (Feb 2020)
6. Charlet, D., Damnati, G.: SimBow at SemEval-2017 task 3: Soft-cosine semantic similarity between questions for community question answering. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 315–319 (2017)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Formánek, D., Líška, M., Růžička, M., Sojka, P.: Normalization of digital mathematics library content. In: Davenport, J., Jeuring, J., Lange, C., Libbrecht, P. (eds.) 24th OpenMath Workshop, 7th Workshop on Mathematical User Interfaces (MathUI), and Intelligent Computer Mathematics Work in Progress. pp. 91–103. No. 921 in CEUR Workshop Proceedings, Aachen (2012), http://ceur-ws.org/Vol-921/wip-05.pdf
9. Ginev, D.: arXMLiv:08.2019 dataset, an HTML5 conversion of arXiv.org (2019), https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/, SIGMathLing – Special Interest Group on Math Linguistics
10. González Barbosa, J.J., Frausto-Solis, J., Villanueva, D., Valdés, G., Florencia, R., González, L., Mata, M.: Implementation of an Information Retrieval System Using the Soft Cosine Measure, vol. 667, pp. 757–766. Springer (Dec 2017), https://doi.org/10.1007/978-3-319-47054-2_50
11. Jernite, Y.: Explain Anything Like I’m Five (June 2020), https://github.com/huggingface/notebooks/blob/master/longform-qa/Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb
12. Krstovski, K., Blei, D.M.: Equation Embeddings. arXiv e-prints (Mar 2018)
13. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics pp. 159–174 (1977)
14. Larson, R.R., Reynolds, C., Gey, F.C.: The abject failure of keyword IR for mathematics search: Berkeley at NTCIR-10 math. In: Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-10, National Center of Sciences, Tokyo, Japan, June 18–21, 2013 (2013)
15. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. CoRR abs/1405.4053 (2014), http://arxiv.org/abs/1405.4053
16. Líška, M., Sojka, P., Růžička, M.: Combining text and formula queries in math information retrieval: Evaluation of query results merging strategies. In: Proceedings of the First International Workshop on Novel Web Search Interfaces and Systems. pp. 7–9. NWSearch ’15, Association for Computing Machinery, New York, NY, USA (2015), https://doi.org/10.1145/2810355.2810359
17. Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., Liu, T.Y.: Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762 (2019)
18. Luan, Y., Eisenstein, J., Toutanova, K., Collins, M.: Sparse, Dense, and Attentional Representations for Text Retrieval. arXiv e-prints arXiv:2005.00181 (Apr 2020)
19. Mansouri, B., Rohatgi, S., Oard, D.W., Wu, J., Giles, C.L., Zanibbi, R.: Tangent-CFT: An Embedding Model for Mathematical Formulas. In: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. pp. 11–18 (2019)
20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)
21. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013)
22. Novotný, V.: Implementation notes for the soft cosine measure. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). pp. 1639–1642. Association for Computing Machinery, Torino, Italy (2018). https://doi.org/10.1145/3269206.3269317
23. Novotný, V., Ayetiran, E.F., Štefánik, M., Sojka, P.: Text classification with word embedding regularization and soft similarity measure (2020), https://arxiv.org/abs/2003.05019
24. Qadrud-Din, J., Bah Rabiou, A., Walker, R., Soni, R., Gajek, M., Pack, G., Rangaraj, A.: Transformer Based Language Models for Similar Text Retrieval and Ranking. arXiv e-prints arXiv:2005.04588 (May 2020)
25. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 Workshop New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010). https://doi.org/10.13140/2.1.2393.1847
26. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1410, https://www.aclweb.org/anthology/D19-1410
27. Rygl, J., Pomikálek, J., Řehůřek, R., Růžička, M., Novotný, V., Sojka, P.: Semantic vector encoding and similarity search using fulltext search engines. In: Proceedings of the 2nd Workshop on Representation Learning for NLP. pp. 81–90. Association for Computational Linguistics, Vancouver, Canada (Aug 2017). https://doi.org/10.18653/v1/W17-2611
28. Růžička, M.: Maths Information Retrieval for Digital Libraries. Ph.D. thesis, Masaryk University, Faculty of Informatics, Brno (2018), https://is.muni.cz/th/pxz4q/
29. Růžička, M., Novotný, V., Sojka, P., Pomikálek, J., Řehůřek, R.: Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines. In: HybridSemStats Workshop Proceedings. vol. 1923, pp. 1–12 (2017)
30. Růžička, M., Sojka, P., Líška, M.: Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy. In: Joho, H., Kishida, K. (eds.) Proc. of the 11th NTCIR Conference on Evaluation of Information Access Technologies. pp. 127–134. NII, Tokyo, Japan (Dec 2014)
31. Růžička, M., Sojka, P., Líška, M.: Math Indexer and Searcher under the Hood: Fine-tuning query expansion and unification strategies. In: Kando, N., Kishida, K., Kato, M.P., Yamamoto, S. (eds.) Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies. pp. 331–337. National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430 Japan, Tokyo (2016)
32. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval 11(5), 447–470 (2008)
33. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 513–523 (1988)
34. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 815–823 (2015). https://doi.org/10.1109/CVPR.2015.7298682
35. Sennrich, R., Haddow, B., Birch, A.: Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1715–1725. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1162, https://www.aclweb.org/anthology/P16-1162
36. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)
37. Singhal, A., Buckley, C., Mitra, M.: Pivoted document length normalization. In: ACM SIGIR Forum. vol. 51, pp. 176–184. ACM, New York, NY, USA (2017)
38. Singhal, A., Choi, J., Hindle, D., Lewis, D.D., Pereira, F.: AT&T at TREC-7. In: Proceedings of the Seventh Text REtrieval Conference (TREC-7). pp. 239–252 (July 1999)
39. Singhal, A., et al.: Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24(4), 35–43 (2001)
40. Sojka, P., Líška, M.: The Art of Mathematics Retrieval. In: Proceedings of the ACM Conference on Document Engineering, DocEng 2011. pp. 57–60. Association for Computing Machinery, Mountain View, CA, USA (Sep 2011), http://doi.acm.org/10.1145/2034691.2034703
41. Sojka, P., Novotný, V., Ayetiran, E.F., Lupták, D., Štefánik, M.: Quo Vadis, Math Information Retrieval. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. pp. 117–128 (2019), https://nlp.fi.muni.cz/raslan/2019/paper11-sojka.pdf
42. Sojka, P., Růžička, M., Novotný, V.: MIaS: Math-Aware Retrieval in Digital Mathematical Libraries. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). pp. 1923–1926. Association for Computing Machinery, Torino, Italy (2018). https://doi.org/10.1145/3269206.3269233
43. Stamerjohanns, H., Kohlhase, M., Ginev, D., David, C., Miller, B.: Transforming large collections of scientific publications to XML. Mathematics in Computer Science 3(3), 299–307 (2010)
44. Student: The probable error of a mean. Biometrika 6(1), 1–25 (1908)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is All you Need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30. pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
46. Wojciechowski, K., Nowiński, A., Sojka, P., Líška, M.: The EuDML Search and Browsing Service – Final (Feb 2013), deliverable D5.3 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, revision 1.2, https://project.eudml.eu/sites/default/files/D5_3_v1.2.pdf
47. Zanibbi, R., Oard, D.W., Agarwal, A., Mansouri, B.: Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math (2020)