=Paper=
{{Paper
|id=Vol-2696/paper_271
|storemode=property
|title=Overview of ARQMath 2020 (Updated Working Notes Version): CLEF Lab on Answer Retrieval for Questions on Math
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_271.pdf
|volume=Vol-2696
|authors=Richard Zanibbi,Douglas Oard,Anurag Agarwal,Behrooz Mansouri
|dblpUrl=https://dblp.org/rec/conf/clef/ZanibbiOAM20a
}}
==Overview of ARQMath 2020 (Updated Working Notes Version): CLEF Lab on Answer Retrieval for Questions on Math==
Richard Zanibbi¹, Douglas W. Oard², Anurag Agarwal¹, and Behrooz Mansouri¹
¹ Rochester Institute of Technology (USA), {rxzvcs,axasma,bm3302}@rit.edu
² University of Maryland, College Park (USA), oard@umd.edu

Abstract. The ARQMath Lab at CLEF considers finding answers to new mathematical questions among posted answers on a community question answering site (Math Stack Exchange). Queries are question posts held out from the searched collection, each containing both text and at least one formula. This is a challenging task, as both math and text may be needed to find relevant answer posts. ARQMath also includes a formula retrieval sub-task: individual formulas from question posts are used to locate formulae in earlier question and answer posts, with relevance determined considering the context of the post from which a query formula is taken, and the posts in which retrieved formulae appear.

Keywords: Community Question Answering (CQA), Mathematical Information Retrieval, Math-aware search, Math formula search

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

In a recent study, Mansouri et al. found that 20% of mathematical queries in a general-purpose search engine were expressed as well-formed questions, a rate ten times higher than that for all queries submitted [14]. Results such as these, and the presence of Community Question Answering (CQA) sites such as Math Stack Exchange (https://math.stackexchange.com), suggest there is interest in finding answers to mathematical questions posed in natural language, using both text and mathematical notation. Related to this, there has also been increasing work on math-aware information retrieval and math question answering in both the Information Retrieval (IR) and Natural Language Processing (NLP) communities.

Table 1. Examples of relevant and not-relevant results for Tasks 1 and 2 [12]. For Task 2, formulas are associated with posts, indicated with ellipses (see Figure 1 for more details). Query formulae are from question posts (here, the Task 1 question), and retrieved formulae are from either an answer or a question post.

Task 1: Question Answering
- Question (containing the formula $\lim_{n \to +\infty} n^{\frac{1}{n}}$): "I have spent the better part of this day trying to show from first principles that this sequence tends to 1. Could anyone give me an idea of how I can approach this problem?"
- Relevant answer: "You can use AM ≥ GM. $\frac{1+1+\cdots+1+\sqrt{n}+\sqrt{n}}{n} \geq n^{1/n} \geq 1$, $1-\frac{2}{n}+\frac{2}{\sqrt{n}} \geq n^{1/n} \geq 1$"
- Not-relevant answer: "If you just want to show it converges, then the partial sums are increasing but the whole series is bounded above by $1+\int_1^{\infty} \frac{1}{x^2}\,dx = 2$"

Task 2: Formula Retrieval
- Query formula: $\lim_{n \to +\infty} n^{\frac{1}{n}}$
- Relevant: ... $\lim_{n \to \infty} \sqrt[n]{n}$ ...
- Not relevant: ... $\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}$ ...

In light of this growing interest, we organized this new lab at the Conference and Labs of the Evaluation Forum (CLEF) on Answer Retrieval for Questions about Math (ARQMath, https://www.cs.rit.edu/~dprl/ARQMath). Using the formulae and text in posts from Math Stack Exchange, participating systems are given a question, and asked to return a ranked list of potential answers. Relevance is determined by how well each returned post answers the provided question. Through this task we explore leveraging math notation together with text to improve the quality of retrieval results.
This is one case of what we generically call math retrieval, in which the focus is on leveraging the ability to process mathematical notation to enhance, rather than to replace, other information retrieval techniques. We also included a formula retrieval task, in which relevance is determined by how useful a retrieved formula is for the searcher's intended purpose, as best could be determined from the query formula's associated question post. Table 1 illustrates these two tasks, and Figure 1 shows the topic format for each task.

For the CQA task, 70,342 questions from 2019 that contained some text and at least one formula were considered as search topics, from which 77 were selected as test topics. Participants had the option to run queries using only the text or math portions of each question, or to use both math and text. One challenge inherent in this design is that the expressive power of text and formulae are sometimes complementary; so although all topics include both text and formula(s), some may be better suited to text-based or math-based retrieval.

Fig. 1. XML topic file formats for Tasks 1 and 2 (figure content summarized). A Task 1 topic contains Title, Question, and Tags fields from an MSE question post; in the example shown, the Title is "Simplifying this series", the Question asks how to write the series $\sum_{n=0}^N nx^n$ in a form that does not involve summation notation (giving $\sum_{i=0}^n i^2 = \frac{(n^2+n)(2n+1)}{6}$ as an example), and the Tags field is "sequences-and-series". The corresponding Task 2 topic adds a Formula_Id field (here q_52) and a Latex field ($\sum_{n=0}^N nx^n$). Formula queries in Task 2 are taken from questions in Task 1: here, formula topic B.9 is a copy of question topic A.9 with the two additional tags for the query formula identifier and LaTeX placed before the question post. (A minimal parsing sketch for this format is given at the end of this section.)

For the formula search task, an individual formula is used as the query, and systems return a ranked list of other potentially useful instances of formulae found in the collection. Each of the 45 queries is a single formula extracted from a question used in the CQA task.

Mathematical problem solving was amongst the earliest applications of Artificial Intelligence, such as Newell and Simon's work on automatic theorem proving [15]. More recent work in math problem solving includes systems that solve algebraic word problems while providing a description of the solution method [11], and systems that solve algebra word problems expressed in text and math [10]. The focus of ARQMath is different; rather than prove or solve concrete mathematical problems, we instead look to find answers to informal, and potentially open-ended and incomplete, questions posted naturally in a CQA setting.

The ARQMath lab provides an opportunity to push mathematical question answering in a new direction, where answers provided by a community are selected and ranked rather than generated. We aim to produce test collections, and to drive innovation in both evaluation methods and the development of math-aware information retrieval systems. An additional goal is welcoming new researchers to work together on these challenging problems.
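To make the topic format sketched in Figure 1 concrete, the following is a minimal parsing sketch in Python. It is not the loader code distributed by the organizers; the tag and attribute names (a Topics root, Topic elements with a number attribute, and Title, Question, Tags, Formula_Id, and Latex children) are assumptions taken from Figure 1 and the descriptions in Sections 4.1 and 5.1, and should be checked against the released files.

```python
import xml.etree.ElementTree as ET

def load_topics(path):
    """Parse an ARQMath topic file into a list of dicts.

    Assumes the layout sketched in Figure 1: a <Topics> root holding
    <Topic number="..."> elements with <Title>, <Question>, and <Tags>
    children; Task 2 topics additionally carry <Formula_Id> and <Latex>.
    """
    topics = []
    root = ET.parse(path).getroot()
    for topic in root.findall("Topic"):
        entry = {
            "id": topic.get("number"),
            "title": topic.findtext("Title", default=""),
            "question": topic.findtext("Question", default=""),
            "tags": topic.findtext("Tags", default=""),
        }
        # Present only in Task 2 (formula retrieval) topic files.
        formula_id = topic.findtext("Formula_Id")
        if formula_id is not None:
            entry["formula_id"] = formula_id
            entry["latex"] = topic.findtext("Latex", default="")
        topics.append(entry)
    return topics
```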
2 Related Work

The Mathematical Knowledge Management (MKM) research community is concerned with the representation, application, and search of mathematical information. Among other accomplishments, their activities informed the development of MathML (https://www.w3.org/Math) for math on the Web, and novel techniques for math representation, search, and applications such as theorem proving. This community continues to meet annually at the CICM conferences [8].

Math-aware search (sometimes called Mathematical Information Retrieval) has seen growing interest over the past decade. Math formula search has been studied since the mid-1990s for use in solving integrals, and publicly available math+text search engines have been around since the DLMF (https://dlmf.nist.gov) system in the early 2000s [6, 21]. The most widely used evaluation resources for math-aware information retrieval were initially developed over a five-year period at the National Institute of Informatics (NII) Testbeds and Community for Information access Research (at NTCIR-10 [1], NTCIR-11 [2], and NTCIR-12 [20]). NTCIR-12 used two collections: a set of arXiv papers from physics split into paragraph-sized documents, and a set of articles from English Wikipedia. The NTCIR Mathematical Information Retrieval (MathIR) tasks developed evaluation methods and allowed participating teams to establish baselines for both "text + math" queries (i.e., keywords and formulas) and isolated formula queries.

A recent math question answering task was held for SemEval 2019 [7]. Question sets from MathSAT (Scholastic Achievement Test) practice exams in three categories were used: Closed Algebra, Open Algebra, and Geometry. A majority of the questions were multiple choice, with some having numeric answers. This is a valuable parallel development; the questions considered in the CQA task of ARQMath are more informal and open-ended, and are selected from actual MSE user posts (a larger and less constrained set).

At NTCIR-11 and NTCIR-12, formula retrieval was considered in a variety of settings, including the use of wildcards and constraints on symbols or subexpressions (e.g., requiring matched argument symbols to be variables or constants). Our Task 2, Formula Retrieval, has similarities in design to the NTCIR-12 Wikipedia Formula Browsing task, but differs in how queries are defined and how evaluation is performed. In particular, for evaluation ARQMath uses the visually distinct formulas in a run, rather than all (possibly identical) formula instances, as had been done in NTCIR-12. The NTCIR-12 formula retrieval test collection also had a smaller number of queries, with 20 fully specified formula queries (plus 20 variants of those same queries with subexpressions replaced by wildcard characters). NTCIR-11 also had a formula retrieval task, with 100 queries, but in that case systems searched only for exact matches [19].

Over the years, the size of the NTCIR-12 formula browsing task topic set has limited the diversity of examples that can be studied, and made it difficult to measure statistically significant differences in formula retrieval effectiveness. To support research that is specifically focused on formula similarity measures, we have created a formula search test collection that is considerably larger, and in which the definition of relevance derives from the specific task for which retrieval is being performed, rather than from isolated formula queries.
3 The ARQMath 2020 Math Stack Exchange Collection

In this section we describe the raw data from which we started, collection processing, and the resulting test collection that was used in both tasks. Topic development for each task is described in the two subsequent sections.

3.1 MSE Internet Archive Snapshot

We chose Math Stack Exchange (MSE), a popular community question answering site, as the collection to be searched. The Internet Archive provides free public access to MSE snapshots (https://archive.org/download/stackexchange). We processed the 01-March-2020 snapshot, which in its original form contained the following in separate XML files:

– Posts: Each MSE post has a unique identifier, and can be a question or an answer, identified by a 'post type id' of 1 or 2, respectively. Each question has a title and a body (the content of the question), while answers have only a body. Each answer has a 'parent id' that associates it with the question it answers. Other information is available for each post, including its score, the post owner id, and the creation date.
– Comments: MSE users can comment on posts. Each comment has a unique identifier and a 'post id' indicating which post the comment was written for.
– Post links: Moderators sometimes identify duplicate or related questions that have been previously asked. A 'post link type id' of value 1 indicates related posts, while value 3 indicates duplicates.
– Tags: Questions can have one or more tags describing the subject matter of the question.
– Votes: While the post score shows the difference between up and down votes, there are other vote types such as 'offensive' or 'spam.' Each vote has a 'vote type id' for the vote type and a 'post id' for the associated post.
– Users: Registered MSE users have a unique id, and they can provide additional information such as their website. Each user has a reputation score, which may be increased through activities such as posting a high-quality answer, or posting a question that receives up votes.
– Badges: Registered MSE users can also receive three badge types: bronze, silver, and gold. The 'class' attribute gives the badge type, with value 3 indicating bronze, 2 silver, and 1 gold.

The edit history for posts and comments is also available, but for this edition of the ARQMath lab, edit history information was not used.

3.2 The ARQMath 2020 Test Collection

Because search topics are built from questions asked in 2019, all training and retrieval is performed on content from 2018 and earlier. We removed any data from the collection generated after the year 2018, using the 'creation date' available for each item. The final collection contains roughly 1 million questions and 28 million formulae.

Formulae. While MSE provides a 'math-container' HTML tag for some mathematical formulae, many are present only as a LaTeX string located between single or double '$' signs. Using the math-container tags and dollar-sign delimiters, we identified formulae in question posts, answer posts, and comments. Every identified instance of a formula was assigned a unique identifier, and then placed in an HTML tag of the form <span class="math-container" id="FID">...</span>, where FID is the formula id. Overall, 28,320,920 formulae were detected and annotated in this way.

Additional Formula Representations. Rather than use raw LaTeX, it is common for math-aware information retrieval systems to represent formulas as one or both of two types of rooted trees.
Appearance is represented by the spatial arrangement of symbols on writing lines, in Symbol Layout Trees (SLTs), and mathematical syntax (sometimes referred to as (shallow) semantics) is represented using a hierarchy of operators and arguments, in Operator Trees (OPTs) [5, 13, 23]. The standard representations for these are Presentation MathML (SLT) and Content MathML (OPT). To simplify the processing required of participants, and to maximize comparability across submitted runs, we used LaTeXML (https://dlmf.nist.gov/LaTeXML) to generate Presentation MathML and Content MathML from LaTeX for each formula in the ARQMath collection. Some LaTeX formulas were malformed and LaTeXML has some processing limitations, resulting in conversion failures for 8% of SLTs and 10% of OPTs. Participants could elect to do their own formula extraction and conversion, although the formulae that could be submitted in system runs for Task 2 were limited to those with identifiers in the LaTeX TSV file.

ARQMath formulae are provided in LaTeX, SLT, and OPT representations, as Tab Separated Value (TSV) index files. Each line of a TSV file represents a single instance of a formula, containing the formula id, the id of the post in which the formula instance appeared, the id of the thread in which the post is located, a post type (title, question, answer, or comment), and the formula representation in either LaTeX, SLT (Presentation MathML), or OPT (Content MathML). There are two sets of formula index files: one set is for the collection (i.e., the posts from 2018 and before), and the second set is for the search topics (see below), which are from 2019.

HTML Question Threads. HTML views of threads, similar to those on the MSE web site (a question, along with answers and other related information), are also included in the ARQMath test collection. The threads are constructed automatically from the MSE snapshot XML files described above. The threads are intended for use by teams who performed manual runs, or who wished to examine search results (on queries other than evaluation queries) for formative evaluation purposes. These threads were also used by assessors during evaluation. The HTML thread files were intended only for viewing threads; participants were asked to use the provided XML and formula index files (described above) to train their models.

Distribution. The MSE test collection was distributed to participants as XML files on Google Drive (https://drive.google.com/drive/folders/1ZPKIWDnhMGRaPNVLi1reQxZWTfH2R4u3). To facilitate local processing, the organizers provided Python code on GitHub (https://github.com/ARQMath/ARQMathCode) for reading and iterating over the XML data, and for generating the HTML question threads.

4 Task 1: Answer Retrieval

The primary task for ARQMath 2020 was the answer retrieval task, in which participants were presented with a question that had actually been asked on MSE in 2019, and were asked to return a ranked list of up to 1,000 answers from prior years (2010-2018). System results ('runs') were evaluated using rank quality measures (e.g., nDCG′), so this is a ranking task rather than a set retrieval task, and participating teams were not asked to say where the searcher should stop reading. This section describes, for Task 1, the search topics (i.e., the questions), the submissions and baseline systems, the process used for creating relevance judgments, the evaluation measures, and the results.

4.1 Topics

In Task 1, participants were given 101 questions as search topics, of which 3 were training examples. These questions were selected from questions asked on MSE in 2019.
Because we wished to support experimentation with retrieval systems that use text, math, or both, we chose only from the 2019 questions that contain some text and at least one formula. Because ranking quality measures can distinguish between systems only on topics for which relevant documents exist, we calculated the number of duplicate and related posts for each question and chose only from those that had at least one duplicate or related post (note that participating systems did not have access to this information). Because we were interested in a diverse range of search tasks, we also calculated the number of formulae and Flesch's Reading Ease score [9] for each question. Finally, we noted the asker's reputation and the tags assigned for each question. We then manually drew a sample of 101 questions that was stratified along those dimensions. In the end, 77 of these questions were evaluated and included in the test collection.

The topics were selected from various domains (real analysis, calculus, linear algebra, discrete mathematics, set theory, number theory, etc.) that represent a broad spectrum of areas in mathematics that might be of interest to expert or non-expert users. The difficulty level of the topics spanned from easy problems that a beginning undergraduate student might be interested in to difficult problems that would be of interest to more advanced users. The bulk of the topics were aimed at the level of undergraduate math majors (in their 3rd or 4th year) or engineering majors fulfilling their math requirements. Some topics had simple formulae; others had fairly complicated formulae with subscripts, superscripts, and special symbols, such as the double integral $\iint_V f(x,y)\,dx\,dy$ or binomial coefficients such as $\binom{n}{r}$. Some topics were primarily based on computational steps, and some asked about proof techniques (making extensive use of text). Some topics involved named theorems or concepts (e.g., the Cesàro-Stolz theorem, the Axiom of Choice).

As organizers, we labeled each question with one of three broad categories: computation, concept, or proof. Of the 77 assessed questions, 26 were categorized as computation, 10 as concept, and 41 as proof. We also categorized the questions based on their perceived difficulty level, with 32 categorized as easy, 21 as medium, and 24 as hard.

The topics were published as an XML file with the format shown in Figure 1, where the topic number is an attribute of the Topic tag, and the Title, Question, and asker-provided Tags are from the MSE question post. To facilitate system development, we provided Python code that participants could use to load the topics. As in the collection, the formulae in the topic file are placed in 'math-container' tags, with each formula instance being represented by a unique identifier and its LaTeX representation. And, as with the collection, we provided three TSV files, one each for the LaTeX, OPT, and SLT representations of the formulae, in the same format as the collection's TSV files.

4.2 Runs Submitted by Participating Teams

Participating teams submitted runs using Google Drive. A total of 18 runs were received from a total of 5 teams.
Of these, 17 runs were declared as automatic, meaning that queries were automatically processed from the topic file, that no changes to the system had been made after seeing the queries, and that ranked lists for each query were produced with no human intervention. One run was declared as manual, meaning that there was some type of human involvement in generating the ranked list for each query. Manual runs can contribute diversity to the pool of documents that are judged for relevance, since their error characteristics typically differ from those of automatic runs. All submitted runs used both text and formulae. The teams and submissions are shown in Table 2. Please see the participant papers in the working notes for descriptions of the systems that generated these runs. Of the 17 runs declared as automatic, two were in fact manual runs (for ZBMath, see Table 2).

Table 2. Submitted runs for Task 1 (18 runs) and Task 2 (11 runs). Additional baselines for Task 1 (5 runs) and Task 2 (1 run) were also generated by the organizers.

                              Automatic              Manual
                          Primary  Alternate    Primary  Alternate
Task 1: Question Answering
  Baselines                  4        -            1        -
  DPRL                       1        3            -        -
  MathDowsers                1        3            1        -
  MIRMU                      3        2            -        -
  PSU                        1        2            -        -
  ZBMath                     1        -            -        -
Task 2: Formula Retrieval
  Baseline                   1        -            -        -
  DPRL                       1        3            -        -
  MIRMU                      2        3            -        -
  NLP-NIST                   1        -            -        -
  ZBMath                     1        -            -        -

4.3 Baseline Runs

As organizers, we ran five baseline systems for Task 1. The first baseline is a TF-IDF (term frequency-inverse document frequency) model using the Terrier system [17]. In the TF-IDF baseline, formulae are represented using their LaTeX strings. The second baseline is Tangent-S, a formula search engine using SLT and OPT formula representations [5]. One formula was selected from each Task 1 question title if possible; if there was no formula in the title, then one formula was instead chosen from the question's body. If there were multiple formulae in the selected field, the formula with the largest number of nodes in its SLT representation was chosen. Finally, if there were multiple formulae with the highest number of nodes, one of these was chosen randomly. The third baseline is a linear combination of TF-IDF and Tangent-S results. To create this combination, the relevance scores from both systems were first normalized to the range 0 to 1 using min-max normalization, and the two normalized scores were then combined using an unweighted average.

The TF-IDF baseline used default parameters in Terrier. The second baseline (Tangent-S) retrieves formulae independently for each representation, and then linearly combines SLT and OPT scoring vectors for retrieved formulae [5]. For ARQMath, we used the average weight vector from cross-validation results obtained on the NTCIR-12 formula retrieval task. The fourth baseline was the ECIR 2020 version of the Approach0 text + math search engine [22], using queries manually created by the third and fourth authors. This baseline was not available in time to contribute to the judgment pools and was thus scored post hoc.

Table 3. Retrieval time in seconds for Task 1 baseline systems.

System               Min (Topic)     Max (Topic)       Avg      StDev
TF-IDF (Terrier)     0.316 (A.42)    1.278 (A.1)       0.733    0.168
Tangent-S            0.152 (A.72)    160.436 (A.60)    6.002    18.496
TF-IDF + Tangent-S   0.795 (A.72)    161.166 (A.60)    6.740    18.483
Approach0            0.007 (A.3)     91.719 (A.5)      17.743   18.789

The final baseline was built from duplicate post links from 2019 in the MSE collection (which were not available to participants).
This baseline returns all answer posts from 2018 or earlier that were in threads from 2019 or earlier that MSE moderators had marked as duplicating the question post in a topic. The posts are sorted in descending order by their vote scores.

Performance. Table 3 shows the minimum, maximum, average, and standard deviation of retrieval times for each of the baseline systems. All baselines were run on a system with 528 GB of RAM and an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz.

4.4 Assessment

Pooling. Participants were asked to rank 1,000 (or fewer) answer posts for each Task 1 topic. Top-k pooling was then performed to create pools of answer posts to be judged for relevance to each topic. The top 50 results were combined from all 7 primary runs, 4 baselines, and 1 manual run. To this, we added the top 20 results from each of the 10 automatic alternate runs. Duplicates were then deleted, and the resulting pool was sorted in random order for display to assessors. The pooling process is illustrated in Figure 2. This process was designed to identify as many relevant answer posts as possible given the available assessment resources. On average, pools contained about 500 answers per topic.

Relevance definition. Some questions might offer clues as to the level of mathematical knowledge on the part of the person posing the question; others might not. To avoid the need for the assessor to guess about the level of mathematical knowledge available to the person interpreting the answer, we asked assessors to base their judgments on the degree of usefulness for an expert (modeled in this case as a math professor) who might then try to use that answer to help the person who had asked the original question. We defined four levels of relevance, as shown in Table 4.

Assessors were allowed to consult external sources on their own in order to familiarize themselves with the topic of a question, but the relevance judgments for each answer post were performed using only information available within the collection. For example, if an answer contained an MSE link such as https://math.stackexchange.com/questions/163309/pythagorean-theorem, they could follow that link to better understand the intent of the person writing the answer, but an external link to the Wikipedia page https://en.wikipedia.org/wiki/Pythagorean_theorem would not be followed.

Table 4. Relevance scores, ratings, and definitions for Tasks 1 and 2.

Task 1: Question Answering
  3  High          Sufficient to answer the complete question on its own
  2  Medium        Provides some path towards the solution. This path might come from clarifying the question, or identifying steps towards a solution
  1  Low           Provides information that could be useful for finding or interpreting an answer, or interpreting the question
  0  Not Relevant  Provides no information pertinent to the question or its answers. A post that restates the question without providing any new information is considered non-relevant

Task 2: Formula Retrieval
  3  High          Just as good as finding an exact match to the query formula would be
  2  Medium        Useful but not as good as the original formula would be
  1  Low           There is some chance of finding something useful
  0  Not Relevant  Not expected to be useful

Training Set. The fourth author created a small set of relevance judgment files for three topics.
We used duplicate question links to find possibly relevant answers, and then performed relevance judgments on the same 0, 1, 2, and 3 scale that was later used by the assessors. We referred to this as a 'training set,' although in practice such a small collection is at best a sanity check to see if systems were producing reasonable results. Moreover, these relevance judgments were performed before assessor training had been conducted, and thus the definition of relevance used by the fourth author may have differed in subtle ways from the definitions on which the assessors later settled.

Fig. 2. Pooling procedures. For Task 1, the pool depth for baselines, primary, and manual runs is 50, and for alternate runs it is 20. For Task 2, the pool depth is the rank at which k visually distinct formulae are observed (k = 25 for the baseline and primary runs, k = 10 for alternate runs).

Assessment System. Assessments were performed using Turkle (https://github.com/hltcoe/turkle), a locally installed system with functionality similar to Amazon Mechanical Turk. Turkle uses an HTML task template file, plus a Comma Separated Value (CSV) file to fill the HTML template for each topic. Each row in the CSV file contains the question title, body, and the retrieved answer to be judged. Judgments are exported as CSV files.

As Figure 6 (at the end of this document) illustrates, there were two panels in the Turkle user interface. The question was shown in the left panel, with the Title on top (in a grey bar); below that was the question body. There was also a Thread link, on which assessors could click to look at the MSE post in context, with the question and all of the answers that were actually given for this question (in 2019). This could help the assessor to better understand the question. In the right panel, the answer to be judged was shown at the top. As with the question, there was a thread link where the assessors could click to see the original thread in which the answer post being judged had appeared on MSE. This could be handy when the assessors wanted to see details such as the question that had been answered at the time. Finally, the bottom of the right panel (below the answer) was where assessors selected relevance ratings.

In addition to the four levels of relevance, two additional choices were available. 'System failure' indicated system issues such as unintelligible rendering of formulae, or the thread link not working (when it was essential for interpretation). If, after viewing the threads, the assessors were still not able to decide the relevance degree, they were asked to choose 'Do not know'. The organizers asked the assessors to leave a comment in the event of a 'System failure' or a 'Do not know' selection.

Assessor Training. Eight paid assessors, either undergraduate mathematics students or (in three cases) recent graduates with an undergraduate mathematics degree, performed the relevance judgments. Four rounds of training were performed before submissions from participating teams had been received.
In the first round, assessors met online via Zoom with the organizers, one of whom (the third author) is an expert MSE user and a professor of mathematics. The task was explained, making reference to specific examples from the small training set. For each subsequent round, a small additional training set was created using a similar approach (pooling only answers to duplicate questions) with 8 actual Task 1 topics (for which the actual relevance judgments were not then known). The same 8 topics were assigned to every assessor and the assessors worked independently, thus permitting inter-annotator agreement measures to be computed. Each training round was followed by an online meeting with the organizers at which assessors were shown cases in which one or more assessor pairs disagreed. They discussed the reasoning for their choices, with the third author offering reactions and their own assessment. These training judgments were not used in the final collection, but the same topic could later be reassigned to one of the assessors to perform judgments on a full pool.

Fig. 3. Inter-annotator agreement (Fleiss' kappa) over 8 assessors during Task 1 training (8 topics per round); four-way classification (gray) and two-way (H+M binarized) classification (black).

Some of the question topics would not typically be covered in regular undergraduate courses, so the assessors needed to gain a basic understanding of those topics before they could do the assessment. The assessors found the question threads made available in the Turkle interface helpful in this regard (see Figure 6). Through this process the formal definition of each relevance level in Table 4 was sharpened, and we sought to help assessors internalize a repeatable way of making self-consistent judgments that were reasonable in the view of the organizers. Judging relevance is a task that calls for interpretation and the formation of a personal opinion, so it was not our goal to achieve identical decisions. We did, however, compute Fleiss' kappa for the three independently conducted rounds of training to check whether reasonable levels of agreement were being achieved. As Figure 3 shows, a kappa of 0.34 was achieved by the end of training on the four-way assessment task. Collapsing relevance to be binary, by considering high and medium as relevant and low and not-relevant as not-relevant (henceforth "H+M binarization"), yielded similar results. (H+M binarization corresponds to the definition of relevance usually used in the Text Retrieval Conference (TREC). The TREC definition is: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant. Only binary judgments ('relevant' or 'not relevant') are made, and a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document)." Source: https://trec.nist.gov/data/reljudge_eng.html.)

Assessment. A total of 80 questions were assessed for Task 1. Three judgment pools (for topics A2, A22, and A70) had zero or one posts with relevance ratings of High or Medium; these 3 topics were removed from the collection.

Fig. 4.
Inter-annotator agreement (Cohen's kappa) over 5 assessors after Task 1 assessment was completed. Each assessor evaluated two topics that had been scored by two of the other assessors. Shown are results for four-way classification (gray) and two-way (H+M binarized) classification (black). Results are provided for each individual Task 1 assessor (as an average kappa score), along with the average kappa values over all assessors at right ('Total').

Topics with no relevant posts cannot be used to distinguish between ranked retrieval systems, and topics with only a single relevant post result in coarsely quantized values for H+M binarized evaluation measures; that degree of quantization can adversely affect the ability to measure statistically significant differences. For the remaining 77 questions, an average of 508.5 answers were assessed per question, with an average assessment time of 63.1 seconds per answer post. The average number of answers labeled with any degree of relevance (high, medium, or low; henceforth "H+M+L binarization") was 52.9 per question, with the highest number of relevant answers being 188 (for topic A.38) and the lowest being 2 (for topic A.96).

Post Assessment. After the official assessments were complete for Task 1, each assessor was assigned two topics completed by two other assessors in order to calculate their agreement. As shown in Figure 4, across all five assessors ('Total') an average Cohen's kappa of 0.29 was achieved on the four-way assessment task, and using H+M binarization the average kappa value was 0.39. The individual assessors are reasonably similar (particularly in terms of four-way agreement), except for Assessor 4. Comparing Figures 3 and 4, we see that agreement was relatively stable between the end of training and after assessments were completed: after assessment was complete, a slightly lower four-way agreement but a higher H+M binarized agreement was obtained relative to the end of training.

4.5 Evaluation Measures

One risk when performing a new task for which rich training data is not yet available is that a larger than typical number of relevant answers may be missed. Measures that treat unjudged documents as not relevant can be used when directly comparing systems that contributed to the judgment pools, but subsequent use of such a first-year test collection (e.g., to train new systems for the second year of the lab) can be disadvantaged by treating unjudged documents (which, as systems improve, might actually be relevant) as not relevant. We therefore chose the nDCG′ measure (read as "nDCG-prime"), introduced by Sakai and Kando [18], as the primary measure for the task.

The nDCG measure on which nDCG′ is based is widely used when graded relevance judgments are available, as we have in ARQMath, and produces a single figure of merit over a set of ranked lists. Each retrieved document earns a gain value of 0, 1, 2, or 3, discounted by a slowly decaying function of the rank position of that document. The resulting discounted gain values are accumulated and then normalized to [0, 1] by dividing by the maximum possible Discounted Cumulative Gain (i.e., from all identified relevant documents, sorted in decreasing order of gain value). This results in normalized Discounted Cumulative Gain (nDCG). The only difference when computing nDCG′ is that unjudged documents are removed from the ranked list before performing the computation. A minimal sketch of this computation appears below.
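The following is a minimal illustration of nDCG′ for a single topic, assuming graded gains of 0-3 and the common log2 rank discount. It is a sketch of the idea rather than a re-implementation of trec_eval, whose exact conventions may differ in detail; the function and variable names are illustrative.

```python
from math import log2

def ndcg_prime(ranked_ids, qrels, depth=1000):
    """Compute nDCG' for one topic.

    ranked_ids: the system's ranking (best first) of answer post ids.
    qrels: dict mapping judged post ids to graded gains (0, 1, 2, 3).
    Unjudged posts are removed before computing nDCG, which is the only
    difference from ordinary nDCG.
    """
    judged = [doc for doc in ranked_ids if doc in qrels][:depth]
    dcg = sum(qrels[doc] / log2(rank + 2)          # rank 0 -> discount log2(2)
              for rank, doc in enumerate(judged))
    ideal_gains = sorted(qrels.values(), reverse=True)[:depth]
    idcg = sum(g / log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Example: three judged answers; the unjudged 'ans_9' is simply dropped.
qrels = {"ans_1": 3, "ans_2": 0, "ans_3": 2}
print(ndcg_prime(["ans_9", "ans_3", "ans_1", "ans_2"], qrels))  # ~0.91
```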
It has been shown that nDCG′ has somewhat better discriminative power and somewhat better system ranking stability (with judgment ablation) than the bpref measure [4] used recently for formula search (e.g., [13]). Moreover, nDCG′ yields a single-valued measure with graded relevance, whereas bpref, Precision@k, and Mean Average Precision (MAP) all require binarized relevance judgments. In addition to nDCG′, we also computed Mean Average Precision with unjudged posts removed (thus MAP′), and Precision at 10 posts (P@10). Pooling to at least depth 20 ensures that there are no unjudged posts above rank 10 for any primary or secondary submission, and for four of the five baselines; note, however, that P@10 cannot achieve a value of 1 because some topics have fewer than 10 relevant posts. For MAP′ and P@10 we used H+M binarization. We removed unjudged posts as a preprocessing step where required, and then computed the evaluation measures using trec_eval (https://github.com/usnistgov/trec_eval).

4.6 Results

Table A1 in the appendix shows the results, with baselines shown first, and then teams and their systems ranked by nDCG′. nDCG′ values can be interpreted as the average (over topics) of the fraction of the best possible score that was actually achieved. The best nDCG′ value achieved was 0.345, by the MathDowsers team. For measures computed using H+M binarization, MAP′ and P@10 generally show system comparison patterns similar to those of nDCG′, although with some differences in detail.

5 Task 2: Formula Retrieval

In the formula retrieval task, participants were presented with one formula from a 2019 question used in Task 1, and asked to return a ranked list of up to 1,000 formula instances from questions or answers from the evaluation epoch (2018 or earlier). Formulae were returned by their identifiers in math-container tags and in the companion TSV LaTeX formula index file, along with their associated post identifiers.

This task is challenging because someone searching for math formulae may have goals not evident from the formula itself. For example:

– They may be looking to learn what is known, to form connections between disciplines, or to discover solutions that they can apply to a specific problem.
– They may want to find formulae of a specific form, including details such as specific symbols that have significance in a certain context, or they may wish to find related work in which similar ideas are expressed using different notation. For example, the Schrödinger equation is written both as a wave equation and as a probability current (the former is used in physics, whereas the latter is used in the study of fluid flow).
– They may be happy to find formulae that contain only part of their formula query, or they may want only complete matches. For example, searching for $\sum_{i=1}^{n} u_i v_i$ could bring up the Cauchy-Schwarz inequality $\sum_{i=1}^{n} u_i v_i \leq \left(\sum_{i=1}^{n} u_i^2\right)^{\frac{1}{2}} \left(\sum_{i=1}^{n} v_i^2\right)^{\frac{1}{2}}$.

For these reasons (among others), it is difficult to formulate relevance judgments for retrieved formulae without access to the context in which the formula query was posed, and to the context in which each formula instance returned as a potentially useful search result was expressed.

Three key details differentiate Task 2 from Task 1. First, in Task 1 only answer posts were returned, whereas for Task 2 the formulae may appear in answer posts or in question posts.
Second, for Task 2 we distinguish visually distinct formulae from instances of those formulae, and evaluate systems based on the ranking of the visually distinct formulae that they return. We call formulae appearing in posts formula instances, and of course the same formula may appear in more than one post. By formula, we mean a set of formula instances that are visually identical when viewed in isolation. For example, $x^2$ is a formula, $x * x$ is a different formula, and each time $x^2$ appears is a distinct instance of the formula $x^2$. Systems in Task 2 rank formula instances in order to support the relevance judgment process, but the evaluation measure for Task 2 is based on the ranking of visually distinct formulae. The third difference between Task 1 and Task 2 is that in Task 2 the goal is not answering questions, but rather to show the searcher formulae that might be useful to them as they seek to satisfy their information need. Task 2 is thus still grounded in the question, but the relevance of a retrieved formula is defined by the formula's expected utility, not just the post in which that one formula instance was found.

As with Task 1, ranked lists were evaluated using rank quality measures, making this a ranking task rather than a set retrieval task. Unlike Task 1, whose design was novel, a pre-existing training set for a similar task (the NTCIR-12 Wikipedia Formula Browsing task test collection [20]) was available to participants. However, we note that the definition of relevance used in Task 2 differs from the definition of relevance in the NTCIR-12 task. This section describes, for Task 2, the search topics, the submissions and baselines, the process used for creating relevance judgments, the evaluation measures, and the results.

5.1 Topics

In Task 2, participating teams were given 87 mathematical formulae, each found in a different question from Task 1 (from 2019), and they were asked to find relevant formula instances from either question or answer posts in the test collection (from 2018 and earlier). The topics for Task 2 were provided in an XML file similar to those of Task 1, in the format shown in Figure 1. Task 2 topics differ from their corresponding Task 1 topics in three ways:

1. Topic number: For Task 2, topic ids have the form "B.x", where x is the topic number. There is a correspondence between topic ids in Tasks 1 and 2: for instance, topic id "B.9" indicates that the formula is selected from topic "A.9" in Task 1, and both topics include the same question post (see Figure 1).
2. Formula_Id: This added field specifies the unique identifier for the query formula instance. There may be other formulae in the Title or Body of the question post, but the query is only the formula instance specified by this Formula_Id.
3. LaTeX: This added field is the LaTeX representation of the query formula instance as found in the question post.

Because query formulae are drawn from Task 1 question posts, the same LaTeX, SLT, and OPT TSV files that were provided for the Task 1 topics can be consulted when SLT or OPT representations for a query formula are needed.

Formulae for Task 2 were manually selected using a heuristic approach to stratified sampling over three criteria: complexity, elements, and text dependence. Formula complexity was labeled low, medium, or high by the fourth author. For example, $\frac{df}{dx} = f(x+1)$ is low complexity, $\sum_{k=0}^{n} \binom{n}{k} k$ is medium complexity, and $x - \frac{x^3}{3 \times 3!} + \frac{x^5}{5 \times 5!} - \frac{x^7}{7 \times 7!} + \cdots = \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1) \times (2n+1)!}$ is high complexity.
Mathematical elements such as limits, integrals, fractions, or matrices were manually noted by the fourth author when present. Text dependence reflected the fourth author's opinion of the degree to which text in the Title and Question fields was likely to yield related search results. For instance, for one Task 2 topic the query formula is $\frac{df}{dx} = f(x+1)$, whereas the complete question is: "How to solve differential equations of the following form: $\frac{df}{dx} = f(x+1)$." When searching for this formula, perhaps the surrounding text could safely be ignored. At most one formula was selected from each Task 1 question topic to produce Task 2 topics. In 12 cases, it was decided that no formula in a question post would be a useful query for Task 2, and thus 12 Task 1 queries have no corresponding Task 2 query.

5.2 Runs Submitted by Participating Teams

A total of 11 runs were received for Task 2 from a total of 4 teams, as shown in Table 2. All were automatic runs. Each run contains at most 1,000 formula instances for each query formula, ranked in decreasing order of system-estimated relevance to that query. For each formula instance in a ranked list, participating teams provided the formula_id and the associated post_id for that formula. Please see the participant papers in the working notes for descriptions of the systems that generated these runs.

5.3 Baseline Runs

We again used Tangent-S [5] as our baseline. Unlike Task 1, a single formula is specified for each Task 2 query, so no formula selection step was needed. This Tangent-S baseline makes no use of the question text.

Performance. For the Tangent-S baseline, the minimum retrieval time was 0.238 seconds (for topic B.3), and the maximum retrieval time was 30.355 seconds (for topic B.51). The average retrieval time over all queries was 3.757 seconds, with a standard deviation of 5.532 seconds. The same system configuration was used as in Task 1.

5.4 Assessment

Pooling. The retrieved items for Task 2 are formula instances, but pooling was done based on visually distinct formulae, not formula instances (see Figure 2). This was done by first clustering all formula instances from all submitted runs to identify visually distinct formulae, and then proceeding down each list until at least one instance of some number of different formulae had been seen. For primary runs and for the baseline run, the pool depth was the rank of the first instance of the 25th visually distinct formula; for secondary runs, the pool depth was the rank of the first instance of the 10th visually distinct formula. Additionally, a pool depth of 1,000 (i.e., all available formulae) was used for any formula for which the associated answer post had been marked as relevant for Task 1. This was the only way in which the post ids for answer posts were used. (One team submitted incorrect post ids for retrieved formulae; those post ids were not used for pooling.)

Clustering visually distinct formula instances was performed using the SLT representation when possible, and the LaTeX representation otherwise. We first converted the Presentation MathML representation to a string representation using Tangent-S, which performed a depth-first traversal of the SLT, with each SLT node generating a single character of the SLT string. Formula instances with identical SLT strings were considered to be the same formula; note that this ignores differences in font. For formula instances with no Tangent-S SLT string available, we removed the white space from their LaTeX strings and grouped formula instances with identical strings. A minimal sketch of this grouping appears below.
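As an illustration of this clustering step, the following is a minimal sketch, not the code used by the organizers; the field names and the example SLT-string handling are assumptions for illustration only.

```python
from collections import defaultdict

def visually_distinct_groups(instances):
    """Group formula instances into visually distinct formulae.

    Each instance is a dict with 'formula_id', an optional 'slt_string'
    (the character string produced by a depth-first traversal of the SLT),
    and a 'latex' fallback. Instances sharing a key are treated as
    instances of the same visually distinct formula.
    """
    groups = defaultdict(list)
    for inst in instances:
        if inst.get("slt_string"):
            key = ("slt", inst["slt_string"])
        else:
            # Fallback: whitespace-stripped LaTeX string.
            key = ("latex", "".join(inst["latex"].split()))
        groups[key].append(inst["formula_id"])
    return groups

# Example: two LaTeX-only instances differing only in whitespace are grouped
# together; a visually different formula forms its own group.
instances = [
    {"formula_id": "f1", "slt_string": None, "latex": "x^{2}"},
    {"formula_id": "f2", "slt_string": None, "latex": "x^{2} "},
    {"formula_id": "f3", "slt_string": None, "latex": "x*x"},
]
print(visually_distinct_groups(instances))
# {('latex', 'x^{2}'): ['f1', 'f2'], ('latex', 'x*x'): ['f3']}
```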
This process is simple and appears to have been reasonably robust for our purposes, but it is possible that some visually identical formula instances were not captured, due to LaTeXML conversion failures, or where different LaTeX strings produce the same formula (e.g., if subscripts and superscripts appear in a different order).

Assessment was done on formula instances, so for each formula we selected at most five instances to assess. We selected the five instances that were contributed to the pools by the largest number of runs, breaking ties randomly. Out of 5,843 visually distinct formulae that were assessed, 93 (1.6%) had instances in more than 5 pooled posts.

Relevance definition. The relevance judgment task was defined for assessors as follows: for a formula query, if a search engine retrieved one or more instances of this retrieved formula, would that have been expected to be useful for the task that the searcher was attempting to accomplish? Assessors were presented with formula instances, and asked to decide their relevance by considering whether retrieving either that instance or some other instance of that formula would have been useful, assigning each formula instance in the judgment pool one of four scores as defined in Table 4.

For example, if the formula query was $\sum \frac{1}{n^2+\cos n}$, and the formula instance to be judged is $\sum_{n=1}^{\infty} \frac{1}{n^2}$, the assessors would decide whether finding the second formula rather than the first would be expected to yield good results. To do this, they would consider the content of the question post containing the query (and, optionally, the thread containing that question post) in order to understand the searcher's actual information need. Thus the question post fills a role akin to Borlund's simulated work task [3], although in this case the title, body, and tags from the question post are included in the topic and thus can optionally be used by the retrieval system. The assessor could also consult the post containing a retrieved formula instance (which may be another question post or an answer post), along with the associated thread, to see whether in that case the formula instance would indeed have been a useful basis for a search. Note, however, that the assessment task is not to determine whether the specific post containing the retrieved formula instance is useful, but rather to use that context as a basis for estimating the degree to which useful content would likely be found if this or other instances of the retrieved formula were returned by a search engine.

We then defined the relevance score for a formula to be the maximum relevance score for any judged instance of that formula. This relevance definition essentially asks "if instances of this formula were returned, would we reasonably expect some of those instances to be useful?" This definition of relevance might be used by system developers in several ways. One possibility is using Task 2 relevance judgments to train a formula matching component for use in a Task 1 system. A second possibility is using these relevance judgments to train and evaluate a system for interactively suggesting alternative formulae to users (see, for example, MathDeck [16], in which candidate formulae are suggested to users during formula editing).

Assessment System. As in Task 1, we used Turkle to build the assessment system. As shown in Figure 6 (at the end of this document), there are two main panels. In the left panel, the question is shown as in Task 1, but now with the formula query highlighted in yellow.
In the right panel, up to five retrieved posts (question posts or answer posts) containing instances of the same retrieved formula are displayed, with the retrieved formula instance highlighted in each case. For example, the formula $\sum_{n=1}^{\infty} a_n$ shown in Figure 6 was retrieved both in an answer post (shown first) and in a question post (shown second). As in Task 1, buttons are provided for the assessor to record their judgment; unlike Task 1, judgments for each instance of the same retrieved formula (up to 5) are recorded separately, and later used to produce a single (maximum) score for each visually distinct formula.

Assessor training. After some initial work on assessment for Task 1, 3 assessors were reassigned to perform relevance judgments for Task 2, with the remaining 5 continuing to do relevance judgments for Task 1. Two rounds of training were performed. In the first training round, the assessors were familiarized with the task. To illustrate how formula search might be used, we interactively demonstrated formula suggestion in MathDeck [16] and the formula search capability of Approach0 [23]. Then the task was defined using examples, showing a formula query with some retrieved results, and talking through the relevance definitions and how to apply those definitions in specific cases. During the training session, the assessors saw different example results for topics and discussed their relevance with the organizers, based on the criteria defined for them. They also received feedback from the third author, an expert MSE user. To prepare the judgment pools used for this purpose, we pooled actual submissions from participating teams, but only to depth 10 (i.e., 10 different formulae) for primary runs and the baseline run, and to depth 5 (5 different formulae) for alternate runs. The queries used for this initial assessor training were omitted from the final Task 2 query set on which systems were evaluated, because they were not judged on full-sized pools.

All three assessors were then assigned two complete Task 2 pools (for topics B.46 and B.98) to independently assess; these topics were not removed from the collection. After creating relevance judgments for these full-sized pools, the assessors and organizers met by Zoom to discuss and resolve disagreements. The assessors used this opportunity to refine their understanding of the relevance criteria, and the application of those criteria to specific cases. Annotator agreement was found to be fairly good (kappa = 0.83). An adjudicated judgment was recorded for each disagreement, and used in the final relevance judgment sets for these two topics. The assessors were then each assigned complete pools to judge for four topics, one of which was also assessed independently by a second assessor. The average kappa on the three dual-assessed topics was 0.47. After discussion between the organizers and the assessors, adjudicated disagreements were recorded and used in the final relevance judgments. The assessors then performed the remaining assessments for Task 2 independently.

Assessment. A total of 47 topics were assessed for Task 2. Two queries (B.58 and B.65) had fewer than two relevant answers after H+M binarization and were removed. Of the remaining 45 queries, an average of 125.0 formulae were assessed per topic, with an average assessment time of 38.1 seconds per formula.
The average number of formula instances labeled as relevant after H+M+L binarization was 43.1 per topic, with the highest being 115 (for topic B.60) and the lowest being 7 (for topics B.56 and B.32).

Fig. 5. Inter-annotator agreement (Cohen's kappa) over 3 assessors after official Task 2 assessment. Each annotator evaluated two topics completed by the other two annotators. Shown are four-way classification (gray) and two-way (H+M binarized) classification (black). Results are provided for each individual Task 2 assessor (as an average kappa score), along with the average kappa values over all assessors at right ('Total').

Post Assessment. As we did for Task 1, after assessment for Task 2 was completed, each of the three assessors was given two topics, one completed by each of the other two annotators. Figure 5 shows the Cohen's kappa coefficient values for each assessor, along with the total agreement over all of them. A kappa of 0.30 was achieved on the four-way assessment task, and with H+M binarization the average kappa value was 0.48. Interestingly, the post-assessment agreement between assessors is about the same as for Task 1 on the four-way task (0.29), but the H+M binarized agreement is almost 10% higher than for Task 1 (0.39). When asked, assessors working on Task 2 (who had all been previously trained on Task 1) reported finding Task 2 assessment to be easier. We note that there were fewer assessors working on Task 2 than on Task 1 (3 vs. 5 assessors).

Additional Training Topics. After the official assessment, to increase the size of the available dataset, an additional 27 topics were annotated. These are available in the ARQMath dataset, and can be used for training models. As a result, 74 topics have been published for Task 2.

5.5 Evaluation Measures

As for Task 1, the primary evaluation measure for Task 2 is nDCG′, and MAP′ and P@10 were also computed. Participants submitted ranked lists of formula instances, but we computed these measures over visually distinct formulae. To do this, we replaced each formula instance with its associated visually distinct formula, then deduplicated from the top of the list downward to obtain a ranked list of visually distinct formulae, and then computed the evaluation measures. As explained above, the relevance score for each visually distinct formula was computed as the maximum score over each assessed instance of that formula.

5.6 Results

Table A2 in the appendix shows the results, with the baseline run shown first, and then teams and their systems ranked by nDCG′. No team did better than the baseline system as measured by nDCG′ or MAP′, although the DPRL team did achieve the highest score for P@10.

6 Conclusion

The ARQMath lab is the first shared-task evaluation exercise to explore Community Question Answering (CQA) for mathematical questions. Additionally, the lab introduced a new formula retrieval task in which both the query and retrieved formulae are considered within the context of their question or answer posts, and evaluation is performed using visually distinct formulas rather than all formulas returned in a run. For both tasks, we used posts and associated data from the Math Stack Exchange (MSE) CQA forum. To reduce assessor effort and obtain a better understanding of the relationship between mathematical CQA and formula search, the formulae used as formula search topics were selected from the Task 1 (CQA) question topics.
5.6 Results

Table A2 in the appendix shows the results, with the baseline run shown first, and then teams and their systems ranked by nDCG′. No team did better than the baseline system as measured by nDCG′ or MAP′, although the DPRL team did achieve the highest score for P@10.

6 Conclusion

The ARQMath lab is the first shared-task evaluation exercise to explore Community Question Answering (CQA) for mathematical questions. Additionally, the lab introduced a new formula retrieval task in which both the query and retrieved formulae are considered within the context of their question or answer posts, and evaluation is performed using visually distinct formulae rather than all formulae returned in a run. For both tasks, we used posts and associated data from the Math Stack Exchange (MSE) CQA forum. To reduce assessor effort and obtain a better understanding of the relationship between mathematical CQA and formula search, the formulae used as formula search topics were selected from the Task 1 (CQA) question topics. This allowed us to increase coverage for the formula retrieval task by using relevant posts found in the CQA evaluations as candidates for assessment. To enrich the judgment pools for the CQA task, we added answer posts from the original topic question thread and from threads identified as duplicate questions by the MSE moderators.

In total, 6 teams submitted 29 runs: 5 teams submitted 18 runs for the CQA task (Task 1), and 4 teams submitted 11 runs for the formula retrieval task (Task 2). We thus judge the first year of the ARQMath lab to be successful. Each of these teams had some prior experience with math-aware information retrieval; in future editions of the lab we hope to further broaden participation, particularly from the larger IR and NLP communities.

Our assessment effort was substantial: 8 paid upper-year or recently graduated undergraduate math students worked with us for over a month, and underwent training in multiple phases. Our training procedure gave our assessors an opportunity to provide feedback on relevance definitions, the assessment interface, and best practices for assessment. In going through this process, we learned that 1) the CQA task is much harder to assess than the formula retrieval task, as identifying non-relevant answers requires more careful study than identifying non-relevant formulae; 2) the breadth of mathematical expertise needed for the CQA task is very high, which led us to have assessors indicate which questions they wished to assess and to assign topics according to those preferences (leaving the 10 topics that no assessor requested unassessed); and 3) having an expert mathematician (in this case, a math professor) involved was essential for task design, clarifying relevance definitions, and improving assessor consistency.

To facilitate comparison with systems using ARQMath for benchmarking in the future, and to make use of our graded relevance assessments, we chose nDCG′ [18] as the primary measure for comparing systems. Additional metrics (MAP′ and Precision at 10) are also reported to provide a more complete picture of system differences. Overall, we found that systems submitted to the first ARQMath lab generally approached the task in similar ways, using both text and formulae for Task 1, and (with two exceptions) operating fully automatically. In future editions of the task, we hope to see a greater diversity of goals, with, for example, systems optimized for specific types of formulae, or systems pushing the state of the art in using text queries to find math. We might also consider supporting a broader range of more specialized investigations by, for example, creating subsets of the collection designed specifically for formula variants such as simplified forms or forms using notation conventions from different disciplines. Our present collection includes user-generated tags, but we might also consider defining a controlled tag set to indicate which of these types of results are desired.

Acknowledgements. Wei Zhong suggested using Math Stack Exchange for benchmarking, made Approach0 available for participants, and provided helpful feedback. Kenny Davila helped with the Tangent-S formula search results. We also thank our student assessors from RIT: Josh Anglum, Wiley Dole, Kiera Gross, Justin Haverlick, Riley Kieffer, Minyao Li, Ken Shultes, and Gabriella Wolf. This material is based upon work supported by the National Science Foundation (USA) under Grant No. IIS-1717997 and the Alfred P. Sloan Foundation under Grant No. G-2017-9827.
References

1. Aizawa, A., Kohlhase, M., Ounis, I.: NTCIR-10 math pilot task overview. In: NTCIR (2013)
2. Aizawa, A., Kohlhase, M., Ounis, I., Schubotz, M.: NTCIR-11 Math-2 task overview. In: NTCIR. vol. 11, pp. 88–98 (2014)
3. Borlund, P.: The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research 8(3), 8–3 (2003)
4. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 25–32 (2004)
5. Davila, K., Zanibbi, R.: Layout and semantics: Combining representations for mathematical formula search. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1165–1168 (2017)
6. Guidi, F., Coen, C.S.: A survey on retrieval of mathematical knowledge. In: CICM. Lecture Notes in Computer Science, vol. 9150, pp. 296–315. Springer (2015)
7. Hopkins, M., Le Bras, R., Petrescu-Prahova, C., Stanovsky, G., Hajishirzi, H., Koncel-Kedziorski, R.: SemEval-2019 Task 10: Math Question Answering. In: Proceedings of the 13th International Workshop on Semantic Evaluation (2019)
8. Kaliszyk, C., Brady, E.C., Kohlhase, A., Coen, C.S. (eds.): Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11617. Springer (2019)
9. Kincaid, J.P., Fishburne Jr, R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Tech. rep., Naval Technical Training Command, Millington, TN, Research Branch (1975)
10. Kushman, N., Artzi, Y., Zettlemoyer, L., Barzilay, R.: Learning to automatically solve algebra word problems. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014)
11. Ling, W., Yogatama, D., Dyer, C., Blunsom, P.: Program induction by rationale generation: Learning to solve and explain algebraic word problems. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017)
12. Mansouri, B., Agarwal, A., Oard, D., Zanibbi, R.: Finding old answers to new math questions: The ARQMath lab at CLEF 2020. In: European Conference on Information Retrieval (2020)
13. Mansouri, B., Rohatgi, S., Oard, D.W., Wu, J., Giles, C.L., Zanibbi, R.: Tangent-CFT: An embedding model for mathematical formulas. In: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR). pp. 11–18 (2019)
14. Mansouri, B., Zanibbi, R., Oard, D.W.: Characterizing searches for mathematical concepts. In: Joint Conference on Digital Libraries (2019)
15. Newell, A., Simon, H.: The logic theory machine - a complex information processing system. IRE Transactions on Information Theory (1956)
16. Nishizawa, G., Liu, J., Diaz, Y., Dmello, A., Zhong, W., Zanibbi, R.: MathSeer: A math-aware search interface with intuitive formula editing, reuse, and lookup. In: European Conference on Information Retrieval. pp. 470–475. Springer (2020)
17. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Johnson, D.: Terrier information retrieval platform. In: European Conference on Information Retrieval. pp. 517–519. Springer (2005)
18. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval 11(5), 447–470 (2008)
19. Schubotz, M., Youssef, A., Markl, V., Cohl, H.S.: Challenges of mathematical information retrieval in the NTCIR-11 Math Wikipedia Task. In: SIGIR. pp. 951–954. ACM (2015)
20. Zanibbi, R., Aizawa, A., Kohlhase, M., Ounis, I., Topic, G., Davila, K.: NTCIR-12 MathIR task overview. In: NTCIR (2016)
21. Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. International Journal on Document Analysis and Recognition (IJDAR) 15(4), 331–357 (2012)
22. Zhong, W., Rohatgi, S., Wu, J., Giles, C.L., Zanibbi, R.: Accelerating substructure similarity search for formula retrieval. In: ECIR (1). Lecture Notes in Computer Science, vol. 12035, pp. 714–727. Springer (2020)
23. Zhong, W., Zanibbi, R.: Structural similarity search for formulas using leaf-root paths in operator subtrees. In: European Conference on Information Retrieval. pp. 116–129. Springer (2019)

[Screenshot of the Turkle assessment interface: the formula query is shown at bottom-left, retrieved posts at right, with judgment buttons (High, Medium, Low, Not Relevant, System failure, Do not know) below each post.]

Fig. 6. Turkle Assessment Interface. Shown are hits for Formula Retrieval (Task 2). In the left panel, the formula query is highlighted. In the right panel, one answer post and one question post containing the same retrieved formula are shown. For Task 1, a similar interface was used, but without formula highlighting, and just one returned answer post viewed at a time.

A Appendix: Evaluation Results

Table A1. Task 1 (CQA) results, averaged over 77 topics. P indicates a primary run, M indicates a manual run, and (X) indicates a baseline pooled at the primary run depth. For P@10 and MAP′, H+M binarization was used. The best baseline results are in parentheses. * indicates that one baseline did not contribute to judgment pools.
                                 Run Type        Evaluation Measures
Team          Run                 Data  P    M   nDCG′    MAP′     P@10
Baselines     Linked MSE posts    n/a   (X)      (0.279)  (0.194)  (0.384)
              Approach0 *         Both       X    0.250    0.099    0.062
              TF-IDF + Tangent-S  Both  (X)       0.248    0.047    0.073
              TF-IDF              Text  (X)       0.204    0.049    0.073
              Tangent-S           Math  (X)       0.158    0.033    0.051
MathDowsers   alpha05noReRank     Both            0.345    0.139    0.161
              alpha02             Both            0.301    0.069    0.075
              alpha05translated   Both  X         0.298    0.074    0.079
              alpha05             Both  X         0.278    0.063    0.073
              alpha10             Both            0.267    0.063    0.079
PSU           PSU1                Both            0.263    0.082    0.116
              PSU2                Both  X         0.228    0.054    0.055
              PSU3                Both            0.211    0.046    0.026
MIRMU         Ensemble            Both            0.238    0.064    0.135
              SCM                 Both  X         0.224    0.066    0.110
              MIaS                Both  X         0.155    0.039    0.052
              Formula2Vec         Both            0.050    0.007    0.020
              CompuBERT           Both  X         0.009    0.000    0.001
DPRL          DPRL4               Both            0.060    0.015    0.020
              DPRL2               Both            0.054    0.015    0.029
              DPRL1               Both  X         0.051    0.015    0.026
              DPRL3               Both            0.036    0.007    0.016
zbMATH        zbMATH              Both  X    X    0.042    0.022    0.027

Table A2. Task 2 (Formula Retrieval) results, averaged over 45 topics and computed over deduplicated ranked lists of visually distinct formulae. P indicates a primary run, and (X) shows the baseline pooled at the primary run depth. For MAP′ and P@10, relevance was thresholded using H+M binarization. All runs were automatic. Baseline results are in parentheses.

                                       Evaluation Measures
Team       Run                Data  P   nDCG′    MAP′     P@10
Baseline   Tangent-S          Math  (X) (0.506)  (0.288)  (0.478)
DPRL       TangentCFTED       Math  X    0.420    0.258    0.502
           TangentCFT         Math       0.392    0.219    0.396
           TangentCFT+        Both       0.135    0.047    0.207
MIRMU      SCM                Math       0.119    0.056    0.058
           Formula2Vec        Math  X    0.108    0.047    0.076
           Ensemble           Math       0.100    0.033    0.051
           Formula2Vec        Math       0.077    0.028    0.044
           SCM                Math  X    0.059    0.018    0.049
NLP_NITS   formulaembedding   Math  X    0.026    0.005    0.042