TU_DBS in the ARQMath Lab 2021, CLEF

Anja Reusch, Maik Thiele and Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, Germany

Abstract
Mathematical Information Retrieval (MIR) deals with the task of finding relevant documents that contain text and mathematical formulas. Retrieval systems should therefore be able to process not only natural language but also mathematical and scientific notation. The goal of this work is to review the participation of our team in the ARQMath 2021 Lab, where two different approaches based on ALBERT and ColBERT were applied to a Question Answer Retrieval task and a Formula Similarity task. The ALBERT-based classification approach received competitive results for the first task. We found that by pre-training on data separated into chunks of text and formulas, the model performed better on formula data. This way of pre-training could also be beneficial for the Formula Search task.

Keywords
Mathematical Language Processing, Information Retrieval, BERT-based Models

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
anja.reusch@tu-dresden.de (A. Reusch); maik.thiele@tu-dresden.de (M. Thiele); wolfgang.lehner@tu-dresden.de (W. Lehner)
https://wwwdb.inf.tu-dresden.de (W. Lehner)
ORCID: 0000-0002-2537-9841 (A. Reusch); 0000-0002-1665-977X (M. Thiele); 0000-0001-8107-2775 (W. Lehner)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the rising number of scientific publications and mathematics-aware online communities, Mathematical Information Retrieval has become more important, since many of these documents and posts use not only natural language but also mathematical notation to communicate. Interpreting natural language alone is no longer sufficient for retrieval in such documents, since the mathematical notation is crucial to understand the information conveyed by the author. Hence, in order to search or retrieve information from these platforms, a retrieval system needs to understand the notation of mathematical expressions.

The ARQMath Labs 2020 [1] and 2021 [2] have two related aims: Task 1 deals with the retrieval of relevant answers given a question from the Mathematics StackExchange community. This task involves understanding the problem of the question poster in terms of natural language in combination with mathematical notation in the form of LaTeX, Symbol Layout Trees (SLTs) or Operator Trees (OPTs). For Task 2, on the other hand, the participants were required to develop a system that returns relevant formulas given a query formula.

For Natural Language based Information Retrieval, systems based on large pre-trained language models such as BERT [3] have been found to be effective and have outperformed previous, traditional IR systems that were based on string matching methods [4]. In a previous work, we showed that our approach of using an ALBERT-based classifier as a similarity measure is beneficial for Mathematical Question Answering when answers depend on the written text [5]. However, when answering the question depends on proper interpretation of the formulas, traditional methods are still more suitable.
The second disadvantage is that the average query latency of BERT-based systems is a few orders of magnitude larger than that of non-neural methods [6, 7], because an entire forward pass through the deep network needs to be performed for each query-document pair. A recent advance in terms of speed without neglecting performance is ColBERT [7], where the authors applied a late interaction mechanism to assess the relevance of a document given a query. This approach makes offline indexing of the collection and a faster evaluation possible, since at query time only one forward pass of the query is necessary. Furthermore, our ALBERT-based models have only been applied to Task 1, the retrieval of answers given a textual query question. The application of this approach to formula retrieval, as in Task 2, has not been tested yet. Therefore, with our participation in this year's ARQMath Lab we would like to address the following three areas:

• Pre-training adjustments in order to increase the models' performance on formula understanding
• Faster evaluation by using ColBERT
• Application of our ALBERT-based approach to formula retrieval

This work is structured as follows: We first introduce the ARQMath 2021 Lab and then review relevant literature on Information Retrieval and BERT-based systems for natural language and multi-modal tasks. In Section 4 the overall architecture of our approach is explained. Section 5 and Section 6 introduce Task 1 and Task 2, respectively. We describe the data sets we used to pre-train and fine-tune the different models, including a description of the experiments, and discuss their results. Finally, the last section summarizes our work.

2. ARQMath 2021 Lab

The aim of the ARQMath Lab 2021 [2] is to accelerate research in Mathematical Information Retrieval and includes two related, but different tasks: Task 1 involves the retrieval of relevant answer posts for a question asked on Mathematics StackExchange (https://math.stackexchange.com), a platform where users post questions to be answered by the community. The questions should be related to mathematics topics at any level. Users have the possibility to add mathematical formulas to their posts to clarify their questions. These formulas are written in LaTeX notation. Task 2 is built on top of the same data, but with a different goal in mind: participants are expected to retrieve relevant formulas given a query formula in the context of its post. This task is related to the formula browsing task of NTCIR-12 [8].

The participating teams submitted for each topic a ranked list of 1,000 documents retrieved by their systems, which were scored by Normalized Discounted Cumulative Gain with unjudged documents removed before assessment (nDCG'). The graded relevance scale used for scoring ranged from 0 (not relevant) to 3 (highly relevant). Two additional measures, mAP' and p'@10, were also reported using binarized relevance judgments (0 and 1: not relevant, 2 and 3: relevant). The relevance assessment was performed by pooling after the teams submitted their results.
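To make the primed measures concrete, the following Python sketch shows how nDCG' and p'@10 can be computed for a single topic once unjudged documents have been removed. It is an illustration only, with hypothetical answer ids and a linear gain function; it is not the official ARQMath evaluation tooling.

```python
import math

def ndcg_prime(ranked_doc_ids, qrels, k=1000):
    """nDCG': unjudged documents are removed from the ranking before scoring."""
    judged = [d for d in ranked_doc_ids if d in qrels][:k]     # the "prime" step
    dcg = sum(qrels[d] / math.log2(rank + 2) for rank, d in enumerate(judged))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def p10_prime(ranked_doc_ids, qrels):
    """p'@10 with binarized judgments: grades 2 and 3 count as relevant."""
    judged = [d for d in ranked_doc_ids if d in qrels][:10]
    return sum(1 for d in judged if qrels[d] >= 2) / 10

# toy example: qrels map (hypothetical) answer ids to graded relevance 0-3
qrels = {"A12": 3, "A7": 0, "A99": 2}
run = ["A12", "A5", "A7", "A99"]     # "A5" is unjudged and therefore dropped
print(ndcg_prime(run, qrels), p10_prime(run, qrels))
```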
ARQMath 2021 provides data from Mathematics StackExchange including question and answer posts from 2010 to 2018. In total, the collection contains 1M questions and 1.4M answers. Furthermore, users may use mathematical formulas to clarify their posts. These formulas, written in LaTeX notation, were extracted and parsed into Symbol Layout Trees and Operator Trees. Each formula was assigned a formula id and a visual id; formulas sharing the same visual appearance received the same visual id. Apart from this corpus of posts and formulas that is available for training and evaluating models, a test set of queries is released by the organizers of ARQMath. The query topics of 2020 and 2021 contain 99 and 100 topics, respectively, which are question posts including title, text and tags. In the 2020 test set, 77 queries were evaluated for Task 1 and 45 formula queries for Task 2, while the evaluation in 2021 included 71 queries for Task 1 and 58 for Task 2.

3. Related Work

Bidirectional Encoder Representations from Transformers (BERT) is an architecture based on the encoder of a Transformer model which was designed for language modelling [3]. Due to the success of this and other pre-trained, Transformer-based language models, BERT has become a basis of many systems for Natural Language Understanding (NLU) tasks and applications in Information Retrieval. Hence, there exist several advanced versions such as RoBERTa [9] or ALBERT [10] with the goal of optimizing BERT's performance. The influence of in-domain pre-training has been analyzed by Gururangan et al. [11], who found that it is especially valuable when the domain vocabulary has a low overlap with the pre-training data. As a consequence, various models for different domains have been developed, such as BioBERT [12], ClinicalBERT [13, 14] or SciBERT [15] for scientific domains, or CuBERT [16] and CodeBERT [17]. A notable difference between these models is that BioBERT, ClinicalBERT and CodeBERT use the original vocabulary that their base model was pre-trained on, while SciBERT and CuBERT trained their own domain-specific vocabulary. Nevertheless, each of these models demonstrated improvements compared to the original models without domain-specific pre-training. BERT-based models for mathematical domains have also been studied, with the most recent example being MathBERT [18]. In addition, during the last ARQMath Lab 2020, two teams submitted systems based on BERT and RoBERTa [19, 20]. Both teams used the models to generate post embeddings for a given question and all answers. Their similarity is calculated by comparing the vectors using cosine similarity.

Shortly after BERT outperformed previous approaches in various NLU tasks, it was also successfully applied to Information Retrieval. The model by Nogueira et al. classifies its input, consisting of a query and one document, for relevance, resulting in a score that can be used to rank multiple documents [4]. This approach achieved state-of-the-art performance, but was much slower and computationally more expensive than previous systems, because one forward pass through the entire deep neural network is necessary to score one query-document pair. Nevertheless, this approach has also been proven to be effective for the multi-modal retrieval of source code [17] and was also applied to Mathematical Question Answering using an ALBERT model trained and evaluated on the ARQMath Lab 2020 test set [5]. The evaluation results were also broken down by the categories indicating which part of the question influenced answering it the most. The model showed the best performance when answering the question depended on the written text, but for questions relying on formulas the results were worse than those of systems based on non-neural methods.
Therefore, the modeling capability for formulas needs to be improved in order to better capture their semantics.

Due to the fact that the ranking model by Nogueira et al. came with a steep increase in computational cost, recent research has focused on improving the evaluation time without neglecting its performance gains. Although there is more than one model dealing with this challenge, in this work we focus on the approach by Khattab et al. called ColBERT [7]. ColBERT uses a BERT model to separately encode a query and a document and then applies a novel late interaction mechanism to calculate the similarity. This way they achieved competitive results when re-ranking on the popular MS MARCO data set [21] with a latency of 61 ms, compared to 107,000 ms using the BERT-based approach by Nogueira et al.

4. Model Architecture

BERT-based models have proven to be effective in Natural Language Understanding and Information Retrieval tasks. Their strength has also been shown in scenarios where not only natural language plays an important role, such as Code Retrieval or Mathematical Language Processing as in this lab [17, 5, 18]. Building on top of these achievements, we apply two deep neural models based on the popular BERT in our submission: ALBERT and ColBERT. ALBERT is a more recent model based on BERT, which is optimized by a factorization of the embeddings and parameter sharing between layers. The general idea of our first approach is to employ the ALBERT model to determine the similarity score between two snippets: for Task 1 a question and an answer, and for Task 2 two formulas with context. This is achieved by fine-tuning a classifier on top of the pre-trained ALBERT model which predicts how well the two snippets match. We apply ALBERT for this approach because its optimizations result in fewer trainable parameters and therefore a lower memory consumption and accelerated training compared to BERT.

The second method that we apply for Task 1 uses a BERT model as the basis of ColBERT. The query and each document are passed through BERT separately in order to encode their respective content. This way, the representations of each document can be computed offline beforehand. The late interaction mechanism in the form of the L2 distance is applied to aggregate and compare the contextualized embeddings. Finally, the documents are ranked by this computed L2 distance.

The success of BERT and BERT-based models is attributed to the Transformer architecture [22] and also to the self-supervised pre-training on large amounts of data. In this work, we focus on the latter aspect and pre-train models on different data to highlight its influence. The overall process of our approach is depicted in Figure 1. We present details about the pre-training and fine-tuning in the next sections.

Figure 1: Overview of our pre-training (details in Figure 2) and fine-tuning procedure (details in Figures 3 and 4).

5. Task 1

The goal of Task 1 is the retrieval of answer posts from 2010–2018 for questions that were posted in 2019. The ARQMath Lab 2021 added a second set of 100 questions asked in 2020. The optimal answers retrieved by the participants are expected to answer the complete question on their own. The relevance of each answer was assessed by reviewers during a pooling process. In the following sections we present our two approaches for this mathematical question answering task. We first explain the models we used, then how we processed the data corpus for pre-training and fine-tuning. Finally, we give details on our experiments, the results and a comparison to other models.
5.1. Pre-Training

As mentioned previously, BERT and also ALBERT rely on pre-training on rather simple tasks. BERT is pre-trained using two objectives to obtain a general understanding of language: the masked language model (MLM) and next sentence prediction (NSP). Pre-training is performed on a sentence-level granularity. Each sentence 𝑆 is split into tokens: 𝑆 = 𝑤1 𝑤2 · · · 𝑤𝑁. Before inputting the sentence into the model, each token 𝑤𝑖 is embedded using a sum of three different embeddings: the word embedding 𝑡𝑖 encoding the semantics of the token, the position embedding 𝑝𝑖 denoting its position within the sentence, and the segment embedding 𝑠𝑖 used to discern between the first and the second segment when the model is presented with, e.g., two sentences as in the NSP task. The segment embeddings will later also help our model to differentiate between the query and the document as the two segments. All three embeddings are added up to form the input embedding 𝐸𝑖 for each token: 𝐸𝑖 = 𝑡𝑖 + 𝑝𝑖 + 𝑠𝑖.

In order to obtain a representation of the entire input, we prepend the sentence 𝑆 with a classification token 𝑤𝑆 = ⟨𝐶𝐿𝑆⟩. It is embedded in the same way as the other tokens and is used for the NSP task and also for fine-tuning tasks that rely on a representation of the input, such as classification. The first pre-training task is the masked language model: randomly selected tokens from the input sentence are replaced by a ⟨𝑀𝐴𝑆𝐾⟩ token or by a different token, or are left unchanged. After embedding the input, it is fed into the BERT model, consisting of 12 layers of Transformer encoder blocks, resulting in a contextualized output embedding 𝑈𝑖 for each input token: 𝐶 𝑈1 𝑈2 · · · 𝑈𝑁 = BERT(𝐸𝐶𝐿𝑆 𝐸1 𝐸2 · · · 𝐸𝑁), where 𝐸𝐶𝐿𝑆 and 𝐶 are the input and output embeddings of the ⟨𝐶𝐿𝑆⟩ token. Afterwards, a simple classifier is applied in order to predict the original word from the input: 𝑃(𝑤𝑗 | 𝑆) = softmax(𝑈𝑖 · 𝑊𝑀𝐿𝑀 + 𝑏𝑀𝐿𝑀)𝑗, where 𝑤𝑗 is the 𝑗-th word from the vocabulary. This determines the probability that the 𝑖-th input word was 𝑤𝑗 given the input sentence 𝑆. The weight matrix 𝑊𝑀𝐿𝑀 and its bias 𝑏𝑀𝐿𝑀 are only used for this pre-training task and are not reused afterwards.

The next sentence prediction objective predicts whether the sentence given to the model as the first segment 𝑆𝐴 appears in a text before the sentence given to the model as the second segment 𝑆𝐵 (label 1) or whether the second sentence is a random sentence from another document (label 0). This task is performed as a binary classification using the output embedding 𝐶 as its input: 𝑝(𝑙𝑎𝑏𝑒𝑙 = 𝑖 | 𝑆) = softmax(𝐶 · 𝑊𝑁𝑆𝑃 + 𝑏𝑁𝑆𝑃)𝑖, where the matrix 𝑊𝑁𝑆𝑃 and the bias 𝑏𝑁𝑆𝑃 are only used for NSP and are not reused later.

ALBERT also makes use of the MLM objective, but NSP, i.e., predicting whether the second sentence in the input was swapped with a sentence from another document of the corpus, has been found to be a relatively easy task and was therefore replaced by sentence order prediction (SOP). Here, the model is asked to determine the correct order of the two presented sentences. Hence, the model is presented with two sentences and performs their classification in the same way as BERT's NSP, so the formulas introduced above for NSP apply analogously. The pre-training process of BERT and ALBERT is depicted together in Figure 2. Note that BERT applies the classification on the output embedding 𝐶 for the NSP objective, while ALBERT does the same for the SOP objective; both models use the MLM objective.

Figure 2: BERT's and ALBERT's pre-training process. The value 0.9 symbolizes the NSP or SOP score for the two sentences; the red word 'values' is the predicted word for the masked token.
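For illustration, the following PyTorch sketch mirrors the embedding sum 𝐸𝑖 = 𝑡𝑖 + 𝑝𝑖 + 𝑠𝑖 and the two pre-training heads described above. The sizes and the encoder stack are placeholders standing in for the real 12-layer model; this is not the actual BERT or ALBERT implementation.

```python
import torch
import torch.nn as nn

VOCAB, MAX_LEN, HIDDEN = 30000, 512, 768   # illustrative sizes only

class ToyPretrainingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, HIDDEN)    # word embedding t_i
        self.pos = nn.Embedding(MAX_LEN, HIDDEN)  # position embedding p_i
        self.seg = nn.Embedding(2, HIDDEN)        # segment embedding s_i
        self.encoder = nn.TransformerEncoder(     # stands in for the 12 encoder blocks
            nn.TransformerEncoderLayer(HIDDEN, nhead=12, batch_first=True), num_layers=12)
        self.mlm = nn.Linear(HIDDEN, VOCAB)       # W_MLM, b_MLM
        self.nsp = nn.Linear(HIDDEN, 2)           # W_NSP, b_NSP (SOP has the same form)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)
        e = self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)  # E_i = t_i + p_i + s_i
        u = self.encoder(e)                       # contextualized embeddings C, U_1 .. U_N
        mlm_logits = self.mlm(u)                  # per-position logits, softmax gives P(w_j | S)
        nsp_logits = self.nsp(u[:, 0])            # classification on C, the <CLS> output
        return mlm_logits, nsp_logits

model = ToyPretrainingModel()
ids = torch.randint(0, VOCAB, (1, 16))            # a toy input of 16 token ids
segs = torch.zeros(1, 16, dtype=torch.long)
mlm_logits, nsp_logits = model(ids, segs)
print(mlm_logits.shape, nsp_logits.shape)         # (1, 16, 30000) and (1, 2)
```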
Pre-Training Data

Before pre-training, we applied the official tool provided by ARQMath to read the posts, wrapped formulas in $ and removed other HTML markup, yielding a list of paragraphs for each post. BERT and ALBERT models rely on sentence-separated data during pre-processing for the NSP and SOP tasks. Two different strategies were tested: (1) split the text into sentences, (2) split it into chunks of text and formulas. The SOP task is designed to work with sentences; hence, (1) is usually used in various NLP tasks. On the other hand, our goal was to increase the model's understanding of formulas. Therefore, strategy (2) first splits a paragraph into sentences, but additionally splits whenever a sentence contains a formula with more than three LaTeX tokens (to avoid splitting at, e.g., definitions of symbols). If the remaining text is too short (fewer than ten characters), it is concatenated to the preceding formula, separated by a $ sign.

Before inputting the data into the models, tokenizing, creating the pre-training data for each task, i.e., masking tokens and assembling pairs of sentences, and further pre-processing were performed by the pre-processing scripts provided in the official BERT and ALBERT repositories (https://github.com/google-research/bert, https://github.com/google-research/ALBERT). For the models that started from official checkpoints, we used the released sentencepiece vocabulary [23]. For the models that started from scratch, we trained our own sentencepiece model using the parameters recommended in the ALBERT repository, which had a vocabulary overlap of 32.1% with the released sentencepiece vocabulary for ALBERT. Sentencepiece tokenizes the input into subwords using byte-pair encoding [24]; e.g., the sentence 'how can i evaluate $ \sum_{n=1}^\infty \frac{2n}{3^{n+1}} $?' would be tokenized into 'how can i evaluate $ \ sum _ { n = 1 } ^ \ in ##ft ##y \ fra ##c { 2 ##n } { 3 ^ { n + 1 } } $?' by the BERT tokenizer, where single tokens are separated by spaces. Input sequences whose length after tokenization exceeded the maximum number of input tokens were truncated to the maximum length. If two segments together exceeded the maximum length, e.g., during NSP or fine-tuning, tokens were removed one by one from the longer sequence until the combined length of both segments equaled the maximum length.

Figure 3: Architecture of Task 1's Fine-Tuning.

5.2. ALBERT Model

In order to predict whether an answer 𝐴 = 𝐴1 𝐴2 · · · 𝐴𝑀 is relevant to a question 𝑄 = 𝑄1 𝑄2 · · · 𝑄𝑁, a classifier is trained on top of the pre-trained ALBERT model as depicted in Figure 3. The input string ⟨𝐶𝐿𝑆⟩ 𝑄1 𝑄2 · · · 𝑄𝑁 ⟨𝑆𝐸𝑃⟩ 𝐴1 𝐴2 · · · 𝐴𝑀, with ⟨𝐶𝐿𝑆⟩ being the classification token and ⟨𝑆𝐸𝑃⟩ the separation token, is presented to the model: 𝐶 𝑈1 𝑈2 · · · 𝑈𝑁+𝑀 = ALBERT(𝐸𝐶𝐿𝑆 𝐸1 𝐸2 · · · 𝐸𝑁+𝑀), where 𝐸𝑖 and 𝐸𝐶𝐿𝑆 are the input embeddings of each input token and the ⟨𝐶𝐿𝑆⟩ token, respectively, calculated as explained in Section 5.1. After the forward pass through the model, the output vector of the ⟨𝐶𝐿𝑆⟩ token, 𝐶, is given to a classification layer: 𝑝(𝑙𝑎𝑏𝑒𝑙 = 𝑖 | 𝑄, 𝐴) = softmax(𝐶 · 𝑊𝑇𝑎𝑠𝑘1 + 𝑏𝑇𝑎𝑠𝑘1)𝑖, where label 1 stands for a matching or correct answer for the query and label 0 otherwise. During evaluation, the resulting probability of the classification layer for label 1 is the assigned similarity score 𝑠 for answer 𝐴 to question 𝑄 and is used to rank the answers in the corpus: 𝑠(𝑄, 𝐴) = 𝑝(𝑙𝑎𝑏𝑒𝑙 = 1 | 𝑄, 𝐴).
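The fine-tuned classifier is thus used as a ranking function. The sketch below re-states this scoring step with the HuggingFace transformers API purely for illustration; the experiments in this work were run with the official ALBERT repository, and 'albert-base-v2' merely stands in for the actual fine-tuned checkpoint.

```python
import torch
from transformers import AlbertTokenizerFast, AlbertForSequenceClassification

# "albert-base-v2" is a placeholder; in practice the fine-tuned relevance classifier is loaded
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2", num_labels=2)
model.eval()

def score(question: str, answer: str) -> float:
    """s(Q, A) = p(label = 1 | Q, A): probability that the answer matches the question."""
    inputs = tokenizer(question, answer, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # classification on the <CLS> output C
    return torch.softmax(logits, dim=-1)[0, 1].item()

query = "How can I evaluate $ \\sum_{n=1}^\\infty \\frac{2n}{3^{n+1}} $?"
answers = ["Differentiate the geometric series and plug in $x = 1/3$ ...",
           "Use integration by parts ..."]
ranked = sorted(answers, key=lambda a: score(query, a), reverse=True)
```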
Fine-Tuning Data

In order to fine-tune the ALBERT models for Task 1, we paired each question with one correct and one incorrect answer. The correct answer was randomly chosen from the answers of the question. Each question in the corpus comes along with tags, i.e., categories indicating the topic of a question such as sequences-and-series or limits. As an incorrect answer for each question, we picked a random answer from a question sharing at least one tag with the original question. Following this procedure, we yielded 1.9M examples, of which 90% were used as training data for the fine-tuning task. We presented the model with the entire text of the questions and answers using the structure introduced in the previous section.

5.3. ColBERT Model

Our second approach was to train ColBERT on top of a pre-trained BERT model. In each training step, the model is presented with the query 𝑄 and two answers: one being a relevant answer 𝐴, the second being an answer 𝐵 that should be regarded as non-relevant by the model. All three strings 𝑄, 𝐴 and 𝐵 are prepended with a token denoting the string as either question (query), ⟨𝑄⟩, or answer (document), ⟨𝐷⟩, and are passed through the BERT model individually to create contextualized embeddings for each post: 𝐶𝑄 𝑄 𝑈1 𝑈2 · · · 𝑈𝑁 = BERT(𝐸𝐶𝐿𝑆 𝐸𝑄 𝐸1 𝐸2 · · · 𝐸𝑁) and 𝐶𝐷 𝐷 𝑉1 𝑉2 · · · 𝑉𝑀 = BERT(𝐸𝐶𝐿𝑆 𝐸𝐷 𝐹1 𝐹2 · · · 𝐹𝑀), where 𝐸𝑖, 𝐹𝑖, 𝐸𝐶𝐿𝑆, 𝐸𝑄 and 𝐸𝐷 are the input embeddings of the input tokens, the ⟨𝐶𝐿𝑆⟩ token, the ⟨𝑄⟩ token and the ⟨𝐷⟩ token, respectively, calculated as explained in Section 5.1. Using the late interaction mechanism as specified by Khattab et al. [7], a relevance or similarity score is calculated for each question-answer pair and optimized by applying a softmax cross-entropy loss over the scores: 𝑠(𝑄, 𝐴) = ∑_{𝑖=1}^{𝑁} max_{𝑗 ∈ {1,...,𝑀}} 𝑈𝑖 𝑉𝑗ᵀ. More implementation-specific details can be found in the work by Khattab et al. [7].

Fine-Tuning Data

We used the same procedure to generate training data for the ColBERT-based models, but with the difference that we used up to 𝑁𝑎𝑛𝑠𝑤𝑒𝑟𝑠 = 10 correct and incorrect answers in case a question had that many submitted answers. If fewer answers were present, the minimum of the numbers of correct and incorrect answers was used, such that the numbers of correct and incorrect answers matched. We paired each correct answer with all incorrect answers, generating at most 10 × 10 = 100 samples for each question. We experimented with 𝑁𝑎𝑛𝑠𝑤𝑒𝑟𝑠 = 1 and 𝑁𝑎𝑛𝑠𝑤𝑒𝑟𝑠 = 5, but achieved the best results with 𝑁𝑎𝑛𝑠𝑤𝑒𝑟𝑠 = 10.

5.4. Evaluation Data

During evaluation, we exploited the tag information from the queries in order to rank only the answers that shared at least one tag with the query question. In this way, we saved large amounts of computation time for the ALBERT-based models. Each question and the answers were pre-processed and paired in the same way as during fine-tuning.

Table 1: Overview of the pre-training configurations of the ALBERT models
Model               Pre-Training Data      Steps
Base 750k           (1) sentence split     750k
Base 250k           (1) sentence split     250k
Base Combined       (1)+(2) combined       135k
Scratch 1M          (1) sentence split     1M
Scratch 2M          (1) sentence split     2M
Scratch Separated   (2) separated          1M

For ColBERT, we generated an index based on all answers whose question had at least one tag that was associated with at least one query question. For each query, the organizers of the lab annotated whether answering the question mostly depends on its text, its formulas or both. We used these categories for the interpretation of our results.
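The late interaction scoring introduced in Section 5.3 can be written in a few lines. The following PyTorch sketch assumes the contextualized token embeddings of a query and a document have already been computed and L2-normalized; it is an illustration rather than the actual ColBERT implementation, which additionally handles punctuation masking, batching and indexing.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """s(Q, A) = sum over query tokens i of max over document tokens j of U_i . V_j."""
    sim = query_emb @ doc_emb.T            # (N, M): all pairwise token similarities
    return sim.max(dim=1).values.sum()     # max over document tokens, summed over query tokens

# toy example with random 128-dimensional, L2-normalized token embeddings
torch.manual_seed(0)
U = F.normalize(torch.randn(32, 128), dim=-1)    # 32 query tokens
V = F.normalize(torch.randn(180, 128), dim=-1)   # 180 document tokens
print(late_interaction_score(U, V))

# The document embeddings V can be computed and indexed offline; at query time only the
# query needs one forward pass before this cheap comparison. The runs submitted here
# optimized an L2 variant, where `sim` would be the negative squared L2 distance instead.
```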
5.5. Experiments

We tested various scenarios for training ALBERT, of which we report six in this work. The models Base 750k, Base Combined and Base 250k were initialized from the official weights of the ALBERT base model released by the ALBERT authors and were further pre-trained on ARQMath data using strategy (1), i.e., sentence-split text (see Section 5.1). The data pre-processed with strategy (2), i.e., data split into text and LaTeX chunks, was mixed with the aforementioned data to pre-train Base Combined. The weights of Scratch 1M, Scratch Separated and Scratch 2M were initialized randomly. Scratch 1M and Scratch 2M used the sentence-split data (1), while Scratch Separated was pre-trained only on the separated data of strategy (2). All six models followed the recommended hyperparameter configuration for pre-training, with 12M parameters, the LAMB optimizer [25], 3,125 warm-up steps, a maximum sequence length of 512 and a vocabulary size of 30,000. Furthermore, we used a batch size of 32 and a learning rate of 0.0005. The models were trained for different numbers of steps: Base 750k was trained for 750k steps, while the training of Base 250k was stopped after 250k steps. Scratch 1M and Scratch Separated were pre-trained for 1M steps; this amount was doubled for Scratch 2M. Finally, Base Combined could only be trained for 135k steps before the final submission of the results. A summary of the different combinations of pre-training data and number of steps for each model can be found in Table 1. After pre-training, each classification model was fine-tuned for 125k steps using a batch size of 32, a learning rate of 2e-5 and 200 warm-up steps. Both pre-training and fine-tuning were performed using the code published in the official ALBERT repository. We submitted the results of four ALBERT-based models to the ARQMath 2021 Lab and evaluated Base 250k and Scratch 2M using the official evaluation tools.

ColBERT can be seen as an extension of BERT whose performance depends on its pre-training [7]. Therefore, we apply three differently pre-trained models as the basis for ColBERT: ColBERT uses the weights of the original BERT-base, ColSciBERT uses SciBERT [15], which was trained on a large corpus of scientific publications from multiple domains, and finally, we pre-trained our own BERT model for ColARQBERT. The last model was initialized using the original BERT weights and was then further pre-trained on the sentence-split data (1) described earlier. The hyperparameters recommended by the BERT authors in their repository were used to pre-train this model: the learning rate was set to 2e-05, one batch contained 16 samples and the model was trained for 500k steps. In contrast to the recommendations, we set the maximum input length to 512 from the beginning, because we did not train the model from scratch (where the initial sequence length is set to 128) but rather further trained the already fully pre-trained model on additional data. The training of all three ColBERT models used the same hyperparameter configuration. We optimized the L2 similarity between 128-dimensional vectors with a batch size of 128 for 75k steps. Other parameters kept their default values.
Punctuation tokens were masked; we also experimented with models that did not mask them, but could not see a significant difference in the results. We also started to incorporate ALBERT as a base model for ColBERT, but have not yet found a configuration for successful training. The pre-training of ColARQBERT was performed using the code published by the BERT authors, while the ColBERT repository was slightly adapted to support checkpoints other than BERT base in order to train the other models. Finally, the ColSciBERT model was submitted to the competition, while ColBERT and ColARQBERT were evaluated later using the official evaluation guide.

5.6. Evaluation

The results of our ALBERT- and ColBERT-based models are shown in Table 2, together with additional experiments that were not submitted and results of other models from the ARQMath 2021 Lab for comparison. We report the scores on the 2020 test set and the new 2021 test set. In addition, we break down the nDCG' results of 2020 by the category indicating on which part of the question answering depends. These categories are text, formula or both in combination and were annotated by the organizers of the lab. The scores for each category are reported in Table 3.

Pre-Training Adjustments

In general, our results can be seen as competitive. Regarding nDCG', all ALBERT-based models outperform the baseline systems in both years. On the 2020 test set, one team with three systems received the highest scores for mAP', while our ALBERT-based models are all in the range of the top four teams. In 2021, our best model ranks second among all teams regarding mAP'. Our results for p'@10 are not as high as the best baseline, but no system from any of the teams could beat the baseline results for p'@10. Compared to the other participants, our system receives the highest score for p'@10 in 2021. The reason why our precision is relatively high while our nDCG' is lower than that of the teams with higher scores could be that our systems do not rank all answers for each topic due to the too time-consuming evaluation. Possibly, our results would have been better if all answers had been scored for their relevance. We will now take a deeper look at the differences between the models we trained. When comparing Base 750k and Base 250k, the overall score is slightly increased by the longer pre-training.
Table 2: Results of Task 1
                                              2020                    2021
Model               Official Identifier       nDCG'  mAP'   p'@10     nDCG'  mAP'   p'@10
Base 750k           TU_DBS_P                  0.380  0.198  0.316     0.377  0.158  0.227
Scratch 1M          TU_DBS_A1                 0.362  0.178  0.304     0.353  0.132  0.180
Base Combined       TU_DBS_A3                 0.359  0.173  0.299     0.357  0.141  0.194
Scratch Separated   TU_DBS_A2                 0.356  0.173  0.291     0.367  0.147  0.217
ColSciBERT          TU_DBS_A4                 0.045  0.016  0.071     0.028  0.004  0.009
Additional Experiments
Base 250k           -                         0.375  0.193  0.311     -      -      -
Scratch 2M          -                         0.359  0.177  0.297     -      -      -
ColARQBERT          -                         0.225  0.073  0.131     -      -      -
ColBERT             -                         0.183  0.053  0.110     -      -      -
ARQMath Competitors
Best '20&'21        MathDowsers-primary       0.433  0.191  0.249     0.434  0.169  0.211
Best '20            DPRL-RRF                  0.422  0.247  0.386     0.347  0.101  0.132
Best Baseline       linked_results            0.279  0.194  0.386     0.203  0.120  0.282

Table 3: nDCG' scores of Task 1 by category, 2020 test set
Category   Base 750k   Scratch 1M   Base Combined   Scratch Separated   Base 250k
Both       0.365       0.365        0.364           0.321               0.370
Formula    0.382       0.354        0.338           0.367               0.366
Text       0.411       0.375        0.399           0.421               0.408

In Table 3 we see that with longer pre-training the model learned a better understanding of text and formulas on their own, but for the category 'both' the results decreased. On the other hand, pre-training for too many steps shows effects of over-fitting, as the scores start to decrease again, which we see in the difference between Scratch 1M and Scratch 2M. The comparison of Scratch 1M and Scratch Separated shows that the separation of text and mathematical formulas leads to better nDCG' scores for queries dependent on formulas or text separately, but the performance degrades on question-answer pairs that depend on both, which is expected since the model was not pre-trained on data that involved both in one example. Base Combined has a much lower nDCG' value for the formula category in comparison to the other models. This can be explained by the fact that it was pre-trained for a much lower number of steps. The same effect is visible when comparing Base 750k and Base 250k. Therefore, we hypothesize that pre-training Base Combined for 750k or even more steps could outperform Base 750k and Scratch Separated in all three categories. BERT-based models generally benefit from long pre-training on a large corpus. In our experiments, we could not observe this behavior: we experimented with models trained for 2M steps on data from 41 StackExchange communities supporting LaTeX, but the results are worse than the ones presented in Table 2.

ColBERT

ColSciBERT is the fifth model we submitted for the 2020 ARQMath Task 1, and it was trained using ColBERT. As can be seen from the results table, its performance is not optimal, hinting at a substantial problem during training or evaluation. This could be caused by using SciBERT as the basis for ColBERT. Two other models that were not officially submitted to the Lab received higher scores, but are still not on par with our ALBERT-based approaches regarding all three metrics. This confirms the hypothesis that SciBERT is not suitable in this scenario. Nevertheless, with ColBERT, evaluating all 100 topics of 2020 took around six minutes using two NVIDIA GTX 1080 GPUs, while evaluating one query using our ALBERT-based classification approach took between ten minutes and one hour on one NVIDIA V100. Therefore, further research in this direction is worthwhile for speeding up the evaluation while receiving competitive scores at the same time. Future work should further analyze the performance and determine the best training scenario for a ColBERT-based system.
6. Task 2

Task 2 deals with the retrieval of relevant formulas given a query formula together with the post in which it appeared. As for Task 1, we start from a pre-trained ALBERT model, which was already introduced in Section 5.1. In the following section, we therefore only highlight how the fine-tuning and data processing were performed and present the results of applying an ALBERT-based model to the task of formula similarity search.

6.1. Fine-Tuning

Model

For formula similarity search, our approach differs slightly from the one presented in Section 5.2 for Task 1. Instead of presenting the model with the two formulas as query and answer to classify whether they are relevant to each other, we add the question in which the first formula appears as additional context, since the same formula and especially its variables can have different interpretations depending on the context, e.g., 𝑃(𝑋) can be the probability of a random variable or a polynomial. Each query formula was concatenated with its question, forming the first part of the input 𝑄1 𝑄2 · · · 𝑄𝑁, separated by $. The formula that should be assessed for its similarity to the first formula makes up the second part of the input 𝐴1 𝐴2 · · · 𝐴𝑀. Analogously to Task 1, the classification token ⟨𝐶𝐿𝑆⟩ is added at the beginning of the input and both parts are separated by the separation token ⟨𝑆𝐸𝑃⟩. The output of the classifier is the similarity score that is optimized during training and used for the ranking of candidates during evaluation. The process of fine-tuning ALBERT for Task 2 is depicted in Figure 4.

Figure 4: Architecture of Task 2's Fine-Tuning.

Fine-Tuning Data

Fine-tuning was performed on formulas in context with the post in which they appeared in order to provide the model with information on how the formula was used by the author. First, we removed formulas that contained fewer than three LaTeX tokens and filtered the ARQMath corpus for posts that had at least one formula remaining. For each post in the corpus, one formula was chosen as the query formula, either by chance (denoted as random) or as the longest formula (denoted as longest). Formulas from the title of the post were preferred when the title contained any formulas, because the title can be seen as a summary of the post and should include the formula with the most meaning to the post. We chose this procedure because we faced the problem that the posts contain many formulas (on average 9.41 formulas per post), not all of which are directly relevant to the post or the given answers, such as definitions of variables or examples. For each query formula we determined positive and negative examples. The positive examples, i.e., the ones that should be classified as relevant formulas by the model, originate from the answers given to a post. This assumes that the formulas in the answers are relevant to the formulas in the question post. We chose the negative examples from answer posts whose questions had at least one tag in common with the query post. For each query we used a maximum of five positive and five negative examples, where the numbers of positive and negative examples were equal. Hence, when a question had only 𝑁𝑓𝑜𝑟𝑚𝑢𝑙𝑎𝑠 < 5 formulas in its answers, or we found only 𝑁𝑓𝑜𝑟𝑚𝑢𝑙𝑎𝑠 < 5 formulas in other posts with the same tags, then only 𝑁𝑓𝑜𝑟𝑚𝑢𝑙𝑎𝑠 formulas were used as positive and negative examples. In total, we yielded 5,812,412 question-answer formula pairs, of which 90% were used as training data. The training data was presented to the model in the structure described above.
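The pair construction just described can be summarized in a short sketch. The Post container, field names and helper functions below are hypothetical stand-ins for illustration under the stated assumptions; the actual pre-processing is part of the released project code.

```python
import random
import re
from dataclasses import dataclass, field

@dataclass
class Post:
    # hypothetical container; the real corpus is read with the official ARQMath tools
    question_text: str
    tags: list
    title_formulas: list = field(default_factory=list)
    body_formulas: list = field(default_factory=list)
    answer_formulas: list = field(default_factory=list)

def latex_token_count(formula: str) -> int:
    # crude count of LaTeX tokens (commands or single characters), for illustration only
    return len(re.findall(r"\\[a-zA-Z]+|\S", formula))

def build_pairs(post: Post, formulas_by_tag: dict, n_max=5, strategy="random"):
    """Create (first_segment, candidate_formula, label) triples for one question post."""
    pool = post.title_formulas or post.body_formulas            # prefer formulas from the title
    candidates = [f for f in pool if latex_token_count(f) >= 3]
    if not candidates:
        return []
    query = (random.choice(candidates) if strategy == "random"
             else max(candidates, key=latex_token_count))
    positives = post.answer_formulas                             # formulas from the post's own answers
    negatives = formulas_by_tag.get(random.choice(post.tags), [])  # answer formulas of same-tag posts
    n = min(len(positives), len(negatives), n_max)               # keep positives and negatives balanced
    first_segment = query + " $ " + post.question_text           # query formula plus question context
    return ([(first_segment, f, 1) for f in positives[:n]] +
            [(first_segment, f, 0) for f in negatives[:n]])

# toy usage with made-up formulas
p = Post("How can I evaluate this series?", ["sequences-and-series"],
         title_formulas=["\\sum_{n=1}^\\infty \\frac{2n}{3^{n+1}}"],
         answer_formulas=["\\frac{3}{2}", "\\sum_{n=1}^\\infty n x^{n-1}"])
print(build_pairs(p, {"sequences-and-series": ["\\int_0^1 x^2 \\, dx", "e^{i\\pi}+1=0"]}))
```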
6.2. Evaluation Data

Due to time and hardware constraints, it was not possible to evaluate our models on the entire collection of formulas. Therefore, we limited the corpus to visually distinct formulas with more than three LaTeX tokens which appeared in questions and answers that shared either one or all tags with the query post, denoted by 'one' or 'all' in the results table, respectively. The ARQMath collection provides each formula with its visual id and the post id in which it appeared. We determined the tags of the posts from the post corpus and aggregated for each visual id the tags of all posts in which formulas with this visual id appeared. For each visual id that remained in the corpus for a given query, we provided the model with the query formula, its post and the first formula that was associated with this visual id. The post id that was reported as our Task 2 result was the post id corresponding to the formula in the model input, i.e., the post id of the first formula for each visual id.

Table 4: Results of Task 2
                                                              2020                    2021
Fine-Tuning Data   Eval Tags   Official Identifier            nDCG'  mAP'   p'@10     nDCG'  mAP'   p'@10
random             one         TU_DBS_A3                      0.426  0.298  0.386     -      -      -
longest            one         TU_DBS_A1                      0.396  0.271  0.391     -      -      -
random             all         TU_DBS_A2                      0.157  0.085  0.122     0.154  0.071  0.217
longest            all         TU_DBS_P                       0.152  0.080  0.122     0.153  0.069  0.216
Best 2020          -           DPRL-ltrall                    0.738  0.525  0.542     0.445  0.216  0.333
Best 2021          -           Approach0-P300                 0.507  0.342  0.441     0.555  0.361  0.488
Baseline           -           TangentS_Res                   0.691  0.446  0.453     0.492  0.272  0.419

6.3. Experiments

In total, we trained two classifiers on different fine-tuning data and evaluated each of them in two settings as described above. These four configurations can be found in Table 4. Both models are based on the pre-trained ALBERT-base model that was used for the Task 1 model Base 750k. The fine-tuning hyperparameters are the same as for the models of Task 1: we used a learning rate of 2e-5, a batch size of 32 and 200 warm-up steps. Both classifiers were trained for 125k steps.

6.4. Evaluation

The results of our experiments for Task 2 can be seen in Table 4. Our best model on the 2020 topics was fine-tuned with random formulas as queries and evaluated on all distinct question and answer formulas that shared at least one tag with the query post. In general, the two models trained on data with a random query formula showed better results than the two models always using the longest formula. Including all formulas that share at least one tag with the query increases the search space, and the results for formulas retrieved from this larger search space are better. This suggests that the performance could be increased even further if all formulas from the entire corpus were used for the classification. This was not done in this work due to our hardware constraints leading to long evaluation times. Generally speaking, our ALBERT-based model is a promising approach, but the comparison to other participants of the lab demonstrates that it is not yet on par with state-of-the-art models or the baseline system. Hence, future work should explore better methods of representing formulas using ALBERT. In this work, only LaTeX-based representations have been used, but ARQMath also provides tree-based data for each formula. One possible improvement could be the prediction of relationships between nodes in these trees, as it was done in a similar way for Programming Language Understanding using data flow graphs [26].
Furthermore, as seen in the evaluation of Task 1 in Section 5.6, pre-training on data that separates text and formulas improved the scores on formula-dependent questions, pointing to a better understanding of mathematical formulas compared to models trained on non-split data. Therefore, these pre-trained models could also be beneficial as a basis for Task 2 fine-tuning.

7. Conclusion

Mathematical Information Retrieval deals with the retrieval of documents from a corpus which are relevant to a query, where documents and queries may include both natural language and mathematical formulas. Two instances of such an objective are Task 1 and Task 2 of the ARQMath Lab, whose goals are to retrieve answers given a question and to retrieve formulas given a formula in its context, respectively. Since this challenge includes not only text written in English but also formulas, approaches from Natural Language Processing and Information Retrieval have to be adapted in order to also interpret the semantics of mathematical formulas. This has been demonstrated in our previous work, where we showed that ALBERT has to be further pre-trained on relevant data in order to better handle formulas in MIR tasks. In this work we further analyzed this claim and showed that our previous results for Task 1 could be improved by longer pre-training on the data provided by ARQMath. Furthermore, we showed that separating large chunks of natural language text and LaTeX notation within one sentence increased the model's performance on formula-only and text-only dependent questions, respectively. The second contribution of this work was to explore the application of ColBERT in order to accelerate the evaluation of queries, because our classification-based approach is too time-consuming. To this end, we trained and evaluated a ColBERT model and showed that further improvements are necessary before this approach can reach state-of-the-art performance. We also applied our ALBERT-based approach to the formula retrieval objective of Task 2 and showed that there is still work necessary to improve the model's understanding of formula similarity.

In conclusion, we showed promising approaches for Mathematical Answer Retrieval and Formula Similarity Search by applying differently pre-trained and fine-tuned ALBERT models and one ColBERT model. In order to improve the modeling capabilities for mathematical formulas, we recommended strategies involving several pre-training methods that include syntactical features of formulas that we have not yet taken into account. To facilitate research based on our work, we release the code for data pre-processing and the training of the models in the project's repository (https://github.com/AnReu/ALBERT-for-Math-AR). The source code for training the ColBERT-based models was forked from the official ColBERT repository and slightly adjusted (https://github.com/AnReu/ColBERT-for-Formulas).

Acknowledgments

This work was supported by the DFG under Germany's Excellence Strategy, Grant No. EXC-2068-390729961, Cluster of Excellence "Physics of Life" of TU Dresden. Furthermore, the authors are grateful for the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden. We would also like to thank the reviewers for their helpful comments and recommendations.

References
[1] B. Mansouri, A. Agarwal, D. Oard, R. Zanibbi, Finding old answers to new math questions: the arqmath lab at clef 2020, in: European Conference on Information Retrieval, Springer, 2020, pp. 564–571.
[2] B. Mansouri, A. Agarwal, D. Oard, R. Zanibbi, Advancing math-aware search: The arqmath-2 lab at clef 2021 (2021) 631–638.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[4] R. Nogueira, K. Cho, Passage re-ranking with bert, arXiv preprint arXiv:1901.04085 (2019).
[5] A. Reusch, M. Thiele, W. Lehner, An albert-based similarity measure for mathematical answer retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, to appear.
[6] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, Cedr: Contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1101–1104.
[7] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via contextualized late interaction over bert, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[8] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, K. Davila, Ntcir-12 mathir task overview, in: NTCIR, 2016.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[11] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8342–8360.
[12] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[13] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, W. Redmond, M. B. McDermott, Publicly available clinical bert embeddings, NAACL HLT 2019 (2019) 72.
[14] K. Huang, J. Altosaar, R. Ranganath, Clinicalbert: Modeling clinical notes and predicting hospital readmission, arXiv preprint arXiv:1904.05342 (2019).
[15] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3606–3611.
[16] A. Kanade, P. Maniatis, G. Balakrishnan, K. Shi, Learning and evaluating contextual embedding of source code, in: International Conference on Machine Learning, PMLR, 2020, pp. 5110–5121.
[17] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., Codebert: A pre-trained model for programming and natural languages, arXiv preprint arXiv:2002.08155 (2020).
[18] S. Peng, K. Yuan, L. Gao, Z. Tang, Mathbert: A pre-trained model for mathematical formula understanding, arXiv preprint arXiv:2105.00377 (2021).
[19] S. Rohatgi, J. Wu, C. L. Giles, Psu at clef-2020 arqmath track: Unsupervised re-ranking using pretraining, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020.
[20] V. Novotný, P. Sojka, M. Štefánik, D. Lupták, Three is better than one, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020.
[21] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, Ms marco: A human generated machine reading comprehension dataset, in: CoCo@NIPS, 2016.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017) 5998–6008.
[23] T. Kudo, J. Richardson, Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[24] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–1725. URL: https://www.aclweb.org/anthology/P16-1162. doi:10.18653/v1/P16-1162.
[25] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, C.-J. Hsieh, Large batch optimization for deep learning: Training bert in 76 minutes, in: International Conference on Learning Representations, 2019.
[26] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, M. Zhou, Graphcodebert: Pre-training code representations with data flow, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=jLoC4ez43PZ.