Transformer-Encoder and Decoder Models for Questions on Math

Anja Reusch¹, Maik Thiele², and Wolfgang Lehner¹
¹ Database Systems Group, Technische Universität Dresden, Germany
² Hochschule für Technik und Wirtschaft Dresden, Germany

Abstract
This work summarizes our submission to ARQMath-3. We pre-trained Transformer-Encoder-based language models for the task of mathematical answer retrieval and employed a Transformer-Decoder model for the generation of answers to questions from the mathematical domain. In comparison to our submission to ARQMath-2, we improved the performance of our models on all three metrics nDCG', mAP', and p'@10 through refined pre-training and enlarged fine-tuning data. In addition, we improved our p'@10 results even further by additionally fine-tuning on annotated test data from ARQMath-2. In summary, our findings confirm that Transformer-based models benefit from domain-adaptive pre-training in the mathematical domain.

Keywords
Mathematical Language Processing, Information Retrieval, Transformer-based Models

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy
anja.reusch@tu-dresden.de (A. Reusch); maik.thiele@htw-dresden.de (M. Thiele); wolfgang.lehner@tu-dresden.de (W. Lehner)
https://wwwdb.inf.tu-dresden.de (W. Lehner)
ORCID: 0000-0002-2537-9841 (A. Reusch); 0000-0002-1665-977X (M. Thiele); 0000-0001-8107-2775 (W. Lehner)

1. Introduction

With a rising number of scientific publications, retrieval of information from documents containing mathematical notation has recently received more attention. The task of Mathematical Information Retrieval (MIR) deals with finding relevant documents for a query, where both document and query may include mathematical notation such as LaTeX expressions besides natural language. As for many text-based tasks, Transformer-based models have demonstrated great potential for MIR. They can even be applied as stand-alone models when adapted to the domain, which is necessary because models like BERT [1] or ALBERT [2] were originally pre-trained on documents that did not contain mathematical notation. Recent research has therefore focused on adapting these models to the domain of mathematics by additional pre-training on the Mathematics StackExchange (MathSE). However, the various base models such as BERT and ALBERT were further pre-trained and evaluated using different methods or data sets [3, 4, 5]. Hence, a fair comparison of which base model is best suited for MIR is not possible. In order to evaluate their impact under the same conditions, we start our submission by pre-training and fine-tuning three Transformer-Encoder models, namely ALBERT, BERT, and RoBERTa, on MIR.

At the core of this work, we refine our ALBERT-based approach from ARQMath-2. There, the model was pre-trained on MathSE serving as in-domain data and fine-tuned on a classification task with the objective of predicting whether a given post answers the question. The resulting classification probability assigned by the model was used to rank the answers. The success of models like BERT is attributed to pre-training on large, diverse corpora. To evaluate the influence of the corpus, we experiment with pre-training on the AMPS corpus [6], which consists of 23 GB of question and answer pairs.
Compared to 2021, we also enlarged our fine-tuning corpus by 152%. Furthermore, we leverage the annotated test data of last year to construct more informed training data for MIR and study the impact of this new training data on our models. Finally, ARQMath-3 includes the new task of generating answers instead of retrieving them from the corpus. Our team fine-tuned a GPT-2 model [7] on the AMPS as well as on the provided MathSE corpora to generate solutions for the questions from the retrieval task. In summary, our submission to the ARQMath Lab 2022 Task 1 (Answer Retrieval) and Task 3 (Answer Generation) focuses on three areas:

• A comparison of three pre-trained Transformer-Encoder models: BERT, ALBERT, RoBERTa,
• The impact of pre-training and fine-tuning data on MIR,
• The application of Transformer-Decoder models to Mathematical Answer Generation.

Our evaluation shows that all our submitted models outperform the best models of ARQMath-2. Our ALBERT model trained on three different versions of the MathSE data and the enlarged fine-tuning data demonstrated the best performance. We improved p'@10 even further by fine-tuning on annotated data from the 2021 run. Training models based on RoBERTa also turned out to be a promising approach, even though it requires longer computing time. All our models are made publicly available as part of the Huggingface Model Hub (https://huggingface.co/AnReu/albert-for-arqmath-3).

The remainder of this document is structured as follows: Section 2 reviews related work in the field of MIR and related generation tasks. We introduce the tasks of the lab in Section 3 and the models in Section 4. Section 5 describes our model setup and the results for Task 1, while Section 6 summarizes our efforts for Task 3. The final section concludes this work.

2. Related Work

Deep learning models based on Encoders or Decoders of the Transformer architecture [8] have been widely adopted for Natural Language Processing and Information Retrieval tasks in recent years. Encoder models like BERT [1], ALBERT [2], and RoBERTa [9] have been applied to various domains including scientific literature [10], medical documents [11, 12], and source code [13, 14]. Decoder models are typically used to generate text, with the most prominent examples being GPT-1, GPT-2, and GPT-3 [15, 7, 16].

Transformer-Encoder-based models for mathematical domains have also been studied, one example being MathBERT [17], where mathematical formulas in the form of operator trees are used as input for pre-training. During the ARQMath Labs in 2020 and 2021, five teams submitted systems based on BERT, RoBERTa, and SentenceBERT [18, 19, 20, 21, 22, 23], where the models were used without domain adaptation for the downstream tasks. Only [4] pre-trained their submissions on mathematical documents. [3] fine-tuned a BERT model for notation prediction tasks on scientific documents by enlarging the vocabulary of BERT with additional LaTeX tokens. [5] followed a similar procedure for mathematical documents.

Generative models have also found their way into the mathematical domain with GPT-f, a Transformer-based proof solver [24]. [6] introduced two new data sets, one for measuring the performance of generative mathematical language models and one for pre-training, and published benchmarks based on GPT-2 and GPT-3 along with them. All of these only used exercise-level data sets and not a community data set like the MathSE data in the task at hand.
3. ARQMath 2022 Lab

The overall goal of ARQMath Lab 2022 (ARQMath-3) [25] is to accelerate research in mathematical Information Retrieval. The lab consists of three tasks offering three different scenarios. Task 1 involves mathematical answer retrieval for a question asked on the Mathematics StackExchange (https://math.stackexchange.com), a platform where users post questions on mathematical topics to be answered by the community. The goal of this task is the retrieval of answer posts from 2010-2018 for questions that were posted in 2019. The evaluation data of ARQMath-1 contain 99 query topics, while ARQMath-2 and ARQMath-3 provide 100 each; a topic is a question post including title, text, and tags. In the 2020 test set, 77 queries were evaluated for Task 1, the 2021 evaluation included 71 queries, and ARQMath-3 evaluated 78 queries. The optimal answers retrieved by the participants are expected to answer the complete question on their own. The relevance of the question-answer pairs was assessed by reviewers during the evaluation process; this relevance assessment was performed by pooling after the teams had submitted their results. For each topic, the participating teams submitted a ranked list of 1,000 documents retrieved by their systems, which were scored by Normalized Discounted Cumulative Gain with unjudged documents removed before assessment (nDCG'). The graded relevance scale used for scoring ranged from 0 (not relevant) to 3 (highly relevant). Two additional measures, mAP' and p'@10, were also reported using binarized relevance judgments (0 and 1: not relevant, 2 and 3: relevant).

Task 2 of ARQMath is built on top of the same data as Task 1, but with a different goal in mind: participants are expected to retrieve relevant formulas given a query formula in the context of its post. This task is related to the formula browsing task of NTCIR-12 [26].

ARQMath-3's new Task 3 presents an open-domain question answering scenario instead of finding the most relevant answers for a given question. For each of the 100 topics of Task 1 in the 2022 test set, the participants are asked to extract or generate a single answer. These answers contributed to the pool of answers which were judged for Task 1. Any knowledge source was allowed as training data, except for the MathSE data from 2019 to today. The evaluation is carried out by judging the Average Relevance (AR) of the answers and the Precision at 1 (p@1) for each topic.

Apart from the task definitions and the evaluation data, ARQMath provides data from the Mathematics StackExchange including question and answer posts from 2010 to 2018. In total, the collection contains 1M questions and 1.4M answers. Furthermore, users may use mathematical formulas to clarify their posts. These formulas, written in LaTeX notation, were extracted and parsed into Symbol Layout Trees and Operator Trees. Apart from this corpus of posts and formulas that are available for training and evaluating models, the organizers of ARQMath also released a test set of queries.

4. Transformer-based Models

The Transformer architecture as introduced by [8] consists of Encoder and Decoder layers. Encoder models in Natural Language Processing typically stack several of these encoder layers, resulting in a model that reads a sequence of tokens and outputs contextualized embeddings for each token as well as for the entire input. These embeddings can then be further processed, for example for classification tasks.
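To illustrate how such contextualized embeddings are obtained in practice, the following minimal sketch uses the Huggingface transformers library that is employed throughout this work. The checkpoint name is only a generic placeholder and does not refer to one of the models described here.

import torch
from transformers import AutoModel, AutoTokenizer

# Generic encoder checkpoint as a stand-in; any BERT/ALBERT/RoBERTa model would work.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Let $S$ be a set in $X$."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state     # one contextualized vector per input token
cls_embedding = outputs.last_hidden_state[:, 0]  # vector of the <CLS> token, representing the entire input

The per-token vectors or the <CLS> vector can then be fed into task-specific classification layers, as described in the following subsections.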
In contrast, decoder models are designed to generate the next output token given a sequence of previous (context) tokens. Training both types of models usually comprises two phases: a pre-training phase and a fine-tuning phase. While pre-training consists of training the model on relatively simple self-supervised tasks on a large amount of data, fine-tuning does not necessarily need large annotated data sets. In the following sections we describe the pre-training of both model types, the differences between the concrete model instances we applied, and our fine-tuning for the respective tasks.

4.1. Encoder Models

Encoder models are trained to capture the meaning of natural language by self-supervised pre-training tasks. The most important and widely applied task is the Masked Language Model (MLM). The model is presented with the embeddings $E_i$ for each token $i$ of the input sentence:

$C\,U_1 U_2 \cdots U_N = \mathrm{BERT}(E_{CLS}\,E_1 E_2 \cdots E_N)$,

where $E_{CLS}$ and $C$ are the input and output embeddings of the $\langle CLS \rangle$ token. A classifier is then applied to predict the original word given the input:

$P(w_j \mid S) = \mathrm{softmax}(U_i \cdot W_{MLM} + b_{MLM})_j$,

where $w_j$ is the $j$-th word of the vocabulary. This determines the probability that the $i$-th input word was $w_j$ given the input sentence $S$. The weight matrix $W_{MLM}$ and its bias $b_{MLM}$ are only used for this pre-training task and are not reused afterwards. RoBERTa uses only the MLM task, while BERT and ALBERT also employ a second pre-training task on the same input data, a sequence classification task applied on top of the contextualized embedding of the $\langle CLS \rangle$ token:

$P(\mathit{label} = i \mid S) = \mathrm{softmax}(C \cdot W_{SOP} + b_{SOP})_i$,

where the matrix $W_{SOP}$ and the bias $b_{SOP}$ are again only used for pre-training and are not reused later. In practice, this task is used to learn coherence between the two input sequences given to the model. In this work, we pre-train using Sentence Order Prediction (SOP), where $\mathit{label} = 1$ denotes that the two input sequences are in the correct order, while $\mathit{label} = 0$ denotes that they were swapped.

Fine-Tuning. In order to predict whether an answer $A = A_1 A_2 \cdots A_M$ is relevant to a question $Q = Q_1 Q_2 \cdots Q_N$, a classifier is trained on top of the pre-trained Transformer-Encoder model. The input string $\langle CLS \rangle Q_1 Q_2 \cdots Q_N \langle SEP \rangle A_1 A_2 \cdots A_M$, with $\langle CLS \rangle$ being the classification token and $\langle SEP \rangle$ the separation token, is presented to the model:

$C\,U_1 U_2 \cdots U_{N+M} = \mathrm{LM}(E_{CLS}\,E_1 E_2 \cdots E_{N+M})$,

where $E_i$ and $E_{CLS}$ are the input embeddings for each input token and the $\langle CLS \rangle$ token, respectively, calculated as explained in the previous section. After the forward pass through the model, the output vector $C$ of the $\langle CLS \rangle$ token is fed into a classification layer:

$P(\mathit{label} = i \mid Q, A) = \mathrm{softmax}(C \cdot W_{MIR} + b_{MIR})_i$,

where label 1 stands for a matching or correct answer for the query and label 0 otherwise. During evaluation, the resulting probability of the classification layer for label 1 is used as the similarity score $s$ of the answer $A$ for the question $Q$ and is then used to rank all answers in the corpus: $s(Q, A) = P(\mathit{label} = 1 \mid Q, A)$.

4.2. Decoder Models

Decoder models are trained on the causal language modeling objective, i.e., given some input tokens, generate the most probable next token.
In other words, the objective is to maximize the probability over the corpus consisting of a sequence of $n$ tokens $C = \{t_0, t_1, \ldots, t_n\}$:

$P(C) = \prod_{i=0}^{n} P(t_i \mid t_{i-k}, \ldots, t_{i-1})$,

where the conditional probability $P(t_i \mid t_{i-k}, \ldots, t_{i-1})$, ranging over a context window of size $k$, is estimated by the Decoder model. The input is embedded and given to the Transformer-Decoder layers, resulting in the last layer's output $h_{\mathrm{last}}$, which is used to calculate the probability:

$P(u) = \mathrm{softmax}(h_{\mathrm{last}} \cdot W_e^T)$,

with $W_e$ being the embedding matrix. Decoder models such as GPT or GPT-2 are also fine-tuned in a supervised fashion to predict labels from an annotated corpus. Since we only use GPT-2 to generate tokens and not to predict labels, we omit the details of this fine-tuning here.

[Figure 1: Overview of our approach for Task 1 - Mathematical Answer Retrieval, showing the stages Base Model, Further Pre-Training, Fine-Tuning, and Evaluation with examples of training and evaluation data.]

Fine-Tuning. To generate answers given a question, we provide the model with the question tokens and prompt it to complete the text by filling in the answer. During training, the model is prompted using the following pattern: "PROBLEM: $Q_1 Q_2 \cdots Q_N$ SOLUTION:", where $Q_i$ are the question tokens. The model is then optimized to complete the prompt by generating the answer tokens $A_i$: "PROBLEM: $Q_1 Q_2 \cdots Q_N$ SOLUTION: $A_1 A_2 \cdots A_M$". During evaluation, the model is presented with the same pattern to generate an answer.

5. Contribution to Task 1

Task 1 deals with retrieving the most relevant answers from the MathSE corpus given 100 questions that were not seen during training. For this task, we pre-trained and fine-tuned several models using the base models BERT, ALBERT, and RoBERTa, applying different corpora for pre-training and fine-tuning on three different sets of question-answer pairs. An overview of our approach for Task 1 is depicted in Figure 1. In the following, we first describe the data we used, then our experiments including hyper-parameter settings, and finally present our results.

Pre-Training Data. Prior to pre-training, we applied the official tool provided by ARQMath to read the posts, wrapped formulas in $ and removed other HTML markup, yielding a list of paragraphs for each post. BERT and ALBERT models rely on data which is separated into sentences during pre-processing for the SOP task. We combined three different strategies: (1) split the text into sentences, (2) split the text into chunks of natural language and formulas, and (3) split mathematical equations at relation symbols (e.g., =, ∈) into parts.

[Figure 2: An example of tokenizing the LaTeX expression $\frac{\exp(x_i)}{\sum_j \exp(x_j)}$ with the original ALBERT-base tokenizer and with the ALBERT-base tokenizer extended by additional math tokens; important changes are highlighted.]

The SOP task is designed to work at sentence-level granularity to facilitate the modeling of inter-sentence coherence. Hence, strategy (1) is the one usually used in various NLP tasks. At the same time, our goal was to increase the model's understanding of formulas. Therefore, strategy (2) splits a paragraph first into sentences. These sentences are then further split at formulas (with more than three LaTeX tokens, to avoid splitting at, e.g., definitions of symbols).
In case the remaining text is too short (less than ten characters), it is concatenated to the preceding formula, separated by a $ sign. Strategy (3) only uses formula data without natural language. The three strategies will be denoted by MathSE (1), MathSE (2), and MathSE (3), respectively.

Apart from the MathSE corpus provided by the ARQMath Lab, we also pre-processed the Auxiliary Mathematics Problems and Solutions (AMPS) corpus containing questions and answers relating to mathematical problem solving [6]. Since this data was already split into chunks, we used it as the basis for the sentence order task of ALBERT and BERT. The data set consists of two parts: the Khan data set, containing 100,000 exercise questions and answers from the Khan Academy, and the Mathematica data set, containing 5M similar questions generated using Mathematica scripts. The questions of both data sets cover topics ranging from simple geometry to multivariate calculus. Both questions and answers use LaTeX to convey mathematical notation. We used this data only for pre-training, but not for fine-tuning, due to its structure.

Tokenizing, creating the pre-training data for each task, i.e., masking tokens and assembling pairs of sentences, and further pre-processing were performed using Huggingface's libraries transformers and datasets [27, 28]. For our models, we used the released sentencepiece vocabulary [29], but added 501 additional tokens to the tokenizer to cover LaTeX. The list of tokens was taken from the LaTeX parser of Approach0 (https://github.com/approach0/search-engine/blob/master/tex-parser/lexer.template.l); our list of additional tokens can be found at https://github.com/AnReu/ALBERT-for-Math-AR/blob/main/untrained_models/latex_tokens.txt. An example of the impact of the new tokenizer on the expression $\frac{\exp(x_i)}{\sum_j \exp(x_j)}$ can be seen in Figure 2. After adding the LaTeX tokens to the vocabulary, typical tokens like \sum or \frac are no longer torn apart into multiple tokens, but remain together. Input sequences whose length after tokenization exceeded the maximum number of input tokens were truncated to the maximum length of 512 tokens.

Fine-Tuning Data. In order to fine-tune our models, we paired each question with up to $N$ correct answers and the same number of incorrect answers. The up to $N$ correct answers were randomly chosen from the answers of the question. Each question in the corpus comes along with tags, i.e., categories indicating the topic of a question such as sequences-and-series or limits. As an incorrect answer for a question, we picked a random answer belonging to a question that shares at least one tag with the original question. This way, we chose up to $N$ incorrect answers independently from one another. This procedure yields 1.9M examples for $N = 1$ and 2.8M examples for $N = 10$, of which 90% were used as training data for the fine-tuning task. We presented the entire text of the questions and answers to the model, using the structure introduced in the previous section.

In addition, we pre-trained an ALBERT model on MathSE (1) and fine-tuned it on $N = 1$. We then let this model predict 1,000 answers for the 2021 test set. We evaluated the answers against the publicly available test set from last year and paired each correct answer with a randomly selected incorrect answer from the model's results. These question-answer pairs were used as an additional fine-tuning set, which we denote by Annotated.
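To make the construction of the fine-tuning pairs more concrete, the following sketch outlines the sampling procedure described above. The data layout (a mapping from question id to text, tags, and answer texts) is hypothetical and only serves illustration.

import random
from typing import Dict, List, Tuple

def build_pairs(questions: Dict[str, dict], n: int = 10, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Pair each question with up to n correct answers and as many incorrect ones.

    `questions` maps a (hypothetical) question id to a dict with the keys
    'text', 'tags' and 'answers'; incorrect answers are drawn from questions
    that share at least one tag with the original question.
    """
    rng = random.Random(seed)
    by_tag: Dict[str, List[str]] = {}
    for qid, q in questions.items():
        for tag in q["tags"]:
            by_tag.setdefault(tag, []).append(qid)

    examples: List[Tuple[str, str, int]] = []
    for qid, q in questions.items():
        for answer in rng.sample(q["answers"], min(n, len(q["answers"]))):
            examples.append((q["text"], answer, 1))   # correct answer, label 1
            tag = rng.choice(q["tags"])
            candidates = [c for c in by_tag[tag] if c != qid and questions[c]["answers"]]
            if not candidates:
                continue
            neg = rng.choice(questions[rng.choice(candidates)]["answers"])
            examples.append((q["text"], neg, 0))       # topically related but incorrect answer, label 0
    return examples

The resulting (question, answer, label) triples can then be tokenized as <CLS> question <SEP> answer inputs for the classifier described in Section 4.1.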
Evaluation Data. To evaluate the trained models, we paired each question of the ARQMath-1 to ARQMath-3 test sets with each of the answer posts from 2010 to 2018. The question-answer pairs are pre-processed in the same way as the fine-tuning data. Note that we do not apply pre-filtering or a first-stage ranking, as would usually be done for this kind of cross-encoder design. Instead, we rank the entire set of answers. This is possible because we are using a GPU with a greater memory size compared to our submission to ARQMath-2. For the longest queries, ranking the entire set of answers takes around 3 h.

5.1. Experimental Setup

In the previous sections, we have introduced several base models, pre-training data sets, and fine-tuning data sets, which lead to many possible combinations for MIR. A summary of our devised models can be found in Table 1. Our submission includes five models which were fine-tuned using the $N = 10$ fine-tuning data set; for these models, we added their official identifiers to the table. The other models are used as baselines and for comparison of our setup. The models math_10 and math_10_add were first pre-trained on MathSE (1), then on MathSE (2), and finally on MathSE (3). We refer to this pre-training as mathematical pre-training. Six other models did not incorporate these two additional data sets for pre-training but were only pre-trained using the first strategy, MathSE (1). One model was trained only on the Khan part of the AMPS data set, while two models used a mix of samples from Khan and MathSE. Here, both corpora were combined into a single data set and shuffled. We experimented with the same approach on the entire AMPS corpus and MathSE. For fine-tuning, the table denotes on which data set each model was trained. Two models were trained first on the $N = 10$ data; after this training was completed, a second fine-tuning was conducted using Annotated.

Official Identifier   Base Model   Pre-Training Data     Fine-Tuning Data
-                     BERT         MathSE (1)            N = 1
-                     RoBERTa      MathSE (1)            N = 1
roberta_10            RoBERTa      MathSE (1)            N = 10
-                     ALBERT       MathSE (1)            N = 1
base_10               ALBERT       MathSE (1)            N = 10
-                     ALBERT       MathSE (1)            N = 10 + Annotated
math_10               ALBERT       MathSE (1)-(3)        N = 10
math_10_add           ALBERT       MathSE (1)-(3)        N = 10 + Annotated
-                     ALBERT       Khan                  N = 1
-                     ALBERT       Khan + MathSE mixed   N = 1
Khan_SE_10            ALBERT       Khan + MathSE mixed   N = 10
-                     ALBERT       AMPS + MathSE mixed   N = 1
Table 1: Model Configurations for Task 1.

All twelve models were trained using eight A100 GPUs with 40 GB of GPU memory each. For pre-training, a batch size of 16 samples per GPU was used. We pre-trained the models for 13 epochs on MathSE (1) and 9 epochs on MathSE (2); MathSE (3) added a further 20 epochs. Fine-tuning on $N = 1$ and $N = 10$ used a batch size of 32 examples per device, 200 warm-up steps, and a learning rate of 2e-5. Fine-tuning on Annotated used the same hyperparameters, but a total batch size of 32. Pre-training and fine-tuning were performed using Huggingface's library transformers [27].

5.2. Evaluation

This section summarizes our results using the different setups. We start by presenting the overall results of the models submitted to the lab and then discuss the details of choosing the base model, the pre-training data, and the fine-tuning data.

5.2.1. Overall Results

The results of our runs submitted to Task 1 of the ARQMath Lab 2022 are presented in Tables 2 and 3. Regarding nDCG' and mAP', math_10, our model using mathematical pre-training, performs best in all three years.
The model which was fine-tuned using Annotated received the highest scores for p'@10, but its performance on the other two metrics degraded. Since it was fine-tuned on the ARQMath 2021 test set, its scores on this set are naturally much higher than those of the models which were not fine-tuned on this data. The other three models of our submission are on par with each other, even though they were trained on different data and with different base architectures. Nevertheless, our models for the submission to ARQMath-3 outperform even the best models from ARQMath-2 in all three metrics. In comparison to the other participants of ARQMath-3, our math_10_add received the highest p'@10 scores among all automatic runs. In the following, we analyze different aspects of the improvements of our submission.

                                     ARQMath 2020               ARQMath 2021
Official Identifier                  nDCG'   mAP'    p'@10      nDCG'    mAP'     p'@10
Submissions 2022
  math_10_add                        0.421   0.264   0.405      (0.566)  (0.445)  (0.589)
  math_10                            0.446   0.268   0.392      0.454    0.228    0.321
  roberta_10                         0.438   0.254   0.372      0.446    0.224    0.309
  Khan_SE_10                         0.437   0.254   0.357      0.437    0.214    0.309
  base_10                            0.438   0.252   0.369      0.434    0.209    0.299
ARQMath 2021 participants
  TU_DBS (2021) primary              0.380   0.198   0.316      0.377    0.158    0.227
  MathDowsers (2021) primary         0.433   0.191   0.249      0.434    0.169    0.211
  DPRL (2021) QASim                  0.417   0.234   0.369      0.388    0.147    0.193
Table 2: Results of Task 1 on the ARQMath 2020 and 2021 test sets. Scores in parentheses were obtained by a model that was partly fine-tuned on the annotated 2021 data and are therefore not directly comparable.

Official Identifier (Submissions 2022)   nDCG'   mAP'    p'@10
math_10_add                              0.379   0.149   0.278
math_10                                  0.436   0.158   0.263
roberta_10                               0.413   0.150   0.226
Khan_SE_10                               0.426   0.154   0.236
base_10                                  0.423   0.154   0.228
Table 3: Results of Task 1 on the ARQMath 2022 test set.

5.2.2. Base Model

We evaluated models trained on three base architectures: BERT, ALBERT, and RoBERTa. The results can be found in Table 4. Even though ALBERT and RoBERTa are considered to be advancements over BERT, their performance on our downstream task is not necessarily higher. RoBERTa receives the highest scores for nDCG' and p'@10, while BERT scores highest on mAP'. Overall, the improvements of the three architectures over each other are rather minimal. However, the training time should also be considered: pre-training ALBERT on MathSE (1) took 24 h, while BERT and RoBERTa needed on average 25% more time. For fine-tuning, ALBERT was the fastest with 8 h on the $N = 1$ data set; fine-tuning BERT and RoBERTa on the same data set took 11 h. Evaluation takes on average the same time for each of the three models, because the data is processed by the same number of layers: ALBERT's layer sharing is only beneficial during training, and BERT and RoBERTa share the same underlying architecture.

ARQMath Lab 2020
Base Model   nDCG'    mAP'     p'@10
BERT         0.4068   0.2411   0.3560
ALBERT       0.4122   0.2335   0.3587
RoBERTa      0.4157   0.2328   0.3676
Table 4: Comparison of results of BERT, ALBERT, and RoBERTa as base models.

5.2.3. Additional Pre-Training Data

Transformer-Encoder models are known to benefit from more pre-training data, which is why we evaluate the ALBERT model on four different data set configurations whose results are presented in Table 5. Interestingly, the model trained on a mixed data set consisting of data from the Khan Academy and MathSE scores best, receiving slightly better scores on nDCG' and mAP'. For p'@10, the model trained only on MathSE outperforms the other models, indicating that it is able to place relevant documents better within the top 10 documents, while the first model ranked relevant documents better in the long run.
The model trained only on data from the Khan Academy scored worst in this evaluation, demonstrating the shortcomings of out-of-domain data. A reason for this behavior could be that the questions from Khan are designed to serve as exercises. Therefore, each sentence is relevant for solving the question and does not contain any of the irrelevant information that question authors on MathSE may include (e.g., "Dear community, I have a question ..."). After training on Khan data only, it could be harder for the model to deal with such irrelevant information in questions.

ARQMath Lab 2020
Base Model   Dataset             nDCG'    mAP'     p'@10
ALBERT       MathSE (1)          0.4122   0.2335   0.3587
ALBERT       Khan                0.3716   0.1852   0.2947
ALBERT       Khan+MathSE (1)     0.4164   0.2356   0.3373
ALBERT       AMPS+MathSE (1)     0.4052   0.2256   0.3400
Table 5: Comparison of results for pre-training using different data sets.

5.2.4. Fine-Tuning Data

When comparing the amount of fine-tuning data needed for the answer retrieval task, we can see in Table 6 that more data is clearly beneficial. In both cases, for ALBERT and RoBERTa, we see an increase in all three metrics when fine-tuning on $N = 10$ instead of $N = 1$. With additional training on Annotated, only p'@10 increases, while the other two metrics deteriorate. This indicates that the model can differentiate better between relevant and non-relevant answers in the top 10, but fails to place other relevant documents at good positions further down the ranking. We also report the nDCG' scores for the three topic categories 'Both', 'Math', and 'Text', which indicate which of these parts is most crucial for answering the question. For example, a question in the category 'Text' would require understanding the written text of the question rather than the mathematical formulas.

ARQMath Lab 2020
Base Model   Fine-Tuning Data     nDCG'    mAP'     p'@10    Both     Math     Text
ALBERT       N = 1                0.4122   0.2335   0.3587   0.4202   0.4033   0.4140
ALBERT       N = 10               0.4377   0.2519   0.3693   0.4391   0.4437   0.4184
ALBERT       N = 10 + Annotated   0.3988   0.2435   0.3853   0.3837   0.4220   0.3789
RoBERTa      N = 1                0.4157   0.2328   0.3676   0.4175   0.4200   0.3999
RoBERTa      N = 10               0.4376   0.2543   0.3720   0.4293   0.4511   0.4246
Table 6: Comparison of results for fine-tuning using different data sets. The three right-most columns report nDCG' per topic category.

Models which were trained using the $N = 10$ data showed improvements in all three categories, but the 'Math' category benefited the most. An explanation for this observation could be that for $N = 1$ the model saw fewer irrelevant examples that shared the same notation (the same mathematical symbols) as the question, but used it in a different and therefore irrelevant way. In the $N = 10$ data set, it was more probable that an irrelevant example still used the same mathematical symbols. Therefore, using this data set, the model needed to learn the semantics of the usage of the symbols rather than their mere appearance. A similar observation should in principle be possible for fine-tuning on Annotated, but with this additional fine-tuning the model's performance on all three categories degrades.

6. Contribution to Task 3

Task 3 was first introduced to the ARQMath Lab in 2022 and has the goal of generating answers for a given question instead of retrieving them from a corpus. The questions are identical to the ones of Task 1, and all include at least one formula. In the following, we introduce our approach of generating answers using GPT-2 by fine-tuning it on two corpora. An overview of the approach is illustrated in Figure 3.
We start by describing the data we used, then our experimental setup, and finally we present our results.

6.1. Data

For fine-tuning GPT-2, we use the same two data sets as for Task 1: the MathSE data set as provided by the ARQMath Lab and the AMPS data set consisting of question-answer pairs from the Khan Academy and generated questions with step-by-step answers using Mathematica. In total, we fine-tuned our models on 1,445,487 question-answer pairs from MathSE, where for each question a single answer was chosen by chance. In addition, the AMPS data set contributes 627,795 question-answer pairs. For pre-processing, we used the tokenizer provided by the authors of AMPS, which is based on the original GPT-2 tokenizer but separates compounds of digits into single digits. We also experimented with the original GPT-2 tokenizer, but found the adapted one to perform better.

[Figure 3: Overview of our approach for Task 3 - Mathematical Answer Generation, showing the stages Base Model, Fine-Tuning, and Evaluation with examples of fine-tuning and evaluation data.]

6.2. Experimental Setup

A summary of our experiments can be found in Table 7. We experimented with fine-tuning the models on the two data sets for different numbers of epochs. One model was trained only on the MathSE data for three epochs, while another model was first fine-tuned on the AMPS data set for three epochs and afterwards for one epoch on MathSE. Smaller numbers of epochs were also tested but did not yield better results. For training on AMPS, we sampled the Mathematica part of the training data with a factor of 0.5 and the Khan part with a factor of 5, following the procedure of [6]. For decoding, we varied the length penalty between 1 and 2 and applied beam search with beam sizes of 5, 10, and 20. We also experimented with top-k sampling. Apart from these modifications, we followed the training and evaluation recommendations reported by [6]. Because the length of the generated answers exceeded the maximum length allowed for the submission to the ARQMath Lab 2022 in several cases, we tried to force the model to generate shorter, but still relevant, answers by adding the word 'HINT' to the beginning of the solution during decoding; the training data was not altered. These combinations led to three experiments in total. A fourth, combined run includes the shortest generated answer of each of the three runs and is denoted by 'shortest'. For all experiments for Task 3, we adapted the code by [6], which is based on Huggingface Transformers [27], to our data set.

Official Identifier                 Training Setup              Beam Size   Hints   Length Penalty   Sampling
amps3_se1_hints                     3 ep. AMPS + 1 ep. MathSE   5           True    1                False
se3_len_pen_10                      3 ep. MathSE               10           False   2                False
amps3_se1_len_pen_20_sample_hint    3 ep. AMPS + 1 ep. MathSE   20          True    2                True
shortest                            *                           *           *       *                *
Table 7: Model Configurations for Task 3; ep. denotes the number of epochs.

Official Identifier                 AR      p@1
amps3_se1_hints                     0.325   0.078
se3_len_pen_10                      0.244   0.064
amps3_se1_len_pen_20_sample_hint    0.231   0.051
shortest                            0.205   0.026
Table 8: Results of Task 3.

6.3. Results

Table 8 displays the results of our efforts for Task 3 of ARQMath-3. Our best model was trained on the AMPS data and afterwards on MathSE. During decoding, we use the word 'HINT' to force the model to generate shorter answers.
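To make the decoding configuration of Section 6.2 concrete, the following is a minimal sketch of how such a run can be set up with Huggingface Transformers. The checkpoint path and the exact prompt wording are placeholders, since they are not fixed by the description above.

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder path for a GPT-2 checkpoint fine-tuned as described above.
model_dir = "./gpt2-mathse-finetuned"
tokenizer = GPT2TokenizerFast.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir)

question = "Find $x$ such that $2^x = 8$."
# 'HINT:' is appended after the SOLUTION marker to nudge the model towards shorter answers.
prompt = f"PROBLEM: {question} SOLUTION: HINT:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,          # beam sizes of 5, 10, and 20 were tried
    length_penalty=1.0,   # varied between 1 and 2
    do_sample=False,      # set to True (e.g., with top_k=50) for the sampling variant
    max_new_tokens=256,   # illustrative limit on the answer length
    pad_token_id=tokenizer.eos_token_id,
)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)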
Surprisingly, this best run (amps3_se1_hints) scored better in both metrics than the one that was trained in the same way but used a higher length penalty, sampling, and a beam size of 20. The model which was trained on MathSE only ranks in second place for both metrics. The lowest scores were received by the run that consists of the shortest answers of our models for each topic. This indicates that shorter answers may be insufficient to convey enough relevant information in the post. Since an automatic evaluation of answer generation is challenging, we will not analyze the impact of different aspects of our submission but instead report some results in the context of a qualitative evaluation. Below, we present two examples of questions from the test set with their answers generated by our primary submission. In general, the model is able to pick up the topic of the questions and generate meaningful, syntactically correct answers in most cases. However, whether the answers are relevant for solving the questions needs further evaluation. For example, the generated answer for Question 322 is simply a copy of a part of the question, which does not seem to provide any additional information. This behavior of copied information can also be seen in other answers. In particular, when the question author added an edit in which they already provided a correct answer, the generated answer would simply copy this solution without comment. A similar issue can be seen in the generated solution for Question 340: the model only rearranges the question in several steps but does not comment on the process. Whether and why this equation holds cannot directly be inferred from the model's answer. More examples of model outputs can be seen in the Appendix.

Post ID 322:
Title: How do I calculate the sum of sum of triangular numbers?
Question: As we know, triangular numbers are a sequence defined by $\frac{n(n+1)}{2}$. And its first few terms are $1, 3, 6, 10, 15, \ldots$. Now I want to calculate the sum of the sum of triangular numbers. Let's define $a_n = \frac{n(n+1)}{2}$, $b_n = \sum_{x=1}^{n} a_x$, and $c_n = \sum_{x=1}^{n} b_x$. And I want an explicit formula for $c_n$. After some research, I found the explicit formula $b_n = \frac{n(n+1)(n+2)}{6}$. Seeing the patterns from $a_n$ and $b_n$, I figured the explicit formula for $c_n$ would be $\frac{n(n+1)(n+2)(n+3)}{24}$ or $\frac{n(n+1)(n+2)(n+3)}{12}$. Then I tried to plug in those two potential equations. If $n = 1$, $c_n = 1$, $\frac{n(n+1)(n+2)(n+3)}{24} = 1$, $\frac{n(n+1)(n+2)(n+3)}{12} = 2$. Thus we can know for sure that the second equation is wrong. If $n = 2$, $c_n = 1 + 4 = 5$, $\frac{n(n+1)(n+2)(n+3)}{24} = 5$. Seems correct so far. If $n = 3$, $c_n = 1 + 4 + 10 = 15$, $\frac{n(n+1)(n+2)(n+3)}{24} = \frac{360}{24} = 15$. Overall, from the terms that I tried, the formula above seems to have worked. However, I cannot prove, or explain, why that is. Can someone prove (or disprove) my result above?
Generated Solution: HINT: $\frac{n(n+1)(n+2)(n+3)}{24} = \frac{360}{24} = 15$

Post ID 340:
Title: I have the following problem: Let $|x_{n+1} - x_n| < 1/3^n$. Show that $(x_n)$ is a Cauchy sequence.
Question: We have that $(x_n)$ is a sequence of real numbers. And the relation in the title: $|x_{n+1} - x_n| < \frac{1}{3^n}$. We must prove that this is a Cauchy sequence. I know that a Cauchy sequence follows the definition: given $\epsilon > 0$, there exists $n_0 > 0$ such that $m, n > n_0 \Rightarrow |x_m - x_n| < \epsilon$. But I don't know how to use both informations to prove the exercise. If someone please may help me, I'd be very thankful.
Generated Solution: $|x_{n+1} - x_n| = \left|\frac{x_{n+1} - x_n}{3^{n+1}}\right| = \frac{|x_{n+1} - x_n|}{3^{n+1}} < \frac{1}{3^{n+1}}$.
7. Conclusion

In our contribution to this lab, we demonstrated the use of Transformer-Encoders and Transformer-Decoders for two mathematical question-answering tasks. We pre-trained and fine-tuned twelve models based on BERT, RoBERTa, and ALBERT for the retrieval of answers given a mathematical question for the ARQMath Lab 2022. Our results show significant improvements compared to the 2021 edition of this lab, which can be attributed to better pre-training and the enlarged fine-tuning data. In addition, we analyzed several pre-training data sets and found that mixing in the Khan data set yielded slight improvements in two out of three metrics. Finally, we improved our p'@10 results even further by additionally fine-tuning on annotated test data from ARQMath-2. For Task 3, a GPT-2 model was fine-tuned on two data sets. The results for this task are not yet published, but first analyses showed that the model is able to capture the topic of a question and can generate syntactically correct answers. Controlling the length of the generated answers is still an issue, which should be addressed in future research.

Acknowledgments

This work was supported by the DFG under Germany's Excellence Strategy, Grant No. EXC-2068-390729961, Cluster of Excellence "Physics of Life" of TU Dresden. Furthermore, the authors are grateful for the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden. We would also like to thank the reviewers for their helpful comments and recommendations.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[2] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[3] H. Jo, D. Kang, A. Head, M. A. Hearst, Modeling mathematical notation semantics in academic papers, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 3102–3115.
[4] A. Reusch, M. Thiele, W. Lehner, TU_DBS in the ARQMath lab 2021, CLEF, in: CEUR Workshop Proceedings, 2021. http://ceur-ws.org/Vol-2936/paper-07.pdf.
[5] W. Zhong, J.-H. Yang, J. Lin, Evaluating token-level and passage-level dense retrieval models for math information retrieval, arXiv preprint arXiv:2203.11163 (2022).
[6] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, J. Steinhardt, Measuring mathematical problem solving with the MATH dataset, arXiv preprint arXiv:2103.03874 (2021).
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017) 5998–6008.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3606–3611.
[11] E. Alsentzer, J. R. Murphy, W. Boag, W.-H.
Weng, D. Jin, T. Naumann, W. Redmond, M. B. McDermott, Publicly available clinical BERT embeddings, NAACL HLT 2019 (2019) 72.
[12] K. Huang, J. Altosaar, R. Ranganath, ClinicalBERT: Modeling clinical notes and predicting hospital readmission, arXiv preprint arXiv:1904.05342 (2019).
[13] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., CodeBERT: A pre-trained model for programming and natural languages, arXiv preprint arXiv:2002.08155 (2020).
[14] A. Kanade, P. Maniatis, G. Balakrishnan, K. Shi, Learning and evaluating contextual embedding of source code, in: International Conference on Machine Learning, PMLR, 2020, pp. 5110–5121.
[15] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training (2018). https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf.
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[17] S. Peng, K. Yuan, L. Gao, Z. Tang, MathBERT: A pre-trained model for mathematical formula understanding, arXiv preprint arXiv:2105.00377 (2021).
[18] S. Rohatgi, J. Wu, C. L. Giles, PSU at CLEF-2020 ARQMath track: Unsupervised re-ranking using pretraining, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020. http://ceur-ws.org/Vol-2696/paper_121.pdf.
[19] V. Novotný, P. Sojka, M. Štefánik, D. Lupták, Three is better than one, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020. http://ceur-ws.org/Vol-2696/paper_235.pdf.
[20] S. Rohatgi, J. Wu, C. L. Giles, Ranked list fusion and re-ranking with pre-trained transformers for ARQMath lab (2021). http://ceur-ws.org/Vol-2936/paper-08.pdf.
[21] V. Novotný, M. Štefánik, D. Lupták, M. Geletka, P. Zelina, P. Sojka, Ensembling ten math information retrieval systems (2021). http://ceur-ws.org/Vol-2936/paper-06.pdf.
[22] P. Dadure, P. Pakray, S. Bandyopadhyay, BERT-based embedding model for formula retrieval, CLEF, 2021. http://ceur-ws.org/Vol-2936/paper-03.pdf.
[23] B. Mansouri, D. W. Oard, R. Zanibbi, DPRL systems in the CLEF 2021 ARQMath lab: Sentence-BERT for answer retrieval, learning-to-rank for formula retrieval (2021). http://ceur-ws.org/Vol-2936/paper-04.pdf.
[24] S. Polu, I. Sutskever, Generative language modeling for automated theorem proving, arXiv preprint arXiv:2009.03393 (2020).
[25] B. Mansouri, V. Novotný, A. Agarwal, D. W. Oard, R. Zanibbi, Overview of ARQMath-3 (2022): Third CLEF lab on Answer Retrieval for Questions on Math (Working Notes Version), in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, 2022.
[26] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topic, K. Davila, NTCIR-12 MathIR task overview, in: NTCIR, 2016. https://www.cs.rit.edu/~rlaz/files/ntcir12-mathir.pdf.
[27] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[28] Q.
Lhoest, A. Villanova del Moral, P. von Platen, T. Wolf, M. Šaško, Y. Jernite, A. Thakur, L. Tunstall, S. Patil, M. Drame, J. Chaumond, J. Plu, J. Davison, S. Brandeis, V. Sanh, T. Le Scao, K. Canwen Xu, N. Patry, S. Liu, A. McMillan-Major, P. Schmid, S. Gugger, N. Raw, S. Lesage, A. Lozhkov, M. Carrigan, T. Matussière, L. von Werra, L. Debut, S. Bekman, C. Delangue, Datasets: A Community Library for Natural Language Processing, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, 2021, pp. 175–184. URL: https://aclanthology.org/2021.emnlp-demo.21.
[29] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).

A. Sample Results - Task 3

Generated Solution: HINT: If $p \equiv 2 \pmod{3}$ and $a \in \mathbb{F}_p$, then $\gcd(3, p-1) = 1$. If $p \equiv 2 \pmod{3}$ and $a \in \mathbb{F}_p$, then $\gcd(3, p-1) = 1$. If $p \equiv 2 \pmod{3}$ and $a \in \mathbb{F}_p$, then $\gcd(3, p-1) = 1$. (The same sentence is repeated verbatim until the generation is cut off.)

Generated Solution: HINT: If $\alpha$ is an infinite ordinal, then there exists an injection from $\alpha^+$ to $\alpha$.

Generated Solution: If $n$ is not a perfect square, then $\sqrt{n}$ is irrational.

Generated Solution: HINT: If $f$ is Riemann-integrable, then $\int_a^x f(t)\,dt = \int_a^x f(t)\,dt - \int_a^x f(t)\,dt = \int_a^x f(t)\,dt + \int_a^x f(t)\,dt - \int_a^x f(t)\,dt = \int_a^x f(t)\,dt + \int_a^x f(t)\,dt - \int_a^x f(t)\,dt = \ldots$ (The same expression is repeated verbatim until the generation is cut off.)