Mathematical Word Problem Solution Evaluation via Data Preprocessing Approach

Andrii D. Nikolaiev and Anatoliy V. Anisimov
Taras Shevchenko National University of Kyiv, 4d, Glushkova str., Kyiv, 8300, Ukraine

Abstract
This article gives an overview of methods for processing mathematical text problems and of the corresponding domain datasets. A new approach to evaluating MWP solutions that can use more context around the problem is proposed. We describe how the method can be applied to MWP similarity and to the automated grading of mathematical solutions.

Keywords
Math word problem, natural language processing, machine learning, artificial intelligence

Information Technology and Implementation (IT&I-2021), December 01-03, 2021, Kyiv, Ukraine
EMAIL: nikolaev@knu.ua (A. Nikolaiev), a.v.anisimov@knu.ua (A. Anisimov)
ORCID: 0000-0002-1467-2006 (A. Anisimov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Personalized learning has always been in high demand, but quality learning requires an expert: a person, or a system for self-control. There are several ways to estimate the level of knowledge; the most common is multiple-choice testing. Modern mathematical platforms are limited in their ability to accept tasks and are mostly capable of accepting only answers to MCQ-type (Multiple Choice Question) items, where the student merely chooses between correct and incorrect answers among the proposed options. Although this approach makes it possible to compare student results objectively, it is not adequate for estimating comprehension of the subject. Concept inventories are more indicative.

Unfortunately, there are no mature automatic solvers and evaluators for mathematical word problems. This strongly limits the possibilities for learning new mathematical concepts and leads to the loss of detailed feedback during the learning process. Therefore, it is important to be able to accept detailed solutions to mathematical problems automatically. The problem is particularly challenging because a wide semantic gap remains between human-readable words and machine-understandable logic. It is also hard to provide relevant feedback about a given solution. That is why we need to look into some well-known NLP problems connected with the interpretation of mathematical text. One of them is the design of an automatic solver for mathematical word problems (MWP), an effort that started back in the 1960s and has become more popular in recent years due to the increased scientific interest in artificial intelligence (AI).

Formalization systems can also be applied to solving and understanding mathematical problems (e.g., the proof assistants LEAN [1] and SAD [2]). A proof assistant is a piece of software that provides a language for defining objects, specifying properties of these objects, and proving that these specifications hold. The system checks that these proofs are correct down to their logical foundation. Such tools are often used to verify the correctness of programs, but they can also be used for abstract mathematics. In a formalization, all definitions are precisely specified and all proofs are virtually guaranteed to be correct.
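As a minimal illustration of this workflow (our own toy example, not taken from [1]), the following Lean 4 snippet states two facts about natural numbers; the system accepts the file only if the proofs check down to their logical foundation.

```lean
-- Toy illustration: a concrete identity proved by computation.
theorem two_add_two : 2 + 2 = 4 := rfl

-- A general statement about natural numbers, discharged by a core library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

In the same spirit, a fully formalized olympiad problem would be stated as a theorem whose proof must be supplied and machine-checked.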
The International Mathematical Olympiad Grand Challenge was announced in 2020 [3, 4]. The challenge is to create an AI that can win a gold medal at the IMO. To remove ambiguity in the scoring rules, teams were offered the formal-to-formal (F2F) variant of the IMO: the AI receives a formal representation of the problem (in the LEAN theorem prover) and is required to emit a formal (i.e., machine-checkable) proof. However, this route is less practical because of the complexity of describing the problem statement and checking the correctness of a given solution in such systems.

The remainder of the paper is organized as follows. Section 2 reviews AWP solvers; Section 3 states the problem and outlines the main steps of the method implementation. Since datasets deserve special attention, Section 4 summarizes them together with the associated ideas for dataset extension. Section 5 describes the general algorithm of the proposed method, as well as related ideas about math problem similarity and variants of the method's usage in real-world situations. The conclusion and the list of references are given in the final sections.

2. Related work

The MWP is the common type of math problem used for building solvers. There are several categories of MWP: Arithmetic Word Problems (AWP), Equation Set problems (ESS), and Geometric Word Problems (GWP). We take a closer look at AWP-type problems, where the task is to compose a corresponding equation from the given textual condition and to obtain the answer to the problem. The problem is represented by a sequence of words ⟨w_0, w_1, …, w_k⟩, some of which are quantities q_0, q_1, …, q_n mentioned in the text, together with the unknown variable x. The main goal is to transform the text problem into the corresponding equation E, which in this setting is linear and uses only the basic arithmetic operations {+, −, ×, ÷}.

Table 1
An example of an arithmetic word problem

MATH WORD PROBLEM: Misha found 221 seashells and 35 starfish on the beach. He gave 101 of the seashells to Katya. How many seashells does Misha now have?
OUTPUT EQUATION: x = 221 + 35 – 101
SOLVER OUTPUT: 155

The different approaches that have been invested in solving math word problems can be grouped into four main categories: (1) statistics-based, (2) tree-based, (3) deep learning-based, and (4) rule-based methods; the latter were used in the early approaches and are skipped here. Some approaches use a combination of the above categories.

2.1 Statistic-based Approach

The statistic-based approach tries to identify the question entities, their values and the desired operators that need to be evaluated in order to reach the correct answer. The identification is performed with common machine learning methods. Hosseini et al. (2014) suggested solving arithmetic word problems that involve addition and subtraction operations [5]. The problem text is split into parts, each representing a transition between two world states. The quantities of the entities are updated or observed at each transition, and the predicted solution is inferred by changing the world states until the end of the transitions is reached. Mitra and Baral (2016) used supervised learning to specify the formula that should be applied in order to generate the appropriate equation and the relevant variables [6]. Liang et al. presented similar work using log-linear models with handcrafted feature engineering [7].

2.2 The Tree-based Approach

The tree-based approach focuses on representing the problem in a hierarchical manner. The hierarchy is represented by a dedicated tree called a binary expression tree. An expression tree can be evaluated by applying the operator at the root to the values obtained by recursively evaluating the left and right subtrees.
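For illustration, the sketch below (our own minimal example; the Node and evaluate names are assumptions, not code from the cited systems) shows a binary expression tree and its recursive evaluation for the problem in Table 1.

```python
# A minimal sketch of a binary expression tree and its recursive evaluation,
# as used conceptually by tree-based MWP solvers. Names are illustrative only.
from dataclasses import dataclass
from typing import Optional, Union
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

@dataclass
class Node:
    value: Union[str, float]           # an operator symbol or a quantity from the text
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def evaluate(node: Node) -> float:
    """Apply the operator at the root to the recursively evaluated subtrees."""
    if node.left is None and node.right is None:
        return float(node.value)       # leaf: a quantity q_i mentioned in the problem
    return OPS[node.value](evaluate(node.left), evaluate(node.right))

# Correct tree for the Table 1 problem: x = 221 - 101; a good solver must also
# learn to discard the irrelevant quantity 35.
print(evaluate(Node("-", Node(221), Node(101))))   # 120.0
```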
Koncel-Kedziorski et al. (2015) suggested a system named ALGES, which, given a math word problem whose answer is a single equation, searches the space of all valid expression trees [8]. It uses integer linear programming and a maximum-likelihood estimator. Roy and Roth (2016) presented the Expression Tree method [9]. It decomposes the math problem into multiple classification problems and then composes a monotone expression tree; this defines a collection of simple prediction problems, each determining the lowest common ancestor operation between a pair of quantities mentioned in the problem. Roy and Roth (2017) defined a new structure called the Unit Dependency Graph (UNITDEP), an annotated graph with a vertex for each quantity appearing in the problem and edges representing relationships between pairs of quantities. The graph is annotated by classifiers for node labeling and for edge properties [10].

2.3 Deep Learning-based Approach

In the deep learning family, MathDQN by Wang et al. (2018) is a customized form of the general deep reinforcement learning framework. The authors defined actions, states and a reward function, and used a feed-forward network [11]. The Seq2Seq model proposed by Wang et al. (2017) includes an encoder (a GRU unit) and a decoder (an LSTM unit) that produces an equation template [12]. T-RNN (Wang et al., 2019) also extends the Seq2Seq model. It encodes the quantities using a Bi-LSTM and a self-attention network, and uses an RNN to construct a tree-structured template with inferred numbers as the leaf nodes and unknown operators as the inner nodes. The tree-structured template is designed to reduce the size of the template space [13]. Amini et al. (2019) suggested the Sequence2Program model for multiple-choice math word problems. It uses an encoder-decoder neural network that maps word problems to a set of feasible operation programs. The results of the executed operation program are matched against the multiple-choice options given for a particular problem [14].

Figure 1: Main categories of word problem solvers.

3. Problem statement

The aim of this work is to propose an algorithm that is able to identify the significant points of a given solution to a mathematical text problem and provide a verdict. As input, the algorithm takes a sample solution to a given problem statement. As output, it returns a solution score (0 or 1) and highlights the parts of the solution that led to the decision (keywords or crucial points for qualitative assessment).

3.1 Method implementation main steps

In order to implement the method, it is important to complete all the steps below:
1. Algorithm. Machine learning methods can be useful for the problem. The approach uses n-gram extraction as the main source of comparison between solutions, which can be inaccurate; at the same time, using phrase embeddings to resolve cases such as paraphrases is promising (a sketch of this step is given after the list).
2. Dataset. One of the most important steps in the method implementation is to create a dataset of the Problem-Solution-Result type. It would help to understand how solutions are evaluated and support future work.
3. Evaluation metrics. Evaluation metrics have to be developed for the given task. This is a vital part, since the task is non-standard and it is not obvious how to compare different systems: the same equation can be formulated in different ways. While the F1 score is adequate for standard MWPs, there is no comparable metric for assessing detailed solutions.
4. Methods leaderboard. A leaderboard is convenient and presentable for comparison with other methods in future studies. Adding the scores and results of the described methods and their implementation versions should help in evaluating and comparing the different methods provided.
5. Module implementation. The module needs to be connected to a database so that it can upload a new solution and evaluate it against the known verdicts. The module could also bring useful insights into what an "ideal" solution looks like. Therefore, the module itself could be used as a generator of problem solutions and for finding similarities between problem statements.
6. Experiments and discussion. After the module is implemented, the proposed method needs to be tested, evaluated and compared with well-known NLP approaches. The module could also be tested in real-world situations, since it could be used in the education field for automating the corresponding math classes.
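The sketch below is a minimal illustration of the n-gram comparison mentioned in step 1, under the assumption of plain whitespace tokenization; in the paraphrase-robust variant, phrase embeddings would replace raw n-gram overlap.

```python
# A minimal sketch of n-gram extraction and overlap between two solutions
# (step 1). Whitespace tokenization and the bigram order are assumptions.
from collections import Counter

def ngrams(text: str, n: int = 2) -> Counter:
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def solution_overlap(solution: str, reference: str, n: int = 2) -> float:
    """Share of the reference solution's n-grams that also occur in the solution."""
    s, r = ngrams(solution, n), ngrams(reference, n)
    return sum((s & r).values()) / max(sum(r.values()), 1)

reference = "we count permutations of AAABB so the answer is 5! / (3! 2!)"
candidate = "the answer is 5! / (3! 2!) because we count permutations of AAABB"
print(round(solution_overlap(candidate, reference), 2))
```

Exact n-gram overlap fails on paraphrases (e.g. "arrangements" vs. "permutations"), which is exactly where phrase embeddings are expected to help.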
In this paper, we provide a first version of the algorithm based on machine learning methods, an overview of datasets, and some basic ideas of how the proposed algorithm can be used for MWP similarity and in real-world situations.

4. Datasets

To provide a solution for a given type of problem, we can obtain the semantic part via NLP models. There is a number of datasets used for providing solutions to MWPs. Before 2016, the commonly used datasets for experiments were [15, 16]:
• Alg514 (Kushman et al., 2014): crawled from algebra.com, a crowd-sourced tutoring website; it contains only 514 linear algebra problems with 28 equation templates.
• AI2 (Hosseini et al., 2014): 395 single-step or multi-step arithmetic word problems for third, fourth, and fifth graders. It involves problems that can be solved with only addition and subtraction. The dataset is harvested from two websites: math-aids.com and ixl.com.
• Dolphin1878 (Shi et al., 2015): 1,878 number word problems with 1,183 equation templates, obtained from algebra.com and Yahoo! Answers.
• DRAW (Upadhyay and Chang, 2016): 1,000 algebra word problems from algebra.com, each annotated with linear equations.
• SingleEQ (Koncel-Kedziorski et al., 2015): contains both single-step and multi-step arithmetic problems and is a mixture of problems from a number of sources, including math-aids.com, k5learning.com, ixl.com and a subset of the AI2 data. Each problem involves the operators of multiplication, division, subtraction, and addition over non-negative rational numbers.
• Dolphin18K: contains over 18,000 annotated math word problems. It is constructed by semi-automatically extracting problems, equation systems and answers from community question-answering (CQA) web pages; the source data are the (question, answer, text) pairs in the math category of Yahoo! Answers.
• MAWPS: another testbed for arithmetic word problems with one unknown variable in the question. Its objective is to compile a dataset of varying complexity from different websites. Operationally, it combines the published word problem datasets used in AI2 and several others; the harvested dataset contains 2,373 questions [17].
• Math23K: in 2017, Wang et al. presented the Math23K dataset, which contains Chinese math word problems for elementary school students, crawled from multiple online education websites [12].

The latest datasets are:
• Ape210K (2020): 210K Chinese elementary-school-level math problems [18].
• SVAMP: in 2021, Patel et al. applied Graph2Tree with RoBERTa, GTS with RoBERTa, LSTM Seq2Seq with RoBERTa, and Transformer with RoBERTa to the presented SVAMP dataset, which contains 1,000 problems that require composing elementary-school-level equations, created by applying carefully chosen variations to examples sampled from existing datasets [19].
• MATH and AMPS: in 2021, Hendrycks et al. evaluated a GPT-3 model on the MATH dataset proposed in that work, which consists of 12,500 problems from high school math competitions and also contains a solution for each problem [20], and introduced the AMPS pretraining corpus, which consists of Khan Academy and Mathematica data. AMPS has over 100,000 Khan Academy problems with step-by-step solutions in LaTeX. It also contains over 5 million problems generated using Mathematica scripts, based on 100 hand-designed modules covering topics such as conic sections, div grad and curl, KL divergence, eigenvalues, polyhedra, and Diophantine equations. In total, AMPS contains 23 GB of problems and solutions.

Yet none of these datasets contains one or more alternative solutions for every problem sample. Therefore, in order to run experiments for the future method, it is essential to create a Problem-Solution-Result dataset (a possible record layout is sketched at the end of this section). Fortunately, the first such corpora (such as AMPS) already exist, and even larger corpora can be expected: for instance, it was mentioned in 2015 that Baidu had collected over 950 million questions and solutions in its database for K-12 education [16].

Additionally, for our dataset we could use information from well-known sites and databases where problems are stored together with the solutions that refer to them, e.g.:
• fully solved shortlists of recent IMO competitions, which are available on the official IMO website, while the solutions of all shortlisted problems from 1959-2009 are available in The IMO Compendium [21];
• anonymized data exported from messenger chat histories that are used to process problem solutions in the corresponding mathematical classes, as more and more schools have switched to an online format.
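As an illustration of what a Problem-Solution-Result record could look like (the field names and JSON layout below are our assumptions, not an existing corpus format), consider the following sketch.

```python
# A minimal sketch of one Problem-Solution-Result record: a problem statement
# with several graded alternative solutions. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List
import json

@dataclass
class GradedSolution:
    text: str      # full textual solution, possibly containing LaTeX
    result: int    # verdict r_i: 1 = correct, 0 = incorrect

@dataclass
class ProblemRecord:
    problem_id: str
    statement: str
    answer: str
    solutions: List[GradedSolution] = field(default_factory=list)

record = ProblemRecord(
    problem_id="demo-0001",
    statement="Misha found 221 seashells and 35 starfish on the beach. He gave 101 "
              "of the seashells to Katya. How many seashells does Misha now have?",
    answer="120",
    solutions=[
        GradedSolution("221 - 101 = 120, so Misha has 120 seashells.", 1),
        GradedSolution("221 + 35 - 101 = 155 seashells.", 0),
    ],
)
print(json.dumps(record, default=lambda o: o.__dict__, indent=2))
```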
5. Solution Preprocessing Approach

5.1 Solutions Context for Semantic Extraction

It is difficult to understand the semantics behind the problem statement alone. So, in order to simplify the task, we can use more context around the problem. As context we can use other solutions (both incorrect and correct ones) together with their evaluation scores.

5.2 The Algorithm for Solution Evaluation

For a given text problem W_0 we have M solutions S = ⟨S_1, S_2, …, S_M⟩; each solution can be represented by its own sequence of words w = ⟨w_0, w_1, …, w_k⟩, some of which are quantities q_0, q_1, …, q_n. Each solution S_i corresponds to a binary result r_i: 0 refers to an incorrect solution and 1 to a correct one. We use one of the mentioned models to extract a vector of keywords (or phrases; together they form the main features of the given problem W_0) k = ⟨k_0, k_1, …, k_l⟩ from all the solution data. We train our model on the correlation between k and r. Every keyword can be positive or negative depending on its correlation with r. Then for each keyword k_i we obtain an F1-score k_i^{F1}, measuring how often the presence of that keyword led to a correct evaluation of a solution, and a sign factor k_i^{sign}, which is 1 for a positive keyword and −1 for a negative one. The final vector k, sorted by key = F1(k_i), together with W, forms the set of parameters of our evaluator eval_W = ⟨W, k⟩.

Figure 2: The approach that uses the solutions context to evaluate a given solution (in Ukrainian).

Now, given an example solution S' = ⟨w_0, w_1, …, w_k⟩, we check it for the presence of every keyword from k and obtain a vector p, where p_i is a binary indicator of the presence of keyword k_i in the given solution: 0 if S' does not include k_i, and 1 otherwise. Finally, we calculate a weighted sum over p, where the weights are the corresponding F1-scores, and produce a verdict r_{S'} via some activation function (for instance, the binary step function H(x), since the verdict can only come from the set {0, 1}). Note that it is important to penalize more heavily solutions that contain negative keywords with high F1-scores, since a correct solution should not include any mistakes. On a given problem set we can now generate an evaluator eval_W = ⟨W, k⟩ for each problem.
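The sketch below is a minimal illustration of this evaluator, under several assumptions that the description leaves open: keywords are taken as unigrams and bigrams from a bag-of-words vectorizer, the sign of a keyword is decided by comparing its F1 against the correct and incorrect verdicts, and the step activation uses an ad hoc threshold.

```python
# A minimal sketch of eval_W = <W, k> from Section 5.2. The keyword extractor
# (CountVectorizer n-grams), the sign rule and the threshold are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score

@dataclass
class Evaluator:
    keywords: List[str]
    f1: np.ndarray          # k_i^{F1}: per-keyword F1 against the known verdicts r
    sign: np.ndarray        # k_i^{sign}: +1 for positive keywords, -1 for negative ones
    threshold: float = 0.5  # assumed cut-off for the binary step activation

    def score(self, solution: str) -> int:
        # p_i: binary presence of keyword k_i in the new solution S'
        p = np.array([1 if kw in solution.lower() else 0 for kw in self.keywords])
        weighted = float(np.dot(self.sign * self.f1, p))
        return int(weighted >= self.threshold)   # verdict r_S' via a step function H

def fit_evaluator(solutions: List[str], results: List[int], top: int = 20) -> Evaluator:
    """Build the evaluator from previously graded solutions of one problem W."""
    r = np.array(results)
    vec = CountVectorizer(ngram_range=(1, 2), binary=True)
    presence = vec.fit_transform(solutions).toarray()
    keywords = vec.get_feature_names_out()
    # F1 of each keyword as a predictor of a correct (f1_pos) or incorrect (f1_neg) verdict
    f1_pos = np.array([f1_score(r, presence[:, j]) for j in range(presence.shape[1])])
    f1_neg = np.array([f1_score(1 - r, presence[:, j]) for j in range(presence.shape[1])])
    sign = np.where(f1_pos >= f1_neg, 1, -1)
    f1 = np.maximum(f1_pos, f1_neg)
    order = np.argsort(-f1)[:top]                # keep the keywords sorted by F1
    return Evaluator(list(keywords[order]), f1[order], sign[order])
```

With such an object, grading a new solution is a single call, e.g. fit_evaluator(graded_solutions, verdicts).score(new_solution).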
5.3 MWP Similarity

Table 2
Math word problems similarity

MATH WORD PROBLEM A: Masha had 5 fruits: 3 apples and 2 bananas. How many different ways are there to eat one fruit per day during 5 days?
OUTPUT SOLUTION A: Let's encode the choices: "A" for apples and "B" for bananas, so "AAABB" stands for the case when Masha eats 3 apples during the first 3 days and then 2 bananas during the remaining 2 days. Therefore, we need to calculate all permutations of the word "AAABB" <…>. The answer is 5!/(3!·2!).
SOLVER OUTPUT A: 5!/(3!·2!)

SIMILAR MATH WORD PROBLEM B: We have 5 letters: three letters "A" and two letters "B". How many different 5-letter words can be created by using all the given letters?
OUTPUT SOLUTION B: Imagine that we have 5 different letters and the same question. Then the answer would be 5!. But we must not count repetitions, so the answer needs to be divided by the number of permutations among the three letters "A" and the two letters "B", which is 3! and 2! respectively. So, the answer is 5!/(3!·2!).
SOLVER OUTPUT B: 5!/(3!·2!)

It can be hard to judge problem similarity from the statements alone. Instead, we propose to find the relation between the evaluators eval_A and eval_B of the two problems. Hence, to understand the similarity between two given problems and their solutions, we need to compare their keyword vectors k together with their parameters and W (a sketch of such a comparison is given after Table 3). Solution similarity helps with semantic problem understanding and also helps classify problems into smaller classes.

Table 3
Table of corresponding phrases

Problem A ↔ Problem B
5 fruits ↔ 5 letters
3 apples ↔ three letters "A"
2 bananas ↔ two letters "B"
5 days ↔ 5-letter words
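The snippet below is a minimal sketch (our own assumption, not the exact metric of the method) of comparing two problems through the keyword weights of their evaluators; matching paraphrased phrases such as those in Table 3 would additionally require phrase-embedding matching rather than exact string keys.

```python
# A minimal sketch of evaluator-based problem similarity (Section 5.3):
# a weighted Jaccard overlap between keyword -> F1 maps of two evaluators.
from typing import Dict

def evaluator_similarity(k_a: Dict[str, float], k_b: Dict[str, float]) -> float:
    keys = set(k_a) | set(k_b)
    shared = sum(min(k_a.get(key, 0.0), k_b.get(key, 0.0)) for key in keys)
    total = sum(max(k_a.get(key, 0.0), k_b.get(key, 0.0)) for key in keys)
    return shared / total if total else 0.0

# Toy keyword weights echoing Tables 2 and 3.
problem_a = {"permutations": 0.9, "5!": 0.8, "3!2!": 0.8, "apples": 0.4}
problem_b = {"permutations": 0.9, "5!": 0.8, "3!2!": 0.7, "letters": 0.3}
print(round(evaluator_similarity(problem_a, problem_b), 2))
```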
5.4 Solutions Preprocessing for MWP

Since an MWP is mostly about composing an appropriate equation from the text in order to produce the correct answer, the solution preprocessing approach cannot be used directly. However, the solutions context can still be applied to MWPs. For instance, the example in Table 1 contains an error: the correct answer is 120 and the correct equation is x = 221 – 101. The error appears because the model interprets the problem statement incorrectly.

Table 4
An arithmetic word problem solved incorrectly (A) and a similar problem solved correctly (B)

MATH WORD PROBLEM A: Misha found 221 seashells and 35 starfish on the beach. He gave 101 of the seashells to Katya. How many seashells does Misha now have?
OUTPUT EQUATION A: x = 221 + 35 – 101
SOLVER OUTPUT A: 155
CORRECT EQUATION A: x = 221 – 101
CORRECT OUTPUT A: 120

SIMILAR MATH WORD PROBLEM B: Vasya counted 10 red cars and 3 green cars while walking home. Then he saw 5 more red cars from the window. How many red cars did he see?
OUTPUT EQUATION B: x = 10 + 5
SOLVER OUTPUT B: 15

Such an error can be detected by the solution preprocessing method. From data on correctly (and incorrectly) solved MWPs, we can obtain features that generalize to problems of a similar type. The "similarity" can be established over a large amount of Problem-and-Solutions data using the evaluator similarity described in Section 5.3. Given a large amount of data, it would not be difficult to see for this example that "35 starfish" is irrelevant, because similar solutions do not include it and the problem question does not contain the word "starfish".

5.5 Method Usage Perspectives for Real-World Situations

Such a system could be effectively applied to the automated checking of schoolchildren's mathematical solutions. To use the system in real-world situations, it is also important to apply recognition techniques and to teach our method to distinguish where the problem statement is and where the solution is placed. Once we obtain the text, we can process it with the described method and obtain evaluation results. Of course, for this task it is important to obtain high-quality data, which can be difficult with pupils' work. Nevertheless, the practical use of the system is without doubt highly promising.

6. Conclusion

We have given an overview of well-known methods for MWP solving and of the corresponding domain datasets. We have also proposed a new approach to evaluating MWP solutions and described how the module could be used for MWP similarity. The proposed method can be used for the automated grading of mathematical solutions.

Recent approaches to MWP solving have reached a high degree of accuracy [22, 23, 24]; however, on some other datasets (such as SVAMP) the majority of models cannot provide significant results. The reason is the semantic gap: natural language texts invariably assume some knowledge implicitly or contain information noise. Humans know the relevant information, but a computer reasoning from text must be given it explicitly. Filling these information gaps is a serious challenge; the representation and acquisition of the necessary background knowledge are very hard AI problems.

Figure 3: The model extracting the main features from a student solution (in Ukrainian). The orange highlighting marks where the statement is placed, while the green highlights mark the main points of the correct solution. The system grades the given solution as correct.

Therefore, there is a strong need for more detailed datasets containing not only problems with correct answers but also suggested solutions. With the help of the given algorithm and the preprocessing approach to problem sets, our method can use more context around the problem and increase accuracy on MWPs and even on more difficult problems (such as mathematical olympiad problems).

7. References

[1] Lean theorem prover. URL: https://leanprover.github.io/about/
[2] Verchinine, K., Lyaletski, A., Paskevich, A., Anisimov, A. On correctness of mathematical texts from a logical and practical point of view. 2008, pp. 583–598.
[3] IMO Grand Challenge. URL: https://imo-grand-challenge.github.io/
[4] The International Mathematical Olympiad (IMO). URL: https://www.imo-official.org/
[5] Hosseini, M. J., Hajishirzi, H., Etzioni, O., Kushman, N. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 523–533, 2014.
[6] Mitra, A., Baral, C. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2144–2153, 2016.
[7] Liang, C.-C., Wong, Y.-S., Lin, Y.-C., Su, K.-Y. A meaning-based statistical English math word problem solver. arXiv preprint arXiv:1803.06064, 2018.
[8] Koncel-Kedziorski, R., Hajishirzi, H., Sabharwal, A., Etzioni, O., Ang, S. D. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015.
[9] Roy, S., Roth, D. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413, 2016.
[10] Roy, S., Roth, D. Unit dependency graph and its application to arithmetic word problem solving. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[11] Wang, L., Zhang, D., Gao, L., Song, J., Guo, L., Shen, H. T. MathDQN: Solving arithmetic word problems via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] Wang, Y., Liu, X., Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 845–854, 2017.
[13] Wang, L., Zhang, D., Zhang, J., Xu, X., Gao, L., Dai, B., Shen, H. T. Template-based math word problem solvers with recursive neural networks. 2019.
[14] Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H. MathQA: Towards interpretable math word problem solving with operation-based formalisms.
[15] Huang, D., Shi, S., Lin, C.-Y., Yin, J., Ma, W.-Y. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Meeting of the Association for Computational Linguistics, 2016.
[16] Zhang, D., Wang, L., Zhang, L., Dai, B. T., Shen, H. T. The gap of semantic parsing: A survey on automatic math word problem solvers. URL: https://arxiv.org/pdf/1808.07290.pdf
[17] Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., Hajishirzi, H. MAWPS: A math word problem repository.
[18] Zhao, W., Shang, M., Liu, Y., Wang, L., Liu, J. Ape210K: A large-scale and template-rich dataset of math word problems.
[19] Patel, A., Bhattamishra, S., Goyal, N. Are NLP models really able to solve simple math word problems? URL: https://arxiv.org/pdf/2103.07191v2.pdf
[20] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. URL: https://arxiv.org/pdf/2103.03874v1.pdf
[21] Maric, F., Stojanovic, S. Formalizing IMO problems and solutions in Isabelle/HOL. URL: https://arxiv.org/abs/2010.16015
[22] Zhang, J., Wang, L., Lee, R. K.-W., Bin, Y., Wang, Y., Shao, J., Lim, E.-P. Graph-to-tree learning for solving math word problems. URL: https://www.aclweb.org/anthology/2020.acl-main.362.pdf
[23] Xie, Z., Sun, S. A goal-driven tree-structured neural model for math word problems. URL: https://www.ijcai.org/Proceedings/2019/0736.pdf
[24] Li, J., Wang, L., Zhang, J., Wang, Y., Dai, B. T., Zhang, D. Modeling intra-relation in math word problems with different functional multi-head attentions. URL: https://www.aclweb.org/anthology/P19-1619.pdf