=Paper=
{{Paper
|id=Vol-2718/paper09
|storemode=property
|title=Solvers for Mathematical Word Problems in Czech
|pdfUrl=https://ceur-ws.org/Vol-2718/paper09.pdf
|volume=Vol-2718
|authors=Jan Kadlec,Daniel Průša
|dblpUrl=https://dblp.org/rec/conf/itat/KadlecP20
}}
==Solvers for Mathematical Word Problems in Czech==
Jan Kadlec, Daniel Průša
Czech Technical University, Faculty of Electrical Engineering
Karlovo nám. 13, 121 35 Prague 2, Czech Republic
{kadlej24,prusa}@fel.cvut.cz

Abstract: We study the automatic evaluation of mathematical word problems, a task that belongs to natural language processing and has become popular in recent years. Since all the methods published so far were developed for inputs in English, our goal is to review them and propose solutions able to cope with inputs in Czech. We face the question of whether a competitive accuracy can be achieved for a natural language with flexible word order, under the assumption that only a relatively small dataset of training and testing data is available. We propose and evaluate two methods. One relies on rule-based processing of dependency trees computed by UDPipe. The other builds on machine learning: it transforms word problems into numeric vectors and trains an SVM to classify them. We also show that it improves in combination with a search for predefined sequences of words and word classes, achieving 75% accuracy on our dataset of 500 Czech word problems.

===1 Introduction===

Research in the automatic evaluation of mathematical word problems nowadays has good practical motivation: natural language is the easiest way for many people to express themselves. Hence, it is useful to have methods that analyze assignments described in words, transform them into an internal representation, calculate results, and present them in a descriptive form. This is exactly what computational knowledge engines like WolframAlpha (https://wolframalpha.com) aim at. Mathematical word problems can be seen as one of the most complicated text assignments. Moreover, the complexity of translating a mathematical text into symbolic math scales with the difficulty of the math involved in the problem's solution. Hence, from a scientific point of view, word problems represent a very suitable domain for research.

The interest in automatic evaluation of word problems dates back to 1963, when it was identified in [4] as one of the important challenges for artificial intelligence in the upcoming decades. A first attempt to implement an automatic solver was made by Bobrow within his thesis [2]. Development after this initial study was slow until 2014, when Kushman et al. published a solution based on semantic interpretation and information extraction [9]. Several other works then followed.

The survey [12] suggests that three different principles for solving word problems automatically can be distinguished. The simplest approach works with patterns, i.e., sequences of words, where each pattern represents one type of word problem. Classification is done by detecting the sequences in inputs. The approach is used, e.g., by WolframAlpha. It might be problematic to use it for languages with flexible word order.

The second approach relies on syntactic analysis. Syntactic dependencies among words are used to extract information from a given word problem. A disadvantage might be the dependence on linguistic tools. The system ARIS [5] is an example of this type of solver.

The most efficient approach is to apply machine learning. Kushman et al. [9] use machine learning together with information extraction and semantic interpretation; the role of the machine learning part is to estimate the parameters of a probabilistic model. Huang et al. [6] apply neural networks and reinforcement learning. Their method works without explicit information extraction and is even able to solve types of word problems that were not present in the training dataset. On the other hand, the method sometimes fails by generating numbers not involved in the input instance.

We can compare the accuracy of these methods since both were evaluated on the dataset Alg514 [9], consisting of 514 word problems. The method from [9] achieved 70% accuracy while the method from [6] achieved 82.5%. However, it is a known fact that the accuracy of all solvers drops significantly when they are applied to instances from a larger and much more diverse dataset. One such dataset, Dolphin18K, consisting of 18,000 word problems, was introduced in [7]. None of the solvers evaluated in [7] and [6] was able to solve correctly more than 34% of the instances from this dataset. This shows that solving math word problems automatically is a really challenging task.

All the systems discussed above were developed for word problems in English. It is therefore natural to ask how the underlying principles would work for inputs in other languages, especially if the language is in some sense more difficult to analyze, for example, due to a more flexible word order.

In this paper, we present a study of methods suitable for processing word problems in Czech, but we also present some new linguistically independent ideas. As no suitable annotated data existed so far, it was necessary to build a new dataset. We have collected 500 word problems intended for the first three grades of elementary school. These word problems can be solved by substituting numbers from the problem into an arithmetic expression. Hence, the problems are not as complex as some of those in Alg514 and Dolphin18K, whose solution may require assembling and solving systems of equations. Nevertheless, we consider our dataset sufficient for an initial study, which can serve as a step towards methods processing more complex word problems. Furthermore, a solver working reliably on the considered domain might be appreciated by the youngest pupils, especially if it can provide them with a step-by-step explanation of how to solve a given problem. This should sufficiently justify the focus of our research.

We touch on all three principles discussed above and propose two solvers. The first one is based on a support vector machine (SVM) classifier and is further extended by incorporating the pattern-based classification to handle some of the easier inputs. The advantage of this solver is that it does not require any knowledge of semantics (unlike the system from [5]). The accuracy achieved on the created dataset is 75%. The second solution builds on syntactic analysis. It first calls UDPipe [11] to create syntactic dependency trees for all sentences, then it traverses the trees, tries to retrieve information on the logical entities encoded in the problem, processes the question part, and computes the result. Unlike the first solution, the extracted information could be used to output details on how to solve the problem. On the other hand, its accuracy is much worse than that of the first solution.

The collected dataset and the implementation of the proposed methods are freely available at https://github.com/hkad98/problem_solver.

The content of the paper is structured as follows. Section 2 describes our dataset of 500 word problems in Czech. The next three sections describe and evaluate the proposed principles: Section 3 discusses how to extract and use patterns, Section 4 explains the use of SVM, and Section 5 focuses on retrieving information from dependency trees produced by UDPipe. Finally, Section 6 summarizes the achieved results, gives additional thoughts on the presented methods, and outlines possible future work.
===2 WP500CZ dataset===

We have created a dataset called WP500CZ comprising 500 word problems taken from the schoolbook [10]. We have further divided the whole dataset into training and testing subsets, denoted WP500CZ-train and WP500CZ-test, respectively, each comprising 250 word problems. The word problems were transformed into electronic form partially by using OCR, but manual editing and several corrections of the OCR outputs were also needed. Each word problem was annotated with the expected answer (a non-negative integer) and also with the arithmetic expression used to derive the answer.

Let us list four samples from the dataset, together with their English translations:

W1: Jana má 4 pastelky nové a 3 pastelky staré. Kolik pastelek má Jana?
(Jane has 4 new crayons and 3 old crayons. How many crayons does Jane have?)

W2: Pavel má 120 Kč. Ota má 80 Kč. Kolik korun mají dohromady?
(Paul has 120 CZK. Otto has 80 CZK. How many Czech korunas do they have together?)

W3: Petr přečetl 9 knih. Eva přečetla 6 krát více knih. Kolik knih Eva přečetla?
(Peter read 9 books. Eve read 6 times more books. How many books did Eve read?)

W4: Maminka koupila 20 tvarohových koláčků a 15 makových koláčků. Děti 8 koláčů snědly. Kolik koláčů zůstalo?
(Mom bought 20 cheesecakes and 15 poppy seed cakes. The children ate 8 cakes. How many cakes are left?)

Word problem W1 is represented in a text file as follows.

<pre>
Jana má 4 pastelky nové a 3 pastelky staré. Kolik pastelek má Jana? | 7 | NUM1 + NUM2
</pre>

As can be seen, there are three parts separated by pipe characters. The first part is the word problem, the second part is the answer, and the third part is the expression solving the problem. Variables in the expression are strings of the form NUMx, where x is the order of the referenced numeric value in the assignment (a small parsing sketch is given at the end of this section).

Note that this approach has some limitations. For example, it does not allow referencing numbers expressed by words, such as day (has 24 hours) or June (has 30 days). However, our dataset does not contain word problems relying on such translations of words to numbers.

A word problem itself consists of two parts – a word problem assignment (WPA-part) and a question (QUE-part). Each word problem in WP500CZ fulfills that its QUE-part is a simple sentence asking for one detail, and the expression giving the answer is composed exclusively of addition, subtraction, multiplication, and division operations.

Task 1. Given a word problem W, the goal of automatic word problem evaluation is to find an expression E solving W.

Remark 1. We say that W is of type E.
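To make the annotation format and Task 1 concrete, here is a minimal Python sketch (ours, not part of the released implementation; the helper name solve_annotated is made up) that parses one dataset line and checks the annotated expression against the annotated answer:

<pre>
import re

def solve_annotated(line):
    """Split a WP500CZ line 'problem | answer | expression' and evaluate
    the expression after substituting the numbers from the problem text."""
    problem, answer, expression = (part.strip() for part in line.split("|"))
    numbers = re.findall(r"\d+", problem)        # values of NUM1, NUM2, ... in order
    filled = re.sub(r"NUM(\d+)",
                    lambda m: numbers[int(m.group(1)) - 1],
                    expression)
    return problem, int(answer), eval(filled)    # filled contains only digits/operators

line = ("Jana má 4 pastelky nové a 3 pastelky staré. "
        "Kolik pastelek má Jana? | 7 | NUM1 + NUM2")
problem, answer, result = solve_annotated(line)
assert result == answer == 7
</pre>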
[Figure 1: Histogram of word problem types in WP500CZ.]

The types of word problems in WP500CZ and their frequencies are depicted in Figure 1. As can be seen, approximately one-third of the instances are of type NUM1 + NUM2, one-third are of type NUM1 - NUM2, and the remaining instances are of 11 different types (note that these types include NUM2 - NUM1, which is considered a type different from NUM1 - NUM2). This corresponds to the distribution of word problem types in [10].

===3 Pattern-based classification===

In this section, we describe a classification of word problems based on patterns extracted from a training dataset. This approach is able to handle a subset of instances reliably (we will show that the patterns extracted from WP500CZ-train apply to 54% of the instances of WP500CZ-test). For this reason, we later combine the method with machine learning.

A pattern is a sequence of word lemmas and word classes. Patterns are extracted from WPA-parts and QUE-parts separately, and we call them WPA-patterns and QUE-patterns, respectively. A given word problem W matches a WPA-pattern (QUE-pattern) P if the WPA-part (QUE-part) of W contains a subsequence of words whose lemmas/word classes are in one-to-one correspondence with the elements of P, preserving the order of the elements.

To illustrate the form of patterns, we give two examples of WPA-patterns followed by two examples of QUE-patterns.

<pre>
NOUN a NUM krát málo NOUN | NUM1 / NUM2
(NOUN and NUM times little NOUN)

být o NUM NOUN vysoký | NUM1 + NUM2
(to be NUM NOUN high)

kolik NOUN zbýt NOUN v NOUN | NUM1 - NUM2
(how many NOUN to be left NOUN in NOUN)

kolik NOUN NOUN zbýt | NUM1 - NUM2
(how many NOUN NOUN to be left)
</pre>

Note that each line contains a pattern separated by the pipe character from the expression used for classification when the pattern is matched. In addition, NOUN and NUM in the pattern part represent word classes (noun and numeral).

We propose an algorithm which obtains the patterns automatically. Let us describe how it extracts, e.g., WPA-patterns. Let S1, ..., Sk be the partition of the training dataset by word problem types (i.e., each Si consists of word problems of the same type). Maps P1, ..., Pk are allocated to collect the found patterns (stored as keys) and their numbers of occurrences (stored as values). All these maps are initially empty. In the end, each Pi contains the patterns extracted from Si.

The algorithm has three phases. First, it processes the subsets Si one by one. For each Si, it iterates through all pairs U, V ∈ Si, U ≠ V. Let u and v be the WPA-parts of U and V, respectively. Moreover, for a sequence of words x = (x1, ..., xs), let ℓ(x) be the sequence (x'1, ..., x's) where

<pre>
x'i = NOUN             if xi is a noun,
      ADJ              if xi is an adjective,
      NUM              if xi is a numeral,
      the lemma of xi  otherwise.
</pre>

A longest common subsequence of ℓ(u) and ℓ(v), denoted Pu,v, is found. If Pi does not yet contain Pu,v, the sequence is inserted into Pi and its number of occurrences is set to 1. If Pu,v is already in Pi, its number of occurrences is incremented.

In the second phase, the algorithm discards sequences that were not detected frequently enough. It iterates through the sets Pi and deletes from Pi each sequence P whose number of occurrences does not exceed τ · |Pi|, where τ ∈ (0, 1) is a suitable constant.

Finally, the third phase of the algorithm discards every sequence P which is shared by two distinct sets Pi, Pj.
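The first two phases can be sketched in a few lines of Python. This is our illustration, not the authors' implementation: tokens are assumed to be (word class, lemma) pairs already resolved via the vocabulary of Section 3.1, and the threshold value tau = 0.5 is only illustrative (the paper leaves the constant open). Phase 3, which discards patterns shared by two types, is omitted for brevity.

<pre>
from collections import Counter
from itertools import combinations

def ell(part):
    """The l-mapping: keep the word class for nouns, adjectives and
    numerals, otherwise keep the lemma."""
    return tuple(cls if cls in ("NOUN", "ADJ", "NUM") else lemma
                 for cls, lemma in part)

def lcs(a, b):
    """A longest common subsequence via the standard O(|a|*|b|) dynamic
    program [3]; table cells store the subsequences themselves."""
    best = [[()] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            best[i + 1][j + 1] = (best[i][j] + (x,) if x == y else
                                  max(best[i + 1][j], best[i][j + 1], key=len))
    return best[-1][-1]

def extract_patterns(subset, tau=0.5):
    """Phases 1 and 2 for one subset S_i: count the LCS of every pair of
    parts, then drop patterns occurring at most tau * |P_i| times."""
    counts = Counter(lcs(ell(u), ell(v)) for u, v in combinations(subset, 2))
    return {p for p, c in counts.items() if c > tau * len(counts)}
</pre>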
In terms of time complexity, the first phase is the most demanding. Let us assume that the number of words in the WPA-part as well as in the QUE-part of the considered word problems is at most N. Moreover, let S = S1 ∪ ... ∪ Sk. Since the basic dynamic programming algorithm finding the longest common subsequence of two sequences of length at most N works in O(N²) time [3], the time complexity of the first phase, denoted T1(S), satisfies

<pre>
T1(S) = O( Σ_{i=1}^{k} |Si|² · N² ) = O( N² · Σ_{i=1}^{k} |Si|² ) = O( N² · |S|² ).
</pre>

When the WPA-patterns are available, they are ordered into a sequence P_WPA = P1, ..., Pn which fulfills that whenever a pattern Pj is a proper subsequence of a pattern Pi, then j > i. Analogously, the extracted QUE-patterns are ordered into a sequence P_QUE with the same property. Given a word problem W with WPA-part u and QUE-part v, the pattern-based classification iterates through the elements of P_QUE until it finds a pattern that matches v, or the end of P_QUE is reached. If a matching QUE-pattern is found, the classification result is the expression assigned to it. Otherwise, the algorithm iterates through P_WPA and tries to find a pattern matching u, which again determines the classification result. No result is returned if neither a QUE-pattern nor a WPA-pattern matches (a sketch is given at the end of this subsection).

====3.1 Experiments====

The described procedures work with lemmatized inputs. We thus used the corpus SYN2015 [8] to build a vocabulary where keys are words extracted from the corpus and values are pairs consisting of a word class and a lemma. The ambiguity of some words is resolved by taking their frequency into account. The built vocabulary has about 1 million items.

The proposed pattern extraction algorithm was implemented in Python 3. It returned 28 WPA-patterns and 8 QUE-patterns for the instances in WP500CZ-train. The running time was 4.9 s on a notebook equipped with an Intel(R) Core(TM) i5-5250U CPU @ 1.60 GHz, 8 GB RAM, and macOS.

Table 1 shows the efficiency of the subsequent pattern-based classification. Three possible classification outcomes are distinguished – a correct classification, an incorrect classification, and no result, when the classification algorithm does not return any expression.

<pre>
                correct   incorrect   no result
WP500CZ-train      85         10         155
WP500CZ-test      107          8         135
</pre>
Table 1: The accuracy of pattern matching on WP500CZ.

The collected patterns were detected in nearly one-half of the word problems in WP500CZ-test (115 out of 250). In those cases, the reliability of the classification was quite high, achieving 93% accuracy (107 correct classifications). Interestingly, the accuracy was better than on the training data; however, fewer instances matched a pattern.
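The matching and classification procedure described above can be sketched as follows (our illustration; parts are assumed to be already mapped by the ell function of the previous sketch, and the pattern lists are assumed to be precomputed and ordered as required):

<pre>
def matches(pattern, mapped_part):
    """True iff `pattern` occurs as an order-preserving subsequence of the
    l-mapped token sequence of a WPA- or QUE-part."""
    tokens = iter(mapped_part)
    return all(any(t == p for t in tokens) for p in pattern)

def classify(mapped_wpa, mapped_que, que_patterns, wpa_patterns):
    """Pattern-based classification: QUE-patterns are tried first, then
    WPA-patterns.  Each item is a (pattern, expression) pair; the lists are
    ordered so that no pattern is preceded by one of its proper
    subsequences.  Returns None when nothing matches."""
    for pattern, expression in que_patterns:
        if matches(pattern, mapped_que):
            return expression
    for pattern, expression in wpa_patterns:
        if matches(pattern, mapped_wpa):
            return expression
    return None
</pre>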
===4 Machine learning===

This section proposes a solver based on machine learning. The main idea is to come up with a suitable transformation of a word problem into a numeric vector of fixed dimension (a so-called feature vector) and to train a multi-class support vector machine (SVM) [1] to classify the inputs by their types (expressions). In the experimental part, we also show that it is worthwhile to combine the results produced by the SVM with the pattern matching from Section 3.

We decided to use an SVM because it is a classifier able to cope with high-dimensional data and is not prone to overfitting. Flexibility is achieved through the choice of a kernel, which makes it possible to linearly separate non-linearly separable data by mapping them into a higher dimension. It also does not require a large dataset for training, unlike neural networks.

Our method of representing word problems by numeric vectors is semantic-independent. It is based on histograms of words occurring in word problems of the same type. For this purpose we define

1. a frequency function ω(w, p, E) which, for a word w in its canonical form, p ∈ {WPA, QUE}, and a word problem type E, equals the number of word problems of type E in the training dataset that contain a word of the canonical form w in the p-part, and

2. a weight function α(c, p) which, for a word class c and p ∈ {WPA, QUE}, equals an integer determining the importance of word class c in the classification process.

The values of the frequency function are calculated from the training dataset. For example, we obtained the following values for the verb mít (to have) and the adverb dohromady (together) from WP500CZ-train:

<pre>
ω(mít, WPA, NUM1 + NUM2) = 27          (1)
ω(mít, QUE, NUM1 + NUM2) = 28          (2)
ω(mít, WPA, NUM1 - NUM2) = 25          (3)
ω(mít, QUE, NUM1 - NUM2) = 11          (4)
ω(dohromady, WPA, NUM1 + NUM2) = 0     (5)
ω(dohromady, QUE, NUM1 + NUM2) = 6     (6)
ω(dohromady, WPA, NUM1 - NUM2) = 0     (7)
ω(dohromady, QUE, NUM1 - NUM2) = 0     (8)
</pre>

The values of the weight function were learned by a genetic algorithm, taking the SVM accuracy as the objective function.

When the functions ω and α are known, we transform a word problem W into a numeric vector x using the following algorithm.

1. The components of x are indexed by the types of word problems in the training dataset (thus the vector dimension coincides with the number of SVM classes), and all of them are initially set to 0.

2. The words of W are taken one by one. Let the current word have lemma w and word class c, and let it come from the p-part of W. For each word problem type E, it is checked whether ω(w, p, E) is defined and the expression E evaluates to a non-negative integer after substituting the numeric values from W into E (i.e., E is feasible for W). If so, the E-component of x is increased by α(c, p) · ω(w, p, E) (see the sketch below).
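The following sketch (ours; the names feasible and feature_vector are made up) mirrors the two steps above, including the feasibility check; the short usage block reproduces the worked example for W1 that follows.

<pre>
import re

def feasible(expression, numbers):
    """True iff the expression yields a non-negative integer after the
    numbers of the word problem are substituted for NUM1, NUM2, ..."""
    try:
        value = eval(re.sub(r"NUM(\d+)",
                            lambda m: str(numbers[int(m.group(1)) - 1]),
                            expression))
    except (IndexError, ZeroDivisionError):
        return False        # missing referenced number or division by zero
    return value >= 0 and value == int(value)

def feature_vector(words, numbers, types, omega, alpha):
    """Histogram-based feature vector: `words` is a list of
    (lemma, word_class, part) triples with part in {"WPA", "QUE"};
    `omega` maps (lemma, part, type) to frequencies and `alpha` maps
    (word_class, part) to the learned weights (default 1)."""
    x = dict.fromkeys(types, 0)
    for lemma, cls, part in words:
        for E in types:
            freq = omega.get((lemma, part, E))
            if freq is not None and feasible(E, numbers):
                x[E] += alpha.get((cls, part), 1) * freq
    return x

# Reproducing the demonstration below with the frequencies (1)-(8)
# (zero entries omitted) and alpha = 1 everywhere:
omega = {("mít", "WPA", "NUM1 + NUM2"): 27, ("mít", "QUE", "NUM1 + NUM2"): 28,
         ("mít", "WPA", "NUM1 - NUM2"): 25, ("mít", "QUE", "NUM1 - NUM2"): 11,
         ("dohromady", "QUE", "NUM1 + NUM2"): 6}
words = [("mít", "VERB", "WPA"), ("mít", "VERB", "QUE"),
         ("dohromady", "ADV", "QUE")]
print(feature_vector(words, [4, 3],
                     ["NUM1 + NUM2", "NUM1 - NUM2"], omega, {}))
# -> {'NUM1 + NUM2': 61, 'NUM1 - NUM2': 36}
</pre>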
Let us demonstrate how the frequencies (1)-(8) are used to calculate the numeric vector for word problem W1 from Section 2. For simplicity, assume that the word problems in the training dataset are only of types NUM1 + NUM2 and NUM1 - NUM2, and that α(c, p) = 1 for all c and p (i.e., we ignore the weight function).

First, we observe that both considered expressions evaluate to a positive integer when the numbers 4 and 3 from W1 are substituted for NUM1 and NUM2, respectively; hence both expressions are feasible for W1. Since the verb mít is in the WPA-part as well as the QUE-part of W1, it contributes frequencies (1) and (2) to the vector component NUM1 + NUM2 and, analogously, frequencies (3) and (4) to the component NUM1 - NUM2. The adverb dohromady appears only in the QUE-part of W1, hence it contributes frequency (6) to the component NUM1 + NUM2 (note that frequency (7) is zero, hence it does not contribute to the component NUM1 - NUM2). Summing the listed contributions, we obtain the vector

<pre>
{NUM1 + NUM2 : 61, NUM1 - NUM2 : 36}.
</pre>

Remark 2. It is important to notice that the proposed form of the vectors suggests using a simple classifier which only finds the component of maximum value and returns the expression assigned to it. We consider this strategy in the experimental part, where it is shown to perform considerably worse than the SVM.

As the testing data might have a different distribution than the training data, we normalize the vectors before passing them to the SVM. In the SVM training stage, the normalization modifies the training set vectors so that the mean value is 0 and the variance is 1. The same transformation is applied to each input vector before it is classified.

Figure 2 shows a visualization of an SVM trained for word problems of types NUM1 + NUM2 and NUM1 - NUM2.

[Figure 2: An SVM trained on normalized 2-dimensional vectors obtained for several word problems of the NUM1 + NUM2 and NUM1 - NUM2 types. The circled samples are the support vectors; the input displayed in red represents the normalized vector of an input word problem to be classified.]

====4.1 Experiments====

In the next paragraphs we use the following abbreviations for the proposed solvers:

• sSVM is the SVM-based solver,
• sMAX is the classifier suggested by Remark 2,
• sSVM+PAT is sSVM combined with the pattern matching, and
• sMAX+PAT is sMAX combined with the pattern matching.

The combined solvers sSVM+PAT and sMAX+PAT try to apply the pattern matching first. If it does not return a result, then sSVM or sMAX, respectively, is called to complete the classification (a sketch follows at the end of this subsection).

We used the scikit-learn library (https://scikit-learn.org/stable/) to implement sSVM. The radial basis function kernel was chosen as the best performing one among the tested kernels.

The accuracy of the solvers can be compared in Tables 2 and 3. The combined solver sSVM+PAT was the best solver on the testing dataset. In addition, the strategy of the sMAX solver is inferior to the SVM classification.

<pre>
                 sSVM    sMAX
WP500CZ-train   87.6%   58.0%
WP500CZ-test    61.1%   51.0%
</pre>
Table 2: The accuracy of sSVM and sMAX.

<pre>
                sSVM+PAT   sMAX+PAT
WP500CZ-train     86.8%      69.2%
WP500CZ-test      74.9%      66.1%
</pre>
Table 3: The accuracy of the combined solvers.

All instances of WP500CZ were evaluated in 3.66 s by sSVM+PAT and in 3.15 s by sMAX+PAT (the running environment was identical to the one described in Subsection 3.1).

Figures 3 and 4 show the distributions of errors per word problem type for sSVM+PAT (note that only word problem types present in the training dataset are considered).

[Figure 3: The accuracy of sSVM+PAT on WP500CZ-train, shown as counts of correct and wrong classifications per word problem type.]

[Figure 4: The accuracy of sSVM+PAT on WP500CZ-test, shown as counts of correct and wrong classifications per word problem type.]
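A sketch of the sSVM+PAT pipeline using scikit-learn, with toy stand-in training data; it reuses the classify function from the sketch in Section 3. The normalization corresponds to scikit-learn's StandardScaler and the RBF kernel matches the paper's choice; everything else (data values, vector dimension) is illustrative.

<pre>
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins for the WP500CZ-train feature vectors and type labels.
X_train = np.array([[61, 36], [55, 30], [12, 44], [9, 57]], dtype=float)
y_train = np.array(["NUM1 + NUM2", "NUM1 + NUM2",
                    "NUM1 - NUM2", "NUM1 - NUM2"])

scaler = StandardScaler().fit(X_train)             # zero mean, unit variance
svm = SVC(kernel="rbf")                            # RBF kernel, as in the paper
svm.fit(scaler.transform(X_train), y_train)

def solve_combined(mapped_wpa, mapped_que, vector,
                   que_patterns, wpa_patterns):
    """sSVM+PAT: the pattern classifier of Section 3 is tried first and the
    SVM completes the classification when no pattern matches."""
    expression = classify(mapped_wpa, mapped_que, que_patterns, wpa_patterns)
    if expression is None:
        expression = svm.predict(scaler.transform([vector]))[0]
    return expression
</pre>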
===5 Syntactic analysis===

The solvers presented so far work as classifiers, but they are not suitable for producing a step-by-step explanation of how to derive the solution to a given word problem. Here we give a high-level description of a method based on syntactic analysis, which creates an internal representation of the input problem. Although our current implementation of this method lags behind the SVM-based solver, we think that there is room for improvement, and it is thus worth analyzing the causes of errors.

The method uses UDPipe (with the Universal Dependencies 2.5 Models 2019-12-06) to process the sentences of a given word problem. UDPipe returns information on word classes, sentence elements, and their dependencies in the CoNLL-U format (https://universaldependencies.org/format.html). This information allows us to construct a dependency tree.

We apply several algorithms working over the tree to retrieve structured information on the entities encoded in the assignment. For each number N, it is extracted which entity E1 the number refers to and which entity E2 is the owner of E1. Let us demonstrate this for the sentence:

Alenka si koupila 5 čokoládových bonbónů.
(Alice bought 5 chocolate candies.)

Figure 5 shows the dependency tree constructed from the data returned by UDPipe.

[Figure 5: A dependency tree example. The root is the verb koupila (VERB); its children are Alenka (NOUN, nsubj), si (PRON, obl), bonbónů (NOUN, obj), and the final punctuation (PUNCT, punct); bonbónů has the children 5 (NUM, nummod:gov) and čokoládových (ADJ, amod).]

The structure that is supposed to be extracted from the tree is visualized in Figure 6.

[Figure 6: Extracted information. The number 5 refers to the entity bonbónů (candies), modified by čokoládových (chocolate), whose owner is Alenka (Alice).]

A given word problem is solved in three phases designed to handle a variety of assignment formulations:

1. The preprocessing phase splits compound sentences and evaluates each part separately. It also replaces phrases like "3 times more than" by numbers, i.e., if applicable, it computes intermediate results and passes them to the next phases.

2. The structure creation phase connects numbers and entities. This is done by traversing the dependency tree from the nodes representing numbers (a toy sketch follows this list). When a structure is extracted, a vocabulary of verbs that represent a subtraction is used to adjust the sign of the number (i.e., unlike in the previous methods, verb semantics is taken into account here).

3. The evaluation phase checks the question part and guesses which entity E the word problem queries for. The answer is derived from the extracted structures that relate to E.
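To make the CoNLL-U input and the number-to-entity extraction of Figures 5 and 6 concrete, here is a toy sketch. The CoNLL-U rows are written by hand to approximate Figure 5 (the real system obtains them from UDPipe), and taking the sentence subject as the owner is a deliberate simplification of the authors' traversal rules.

<pre>
def parse_conllu(block):
    """Minimal CoNLL-U reader keeping ID, FORM, UPOS, HEAD and DEPREL."""
    rows = {}
    for line in block.splitlines():
        if line and not line.startswith("#"):
            cols = line.split("\t")
            rows[int(cols[0])] = (cols[1], cols[3], int(cols[6]), cols[7])
    return rows

def number_structures(rows):
    """For every numeral, pair it with the entity it counts (its head) and
    with the owner, here approximated by the sentence subject (nsubj)."""
    owner = next((form for form, _, _, rel in rows.values()
                  if rel == "nsubj"), None)
    return [{"number": form, "entity": rows[head][0], "owner": owner}
            for form, upos, head, _ in rows.values()
            if upos == "NUM" and head in rows]

# Hand-written rows approximating Figure 5 (ID, FORM, LEMMA, UPOS, XPOS,
# FEATS, HEAD, DEPREL, DEPS, MISC).
ROWS = [
    ("1", "Alenka", "Alenka", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "si", "se", "PRON", "_", "_", "3", "obl", "_", "_"),
    ("3", "koupila", "koupit", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "5", "5", "NUM", "_", "_", "6", "nummod:gov", "_", "_"),
    ("5", "čokoládových", "čokoládový", "ADJ", "_", "_", "6", "amod", "_", "_"),
    ("6", "bonbónů", "bonbón", "NOUN", "_", "_", "3", "obj", "_", "_"),
    ("7", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
conllu = "\n".join("\t".join(row) for row in ROWS)
print(number_structures(parse_conllu(conllu)))
# -> [{'number': '5', 'entity': 'bonbónů', 'owner': 'Alenka'}]
</pre>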
====5.1 Experiments====

To tune the method and to analyze the errors encountered during testing, we considered only a subset of WP500CZ consisting of 150 word problem instances, denoted here as WP150CZ. The reason is that analyzing the error types required a manual approach; it would therefore be too laborious to work with the entire dataset.

The method achieved 76% accuracy on the training part of WP150CZ (formed of 75 instances) and 67.1% accuracy on the testing part. However, the performance was much worse when the method was evaluated against all instances of WP500CZ (which took 23.5 s), solving only 41% of the word problems correctly. This indicates that WP500CZ is more diverse than WP150CZ and that the rules designed based on WP150CZ are not general enough.

Table 4 shows a detailed analysis of the errors made by the method on WP150CZ. The analysis reveals that UDPipe itself is a non-negligible cause of errors, since it often returns partially incorrect outputs and our method is not able to recover from these errors.

<pre>
        UDPipe   preproc.   creation   eval.   total
train      7        3          5         3      18
test       9        5          9         1      24
</pre>
Table 4: Errors by type made by the solver on WP150CZ.

===6 Conclusion===

Our paper is an initial study of the automatic evaluation of math word problems in Czech. We have reviewed the state-of-the-art methods for inputs in English and proposed three solvers, based on pattern matching, SVM, and syntactic analysis. We have also contributed a dataset of 500 annotated word problems in Czech.

The main findings and new ideas of our work can be summarized as follows.

• Pattern extraction is a less efficient approach when used as a standalone solver for word problems entered in a language with flexible word order; it is applicable only to a portion of the inputs. However, we showed that it can be used well in combination with other solvers, since a matched pattern usually results in a correct classification. Another success is that we managed to implement a fully automatic pattern extraction.

• A semantic-independent representation of word problems, considering only the frequencies of words, turned out to be sufficient to obtain a relatively accurate classifier. The trained SVM performed better than the straightforward sMAX classifier. This might be interpreted as meaning that the SVM was able to find nontrivial relationships in the used feature vectors.

• We have tested the suitability of existing linguistic tools for the studied task. The method based on syntactic analysis processed dependency trees produced by UDPipe. The method was not able to recover from errors occasionally made by UDPipe. Improving the accuracy of UDPipe is thus expected to result in a higher accuracy of the method.

The accuracy of sSVM+PAT on WP500CZ (74.9%) was not too far from the accuracy of the state-of-the-art solvers on Alg514 (70% and 82.5% for the methods from [9] and [6], respectively); it must, however, be said that the complexity of some instances in Alg514 is out of the scope of our methods. A fairer comparison with the state of the art would require creating an English version of WP500CZ, or adapting the state-of-the-art solvers to inputs in Czech.

It must also be noted that WP500CZ is a small dataset (however, as we explained in the introduction, the problem has not been studied for non-English languages by others, and there are no publicly available datasets of word problems for, e.g., Slavic languages). A worse performance can be expected if sSVM+PAT is applied to larger and more diverse datasets. A broad expansion of the dataset and the creation of multilingual versions can therefore be identified as the goal of our further work. Besides enabling much more thorough testing, it would also allow us to apply deep learning, which is expected to improve the quality of the results. In addition to our future plans, we also hope that the presented research can inspire others to consider non-English inputs for the studied problem.

===Acknowledgment===

We thank the anonymous reviewers whose suggestions helped improve this paper. Our work was supported by the Czech Science Foundation grant no. 19-21198S.

===References===

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[2] Daniel G. Bobrow. Natural language input for a computer problem solving system. Technical report, USA, 1964.

[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.

[4] Edward A. Feigenbaum and Julian Feldman. Computers and Thought. McGraw-Hill, Inc., USA, 1963.

[5] Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics.
[6] Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian Yin. Neural math word problem solver with reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 213–223, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[7] Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896, Berlin, Germany, August 2016. Association for Computational Linguistics.

[8] Michal Křen, Václav Cvrček, Tomáš Čapka, Anna Čermáková, Milena Hnátková, Lucie Chlumská, Dominika Kováříková, Tomáš Jelínek, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Truneček, Pavel Vondřička, and Adrian Zasina. SYN2015: representative corpus of written Czech, 2015. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

[9] Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[10] Marie Reischigová. Matematika na základní a obecné škole ve slovních úlohách. Pansofia, 1996. In Czech.

[11] Milan Straka and Jana Straková. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada, August 2017. Association for Computational Linguistics.

[12] Dongxiang Zhang, Lei Wang, Luming Zhang, Bing Tian Dai, and Heng Tao Shen. The gap of semantic parsing: A survey on automatic math word problem solvers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.