<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solvers for Mathematical Word Problems in Czech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Kadlec</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Pru˚ša</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Czech Technical University, Faculty of Electrical Engineering Karlovo nám.</institution>
          <addr-line>13, 121 35 Prague 2</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We study the task of an automatic evaluation of mathematical word problems, which belongs to the category of natural language processing and has become popular in recent years. Since all the so far published methods were developed for inputs in English, our goal is to review them and propose solutions able to cope with inputs in the Czech language. We face the question whether we can achieve a competitive accuracy for a natural language with flexible word order, and with the assumption that only a relatively small dataset of training and testing data is available. We propose and evaluate two methods. One relies on a rule-based processing of dependency trees computed by UDPipe. The other method builds on machine learning. It transforms word problems into numeric vectors and trains SVM to classify them. We also show that it improves in a combination with a search for predefined sequences of words and word classes, achieving 75% accuracy on our dataset of 500 Czech word problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The research in an automatic evaluation of
mathematical word problems has nowadays a good practical
motivation. This motivation comes from the fact that a
natural language is the easiest way how many people can
express themselves. Hence, it is useful to have methods
that analyze assignments described by words, transform
them into an internal representation, calculate results and
present them in a descriptive form. This is exactly what
is aimed by computational knowledge engines like
WolframAlpha1. Mathematical word problems can be seen as
one of the most complicated text assignments. Moreover,
the complexity of translating a mathematical text into
symbolic math scales with the difficulty of math involved in
the problem solution. Hence, from a scientific point of
view, word problems represent a very suitable domain for
research.</p>
      <p>
        The interest in automatic evaluation of word problems
dates back to 1963. It was identified in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as one of the
important challenges for artificial intelligence in upcoming
decades. A first attempt to implement an automatic solver
was done by Bobrow within his diploma thesis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
next development was not so rapid since this initial study
until 2014 when Kushman et al. published their solution
based on semantic interpretation and information
extraction [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Several other works then followed.
      </p>
      <p>
        The survey [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] suggests that three different
principles for solving word problems automatically can be
distinguished. The simplest approach works with patterns
which are sequences of words. Each pattern represents
one type of word problems. Classification is done by
detecting the sequences in inputs. The approach is used e.g.
by WolframAlpha. It might be problematic to use it for
languages with flexible word order.
      </p>
      <p>
        The second approach relies on syntactic analysis.
Syntactic dependencies among words are used to extract
information from a given word problem. A disadvantage might
be a dependence on linguistic tools. System ARIS [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is
an example of this type of solver.
      </p>
      <p>
        The most efficient approach is to apply machine
learning. Kushman et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use machine learning together with
information extraction and semantic interpretation. The
role of the machine learning part is to estimate
parameters of a probabilistic model. Huang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] apply neural
networks and reinforcement learning. Their method works
without an explicit information extraction and is even able
to solve types of word problems that were not present in
the training dataset. On the other hand, the method fails
sometimes because of generating numbers not involved in
the input instance.
      </p>
      <p>
        We can compare the accuracy of the methods since both
were evaluated on dataset Alg514 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] consisting of 514
word problems. The method from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] achieved 70%
accuracy while the method from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] achieved 82.5%
accuracy. However, it is a known fact that the accuracy of
all solvers drops significantly down when they are applied
to instances from a large and much more diverse dataset.
One such a dataset, Dolphin18K consisting of 18.000 word
problems, was introduced in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. None of the solvers
evaluated in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was able to solve correctly more than
34% instances from this dataset. This shows that solving
math word problems automatically is a really challenging
task.
      </p>
      <p>All the systems discussed above were developed for
word problems in English. It is therefore natural to ask
how the used principles would work for inputs in other
languages, especially if the language is in some sense more
difficult for analysis, for example, due to more flexible
word order.</p>
      <p>In this paper, we present a study of methods suitable for
processing word problems in Czech, but we also present
some new linguistically independent ideas. As no
suitable annotated data has existed so far, it was necessary to
build a new dataset. We have collected 500 word problems
intended for the first three classes of elementary schools.
These word problems can be solved by substituting
numbers from the problem to an arithmetic expression. Hence,
the problems are not so complex as some of those in
Alg514 and Dolphin18K that also include problems whose
solution requires to assemble and solve systems of
equations. Nevertheless, we consider our dataset to be
sufficient for an initial study, which can serve as a step towards
methods processing more complex word problems.
Furthermore, a solver working reliably on the considered
domain might be appreciated by the youngest pupils,
especially if there is a possibility to provide them a step by
step explanation of how to solve a given problem. This
should justify enough the focus of our research.</p>
      <p>
        We touch all three principles discussed before and
propose two solvers. The first one is based on a support
vector machine (SVM) classifier and it is further extended
by incorporating the patterns-based classification to
handle some of the easier inputs. The advantage of this solver
is that it does not require any knowledge of semantic
(unlike the system from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). The accuracy achieved on the
created dataset is 75%. The second solution builds on
syntactic analyses. It first calls UDPipe [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to create
syntactic dependency trees for all sentences, then it traverses
the trees, tries to retrieve information on logical entities
encoded in the problem, processes the question part and
computes the result. Unlike the first solution, the extracted
information could be used to output details on how to solve
the problem. On the other hand, the accuracy is much
worse when compared to the first solution.
      </p>
      <p>The collected dataset and implementation of the
proposed methods are freely available2.</p>
      <p>The content of the paper is structured as follows.
Section 2 describes our dataset of 500 word problems in
Czech. The next three sections describe and evaluate the
proposed principles. Section 3 discusses how to extract
and use patterns, Section 4 explains the use of SVM,
Section 5 focuses on retrieving information from dependency
trees produced by UDPipe. Finally, Section 6 summarizes
the achieved results, gives additional thoughts on the
presented methods and outlines possible future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>WP500CZ dataset</title>
      <p>
        We have created a dataset called WP500CZ comprising
of 500 word problems taken from the schoolbook [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
We have further divided the whole dataset into
training and testing subsets, denoted WP500CZ-train and
WP500CZ-test, respectively, each comprising 250 word
problems. The word problems were transformed into an
electronic form partially by using OCR, but manual
editing and several corrections of OCR outputs were also
needed. Each word problem was annotated by the
expected answer (a non-negative integer) and also the
arithmetic expression used to derive the answer.
      </p>
      <p>Let us list four samples from the dataset, together with
their English translations:
W1 Jana má 4 pastelky nové a 3 pastelky staré. Kolik
pastelek má Jana?
(Jane has 4 new crayons and 3 old crayons. How
many crayons does Jane have?)
W2 Pavel má 120 Kcˇ. Ota má 80 Kcˇ. Kolik korun mají
dohromady?
(Paul has 120 CZK. Otto has 80 CZK. How many
Czech korunas do they have together?)
W3 Petr prˇecˇetl 9 knih. Eva prˇecˇetla 6 krát více knih.</p>
      <p>Kolik knih Eva prˇecˇetla?
(Peter read 9 books. Eve read 6 times more books.</p>
      <p>How many books did Eve read?)
W4 Maminka koupila 20 tvarohových kolácˇku˚ a 15
makových kolácˇku˚. Deˇti 8 kolácˇu˚ sneˇdly. Kolik
kolácˇu˚ zu˚stalo?
(Mom bought 20 cheesecakes and 15 poppy seed
cakes. The children ate 8 cakes. How many cakes
are left?)</p>
      <p>Word problem W1 is represented in a text file as follows.
Jana má 4 pastelky nové a 3 pastelky staré.
Kolik pastelek má Jana? | 7 | NUM1 + NUM2</p>
      <p>As it can be seen, there are three parts separated by
the pipe characters. The first part is a word problem, the
second part is the answer and the third part is the
expression solving the problem. Variables in the expression are
strings of the form NUMx where x is the order number of
the referenced numeric value in the assignment.</p>
      <p>Note that this approach has some limitations. For
example, it does not allow to reference numbers expressed
by words like day (has 24 hours) or June (has 30 days).
However, our dataset does not contain word problems
relaying on these translations of words to numbers.</p>
      <p>A word problem itself consists of two parts – a word
problem assignment (WPA-part) and a question
(QUEpart). Each word problem in WP500CZ fulfills that its
QUE-part is a simple sentence asking for one detail and
the expression which gives the answer is exclusively
composed of addition, subtraction, multiplication and division
operations.</p>
      <p>Task 1. Given a word problem W , the goal of an
automatic word problem evaluation is to find an expression E
solving W .
2https://github.com/hkad98/problem_solver
Remark 1. We say that W is of type E.
160
140
120
100
80
60
40
20
0</p>
      <p>Types of the word problem in WP500CZ and their
frequencies are depicted in Figure 1.</p>
      <p>
        As it can be seen, approximately one-third of the
instances are of type NUM1 + NUM2, one-third are of type
NUM1 NUM2, and the remaining instances are of 11
different types (note that these types include NUM2
NUM1, which is considered as a type different to NUM1
NUM2). This corresponds to the distribution of word
problem types in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Pattern-based classification</title>
      <p>In this section, we describe a classification of word
problems based on patterns extracted from a training
dataset. This approach is able to handle reliably a
subset of instances (we will show that the patters
extracted from WP500CZ-train apply to 54% instances of
WP500CZ-test). For this reason, we later combine the
method with machine learning.</p>
      <p>A pattern is a sequence of word lemmas and classes.
Patterns are extracted from WPA-parts and QUE-parts
separately and we call them WPA-patterns and
QUEpatterns, respectively. Given a word problem W , it
matches a WPA-pattern (QUE-pattern) P if the WPA-part
(QUE-part) of W contains a subsequence of words whose
lemmas/word classes are in the one-to-one correspondence
with the elements of P, preserving the order of the
elements.</p>
      <p>To illustrate the form of patterns, we give two
examples of WPA-patterns followed by two examples of
QUEpatterns.</p>
      <p>NOUN a NUM krát málo NOUN | NUM1 / NUM2
(NOUN and NUM times little NOUN)
být o NUM NOUN vysoký | NUM1 + NUM2
(to be NUM NOUN high)
kolik NOUN zbýt NOUN v NOUN | NUM1 - NUM2
(how many NOUN to left NOUN in NOUN)
kolik NOUN NOUN zbýt | NUM1 - NUM2
(how many NOUN NOUN to left)
Note that each line contains a pattern separated by the pipe
character from the expression used for classification when
the pattern is matched. In addition, NOUN and NUM
in the pattern part represent word classes (noun and
numeral).</p>
      <p>We propose an algorithm which obtains the patterns
automatically. Let us describe how it extracts e.g.
WPApatterns. Let S1; : : : ; Sk be the partition of training dataset
by word problem types (i.e., each Si consists of word
problems of the same type). Maps P1; : : : ; Pk are
allocated to collect the found patterns (stored as keys) and
their number of occurrences (stored as values). All these
maps are initially empty. In the end, each Pi contains
patterns extracted from Si.</p>
      <p>The algorithm has three phases. First, it process the
subsets Si one by one. For each Si, it iterates through all
pairs U;V 2 Si, U 6= V . Let u and v be the WPA-part of
U and V , respectively. Moreover, for a sequence of words
x = (x1; : : : ; xs), let `(x) be the sequence (x10; : : : ; xs0) where
xi0 =
&gt;&lt;&gt;8 ANDOJUN iiff xxii iiss aannaodujne,ctive,
&gt; NUM if xi is a numeral,
&gt;: the lemma of xi otherwise.</p>
      <p>A longest common subsequence of `(u) and `(v), denoted
Pu;v, is found. If Pi does not yet contain Pu;v, the sequence
is inserted to Pi and its number of occurrences is set to 1.
If Pu;v is already in Pi, the number of occurrences of Pu;v
is incremented.</p>
      <p>In the second phase, the algorithm discards sequences
that were not detected frequently. It iterates through sets
Pi and deletes from Pi each sequence P for which the
number of occurrences does not exceed t jPij where t 2
(0; 1) is a suitable constant.</p>
      <p>Finally, the third phase of the algorithm discards every
sequence P which is shared by two distinct sets Pi, P j.</p>
      <p>
        In terms of the time complexity, the first phase is the
most demanding. Let us assume that the number of words
in the WPA-part as well as QUE-part of considered word
problems is at most N. Moreover, let S = Sik=1 Si.
Since the basic dynamic programming algorithm finding
the longest common subsequence for two sequences of
length at most N works in O N2 time [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the time
complexity of the first phase, denoted T1(S ), satisfies
T1(S ) = O
k
iå=1 jS2ij N2
!
= O
      </p>
      <p>
        k
N2 å jSij2
i=1
!
= O N2jS j2 :
When WPA-patterns are available, they are ordered into a
sequence PWPA = P1; : : : ; Pn which fulfills that a pattern
Pj is a proper subsequence of a pattern Pi if and only if
j &gt; i. Analogously, the extracted QUE-patters are ordered
into a sequence PQUE with the same property. If a word
problem W with a WPA-part u and QUE-part v is given,
the pattern-based classification iterates through elements
of PQUE until it finds a pattern PQUE that matches u, or the
end of PQUE is reached. If PQUE is found, the
classification result is the expression assigned to PQUE. Otherwise
the algorithm iterates through PWPA and tries to find a
pattern PWPA matching v, which again determines the
classification result. No result is returned if PQUE neither PWPA
is found.
3.1
The described procedures work with lemmatized inputs.
We thus used corpus SYN2015 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to build a vocabulary
where keys are words extracted from the corpus and values
are pairs consisting of a word class and lemma. The
ambiguity of some words is resolved by taking into account
their frequency. The built vocabulary has about 1 million
items.
      </p>
      <p>The proposed pattern extraction algorithm was
implemented in Python 3. It returned 28 WPA-patterns and 8
QUE-patterns for instances in WP500CZ-train. The
running time was 4.9 [s] on a notebook equipped with Intel(R)
Core(TM) i5-5250U CPU @ 1.60GHz, 8GB RAM and
MacOS.</p>
      <p>Table 1 shows the efficiency of the subsequent
patternbased classification. Three possible classification
outcomes are distinguished – a correct classification,
incorrect classification and no result when the classification
algorithm does not return any expression.</p>
      <p>The collected patterns were detected in nearly
onehalf of the word problems in WP500CZ-test (115 out of
250). In those cases, the reliability of classification was
quite high, achieving 93% accuracy (107 correct
classifications). Interestingly, the accuracy was better than in the
case of training data, however, there were fewer instances
matching a pattern.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Machine learning</title>
      <p>
        This section proposes a solver based on machine learning.
The main idea is to come up with a suitable
transformation of a word problem into a numeric vector of fixed
dimension (a so-called feature vector) and train a multi-class
support vector machine (SVM) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to classify the inputs by
their types (expressions). In the experimental part, we also
show that it is worth to combine results produced by SVM
and the pattern matching from Section 3.
      </p>
      <p>We decided to use SVM because it is a classifier able
to cope with high-dimensional data and it is not prone
to overfitting. A flexibility is achieved through a kernel
choice, which gives a possibility to linearly separate
nonlinearly separable data by mapping them into a higher
dimension. It also does not require a large dataset for
training, unlike neural networks.</p>
      <p>Our method of representing word problems by numeric
vectors is semantic-independent. It is based on histograms
of words occurring in word problems of the same type. For
this purpose we define
1. a frequency function w(w; p; E) which for a word w
in its canonical form, p 2 fWPA; QUEg and a word
problem type E is equal to the number of word
problems of type E in the training dataset that contain a
word of the canonical form w in the p-part, and
2. a weight function a(c; p) which for a word class c
and p 2 fWPA; QUEg is equal to an integer
determining the importance of word class c in the
classification process.</p>
      <p>The values of the frequency function are calculated
based on the training dataset. For example, we obtained
the following values for verb mít (to have) and adverb
dohromady (together) from WP500CZ-train:
w(mít; WPA; NUM1 + NUM2) = 27 ;
w(mít; QUE; NUM1 + NUM2) = 28 ;
w(mít; WPA; NUM1
w(mít; QUE; NUM1</p>
      <p>NUM2) = 25 ;
NUM2) = 11 ;
w(dohromady; WPA; NUM1 + NUM2) = 0 ;
w(dohromady; QUE; NUM1 + NUM2) = 6 ;
w(dohromady; WPA; NUM1
w(dohromady; QUE; NUM1
NUM2) = 0 ;
NUM2) = 0 :
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)</p>
      <p>Values of the weight function were learned by a genetic
algorithm, taking SVM accuracy as the objective function.</p>
      <p>When the functions w and a are known, we transform a
word problem W to a numeric vector x using the following
algorithm.</p>
      <p>1. Components of x are indexed by types of word
problems in the training dataset (thus the vector dimension
coincides with the number of SVM classes) and all of
them are initially set to 0.
2. Words of W are taken one by one. Let the current
word be of a lemma w and word class c. Moreover,
let it be from a p-part of W . For each word problem
type E, it is checked if w(w; p; E) is defined and the
expression E evaluates to a non-negative integer after
substituting the numeric values from W to E (i.e., E
is feasible to W ). If true, then the E-component of x
is increased by a(c; p) w(w; p; E).</p>
      <p>Let us demonstrate how the frequencies (1)-(8) are used
to calculate the numeric vector for word problem W1 in
Section 2. For simplicity, assume that word problems in
the training dataset are only of types NUM1 + NUM2 and
NUM1 NUM2, and that a(c; p) = 1 for all c and p (i.e.,
we will ignore the weigh function).</p>
      <p>First, we observe that both considered expressions
evaluate to a positive integer when the numbers 4 and 3 from
W1 are substituted for NUM1 and NUM2, respectively,
hence both expressions are feasible to W1. Since the verb
mít is in the WPA-part as well as QUE-part of W1, it
contributes by frequencies (1) and (2) to vector component
NUM1 + NUM2, and, analogously, by frequencies (3) and
(4) to the component NUM1 NUM2. The adverb
dohromady appears only in the QUE-part of W1, hence it
contributes by frequency (6) to component NUM1 + NUM2
(note that frequency (7) is zero, hence it does not
contribute to component NUM1 NUM2). If we sum the
listed contributions, we obtain the vector
fNUM1 + NUM2 : 61; NUM1
NUM2 : 36g :
Remark 2. It is important to notice that the proposed form
of vectors suggests to use a simple classifier which only
finds a component of maximum value and returns the
expression assigned to it. We will consider this strategy in
the experimental part where it will be shown that it
performs considerably worse than SVM.</p>
      <p>As the testing data might have a different distribution
than the training data, we normalize the vectors before
passing them to SVM. For the SVM training stage, the
normalization modifies the training set vectors so that the
mean value is 0 and the variance is 1. The same
transformation is applied to each input vector before it is classified.</p>
      <p>Figure 2 shows a visualization of an SVM trained for
word problems of types NUM1 + NUM2 and NUM1
NUM2.
4.1</p>
      <sec id="sec-4-1">
        <title>Experiments</title>
        <p>In the next paragraphs we use the following abbreviations
to denote the proposed solvers:
sSVM is SVM-based solver,
sMAX is the classifier suggested by Remark 2,
sSVM+PAT is sSVM combined with the pattern
matching, and
sMAX+PAT is sMAX combined with the pattern
matching.
0.5
0.0
0.5
1.0
1.5
2.0
Classes
NUM1 - NUM2
NUM1 + NUM2
input
2.5
2.0
1.5
1.0
0.5
0.0
0.5
1.0</p>
        <p>The combined solvers sSVM+PAT and sMAX+PAT try to
apply the pattern matching first. If it does not return a
result, then sSVM and sMAX is called to complete the
classification.</p>
        <p>We used scikit-learn3 library to implement sSVM.
Radial basis function kernel was chosen as the best
performing one among tested kernels.</p>
        <p>The accuracy of the solvers can be compared in Table 2
and Table 3. The combined solver sSVM+PAT was the
best solver on the testing dataset. In addition, the strategy
of sMAX solver is inferior to the SVM classification.</p>
        <p>WP500CZ-train
WP500CZ-test
sSVM
87.6%
61.1%
sMAX
58.0%
51.0%</p>
        <p>All instances of WP500CZ were evaluated in 3.66 [s]
by sSVM+PAT and 3.15 [s] by sMAX+PAT (the running
environment was identical to that one described in
Subsection 3.1).</p>
        <p>Figures 3 and 4 show distributions of errors per word
problem type for sSVM+PAT (note that only word
problem types present in the training dataset are considered).
3https://scikit-learn.org/stable/
3
M
U
N
2
M
U
N
+
1
M
U
N
3
M
U
N
2
M
U
N
+
1
M
U</p>
        <p>N
2
M
U
N
*1
M
U
N
+
1
M
U
N
2
M
U
N
1
M
U
N
+
1
M
U
N
2
M
U
N
/
1
M
U
N
+
1
M
U
N
2
M
U
N
+
1
M
U
N
3
M
U
N
+
2
M
U
N
+
1
M
U
N
2
M
U
N
1
M
U
N
2
M
U
N
/
1
M
U
N
3
M
U
N
+
2
M
U
N
1
M
U
N
2
M
U
N
3
M
U
N
2
M
U
N
1
M
U
N
/
2
M
U
N
The so far presented solvers worked as classifiers but they
were not suitable to be used for a step by step
explanation of how to derive a solution to a given word problem.
Here we give a high-level description of a method based on
syntactic analysis, which creates an internal representation
of the input problem. Although our current
implementation of this method lags behind the SVM-based solver, we
think that there is a room for improvements and it is thus
worth to analyze causes of errors.</p>
        <p>The method uses UDPipe (with Universal
Dependencies 2.5 Models 2019-12-06) to process sentences of a
given word problem. UDPipe returns information on
word classes, sentence elements and their dependencies in
CONNL-U format4. This information allows to construct
a dependency tree.</p>
        <p>We apply several algorithms working over the tree to
retrieve structured information on entities encoded in the
assignment. For each number N, it is extracted which entity
E1 the number refers to and which entity E2 is the owner
of E1.</p>
        <p>Let us demonstrate it for the sentence:</p>
        <p>Alenka si koupila 5 cˇokoládových bonbónu˚.</p>
        <p>(Alice bought 5 chocolate candies.)
4https://universaldependencies.org/format.html
phrases like 3 times more than by numbers, i.e., if
applicable, it computes intermediate results and passes
them to the next phases.
2. Structure creation phase connects numbers and
entities. This is done by traversing the dependency tree
from nodes representing numbers. When a structure
is extracted, a vocabulary of verbs that represent a
subtraction is used to adjust sign of the number (i.e.,
unlike in the previous methods, verbs semantic is
taken into account here).
3. Evaluation phase checks the question part and
guesses which entity E the word problem queries for.
The answer is derived based on the extracted
structures that relate to E.
5.1</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experiments</title>
        <p>To tune the method and to analyze errors encountered
during testing, we considered only a subset of WP500CZ
consisting of 150 word problem instances, denoted here as
WP150CZ. The reason is that to analyze error types
required a manual approach, it would, therefore, be too
laborious to work with the entire dataset.</p>
        <p>The method achieved 76% accuracy on the training
part of WP150CZ (formed of 75 instances) and 67.1%
accuracy on the testing part. However, the performance
was much worse when the method was evaluated against
all instances of WP500CZ (which took 23.5 [s]),
solving correctly 41% of word problems. This indicates that
WP500CZ is more diverse than WP150CZ and the rules
designed based on WP150CZ are not general enough.</p>
        <p>Table 4 shows detailed analysis of errors made by the
method on WP150CZ. The analysis reveals that UDPipe
itself is a non-negligible cause of errors since it often
returns partially incorrect outputs and our method is not able
to recover from these errors.</p>
        <p>train
test</p>
        <p>UDPipe
7
9
preproc.</p>
        <p>3
5
creation
5
9
eval.</p>
        <p>3
1
total
18
24
Our paper is an initial study of an automatic evaluation
of math word problems in Czech. We have reviewed
the state-of-the-art methods for inputs in English and
proposed three solvers, based on pattern matching, SVM and
syntactic analysis. We also contributed by a dataset of 500
annotated word problems in Czech.</p>
        <p>The main findings and new ideas of our work can be
summarized as follows.</p>
        <p>The pattern extraction is a less efficient approach
when used as a standalone solver for word problems
entered in a language with flexible word order. It is
applicable only to a portion of inputs. However, we
showed that it can be well used in a combination with
other solvers since a matched pattern usually results
in a correct classification. Another success is that we
managed to implement a fully automatic pattern
extraction.</p>
        <p>A semantic-independent representation of word
problems, considering only frequencies of words, turned
out to be sufficient to obtain a relatively accurate
classifier. The trained SVM performed better than the
straightforward sMAX classifier. This might be
interpreted as meaning that the SVM was able to find
nontrivial relationships in the used features vectors.
We have tested the suitability of existing linguistic
tools for the studied task. The method based on
syntactic analysis processed dependency trees produced
by UDPipe. The method was not able to recover from
errors occasionally done by UDPipe. Improving the
accuracy of UDPipe is thus supposed to result in a
higher accuracy of the method.</p>
        <p>
          The accuracy of sSVM+PAT on WP500CZ (achieving
74.9%) was not too far from the accuracy of the
stateof-the-art solvers on Alg514 (achieving 70% or 82.5% by
methods from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], respectively), it must, however,
be said that the complexity of some instances in Alg514 is
out of the scope of our methods. To make a fairer
comparison with the state-of-the-art would require to create an
English version of WP500CZ, or to adapt the
state-of-theart solvers to inputs in Czech.
        </p>
        <p>It must also be noted that WP500CZ is a small dataset
(however, as we explained in the introduction, the problem
has not been studied for non-English languages by others
and there are no publicly available datasets of word
problems for e.g. Slavic languages). A worse performance can
be expected if sSVM+PAT is applied to larger and more
diverse datasets. A broad expansion of the dataset and a
creation of multilingual versions can, therefore, be
identified as the goal of our further work. Except much more
thorough testing, it would allow us to apply deep learning
which is expected to improve the quality of the results.</p>
        <p>In addition to our future plans, we also hope that the
presented research can inspire others to consider non-English
inputs for the studied problem.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>We thank anonymous reviewers whose suggestions helped
improve this paper. Our work was supported by the Czech
Science Foundation grant no. 19-21198S.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Christopher</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bishop</surname>
          </string-name>
          .
          <source>Pattern Recognition and Machine Learning (Information Science and Statistics)</source>
          .
          <source>SpringerVerlag</source>
          , Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Daniel</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Bobrow</surname>
          </string-name>
          .
          <article-title>Natural language input for a computer problem solving system</article-title>
          .
          <source>Technical report, USA</source>
          ,
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Thomas</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Cormen</surname>
          </string-name>
          , Charles E. Leiserson, Ronald L.
          <string-name>
            <surname>Rivest</surname>
            , and
            <given-names>Clifford</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          . Introduction to Algorithms, Third Edition.
          <source>The MIT Press, 3rd edition</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Edward</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Feigenbaum</surname>
            and
            <given-names>Julian</given-names>
          </string-name>
          <string-name>
            <surname>Feldman</surname>
          </string-name>
          .
          <article-title>Computers and Thought</article-title>
          .
          <string-name>
            <surname>McGraw-Hill</surname>
          </string-name>
          , Inc., USA,
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Javad</surname>
          </string-name>
          <string-name>
            <surname>Hosseini</surname>
          </string-name>
          , Hannaneh Hajishirzi,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Nate</given-names>
            <surname>Kushman</surname>
          </string-name>
          .
          <article-title>Learning to solve arithmetic word problems with verb categorization</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>533</lpage>
          , Doha, Qatar,
          <year>October 2014</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Danqing</given-names>
            <surname>Huang</surname>
          </string-name>
          , Jing Liu,
          <string-name>
            <surname>Chin-Yew Lin</surname>
            , and
            <given-names>Jian</given-names>
          </string-name>
          <string-name>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Neural math word problem solver with reinforcement learning</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>213</fpage>
          -
          <lpage>223</lpage>
          ,
          <string-name>
            <given-names>Santa</given-names>
            <surname>Fe</surname>
          </string-name>
          , New Mexico, USA,
          <year>August 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Danqing</given-names>
            <surname>Huang</surname>
          </string-name>
          , Shuming Shi,
          <string-name>
            <surname>Chin-Yew</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Jian
          <string-name>
            <surname>Yin</surname>
          </string-name>
          , and
          <string-name>
            <surname>Wei-Ying Ma</surname>
          </string-name>
          .
          <article-title>How well do computers solve math word problems? Large-scale dataset construction and evaluation</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>887</fpage>
          -
          <lpage>896</lpage>
          , Berlin, Germany,
          <year>August 2016</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Michal</surname>
            <given-names>Krˇen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Václav</surname>
            <given-names>Cvrcˇek</given-names>
          </string-name>
          , Tomáš Cˇ apka, Anna Cˇermáková, Milena Hnátková, Lucie Chlumská,
          <string-name>
            <surname>Dominika</surname>
            <given-names>Kovárˇíková</given-names>
          </string-name>
          , Tomáš Jelínek,
          <string-name>
            <surname>Vladimír</surname>
            <given-names>Petkevicˇ</given-names>
          </string-name>
          , Pavel Procházka, Hana Skoumalová, Michal Škrabal,
          <string-name>
            <surname>Petr</surname>
            <given-names>Trunecˇek</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavel</surname>
            <given-names>Vondrˇicˇka</given-names>
          </string-name>
          , and Adrian Zasina.
          <source>SYN2015: representative corpus of written Czech</source>
          ,
          <year>2015</year>
          .
          <article-title>LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</article-title>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Charles University.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Nate</given-names>
            <surname>Kushman</surname>
          </string-name>
          , Yoav Artzi, Luke Zettlemoyer, and
          <string-name>
            <given-names>Regina</given-names>
            <surname>Barzilay</surname>
          </string-name>
          .
          <article-title>Learning to automatically solve algebra word problems</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>271</fpage>
          -
          <lpage>281</lpage>
          , Baltimore, Maryland,
          <year>June 2014</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Marie</given-names>
            <surname>Reischigová</surname>
          </string-name>
          .
          <article-title>Matematika na základní a obecné škole ve slovních úlohách</article-title>
          .
          <source>Pansofia</source>
          ,
          <year>1996</year>
          . In Czech.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Milan</given-names>
            <surname>Straka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jana</given-names>
            <surname>Straková</surname>
          </string-name>
          .
          <article-title>Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe</article-title>
          .
          <source>In Proceedings of the CoNLL</source>
          <year>2017</year>
          <article-title>Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</article-title>
          , pages
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          , Vancouver, Canada,
          <year>August 2017</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Dongxiang</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Lei Wang, Luming Zhang, Bing Dai, and
          <string-name>
            <given-names>Heng</given-names>
            <surname>Shen</surname>
          </string-name>
          .
          <article-title>The gap of semantic parsing: A survey on automatic math word problem solvers</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , PP:
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          ,
          <year>04 2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>