Semantic Graphs for Mathematics Word Problems based on Mathematics Terminology

Rogers Jeffrey Leo John, Center for Computational Learning Systems, Columbia University, New York, NY, USA, rl2689@columbia.edu
Thomas S. McTavish, Center for Digital Data, Analytics & Adaptive Learning, Pearson, Austin, TX, USA, tom.mctavish@pearson.com
Rebecca J. Passonneau, Center for Computational Learning Systems, Columbia University, New York, NY, USA, becky@ccls.columbia.edu

ABSTRACT
We present a graph-based approach to discover and extend semantic relationships found in a mathematics curriculum to more general network structures that can illuminate relationships within the instructional material. Using words representative of a secondary level mathematics curriculum we identified in separate work, we constructed two similarity networks of word problems in a mathematics textbook, and used analogous random walks over the two networks to discover patterns. The two graph walks provide similar global views of problem similarity within and across chapters, but are affected differently by the number of math words in a problem and by math word frequency.

1. INTRODUCTION
Curricula are compiled learning objects, typically presented in sequential order and arranged hierarchically, as in a book's Table of Contents. Ideally, a domain model captures relationships between the learning objects and the knowledge components or skills they exercise. Unfortunately, domain models are not often granular enough for optimal learning experiences. For example, prerequisite relationships may be lacking, or the knowledge components associated with an exercise may be unknown. In such cases, assessments on those learning objects will be insufficient to enable appropriate redirection unless expert (i.e., teacher) intervention is explicitly given. Domain models remain coarse because using experts to enumerate and relate the knowledge components is costly.

As a means to automatically discover relationships among learning objects and to reveal their knowledge components, we demonstrate the use of direct similarity metrics and random graph walks to relate exercises in a mathematics curriculum. We first apply a standard cosine similarity measure between pairs of exercises, based on bag-of-words vectors consisting of math terms that we identified in separate work [7]. Then, to extract less explicit relationships between exercises, we randomly walk a graph using the cosine distance as edge weights. We also recast the problem as a bipartite graph with exercises on one side and words on the other, providing an edge when an exercise contains the math word. We contrast these two different types of random walks and find somewhat similar results, which lends confidence to the analysis. The bipartite graph walks, however, are more sensitive to differences in word frequency. Casting measures of similarity as graphs and performing random walks on them affords more nuanced ways of relating objects, which can be used to build more granular domain models for analysis of prerequisites, instructional design, and adaptive learning.
2. RELATED WORK
Random walks over graphs have been used extensively to measure text similarity. Applications include similarity of web pages [15] and other documents [5], citations [1], passages [14], person names in email [12], and so on. More recently, general methods that link graph walks with external resources like WordNet have been developed to produce a single system that handles semantic similarity for words, sentences, or text [16]. Very little work compares walks over graphs of the same content, where the graphs have different structure. We create two different kinds of graphs for mathematics word problems and compare the results. We find that the global results are very similar, which is good evidence for the general approach, and we find differences in detail that suggest further investigation could lead to customizable methods, depending on needs.

An initiative where elementary science and math tests are a driver for artificial intelligence has led to work on knowledge extraction from textbooks. Berant et al. [2] create a system to perform domain-specific deep semantic analysis of 48 paragraphs from a biology textbook for question answering. Extracted relations serve as a knowledge base against which to answer questions, and answering a question is treated as finding a proof. A shallow approach to knowledge extraction from a fourth grade science curriculum is taken in [6], and the knowledge base is extended through dialog with users until a path in the knowledge network can be found that supports a known answer. In the math domain, Kushman et al. [10] generate a global representation of algebra problems in order to solve them by extracting relations from sentences and aligning them. Seo et al. [18] study text and diagrams together in order to understand the diagrams better through textual cues. We are concerned with alignment of content across rather than within problems, and our objective is finer-grained analysis of curricula.

Other work that addresses knowledge representation from text includes ontology learning [3], which often focuses on the acquisition of sets of facts from text [4]. There has been some work on linking lexical resources like WordNet or FrameNet to formal ontologies [17, 13], which could provide a foundation for reasoning over facts extracted from text. We find one work that applies relation mining to e-learning: Šimko and Bieliková [19] apply automated relation mining to extract relations that support e-course authoring in the domain of teaching functional programming. Li et al. [11] apply k-means clustering to a combination of problem features and student performance features, and propose that the clusters correspond to Knowledge Components [8].
3. METHODS
3.1 Data
We used 1800 exercises from 17 chapters of a Grade 7 mathematics curriculum. Most are word problems, as illustrated in Figure 1. They can incorporate images, tables, and graphs, but for our analysis we use only the text. The vocabulary of the resulting text consists of 3,500 distinct words. We construct graphs where math exercises are the nodes, or, in a bipartite graph, math exercises are the left-side nodes and words are the right-side nodes. Our initial focus is on exercise similarity due to similarity of the math skills that exercises tap into, and we use mathematics terminology as an indirect proxy for the skills a problem draws upon.

Two machines produce the same type of widget. Machine A produces W widgets, X of which are damaged. Machine B produces Y widgets, Z of which are damaged. The fraction of damaged widgets for Machine A is X/W or (simplified fraction). The fraction of damaged widgets for Machine B is Z/Y or (simplified fraction). Write each fraction as a decimal and a percent. Use pencil and paper. Select a small percent that would allow for a small number of damaged widgets. Find the number of widgets by which each machine exceeded the acceptable number of widgets.

Figure 1: Sample problem; math terms are in boldface.

3.2 Math Terminology
The text of the word problems includes ordinary language expressions unrelated to the mathematics curriculum, such as the nouns machines and widgets in the problem shown in Figure 1, or the verbs produces and damaged. For our purposes, mathematics terminology consists of words that express concepts needed for the mathematical competence the curriculum addresses. To identify these terms, we developed annotation guidelines for human annotators who label words in their contexts of use, and assessed the reliability of annotation by these guidelines. Words can be used in the math texts sometimes in a math sense and sometimes in a non-math sense. Annotators were instructed to label terms based on the most frequent usage.

Using a chance-adjusted agreement coefficient in [-1,1] [9], reliability among three annotators was 0.81, representing high agreement. All the non-stop words were then labeled by a trained annotator. We developed a supervised machine learning approach to classify vocabulary into math and non-math words [7] that can be applied to new mathematics curricula. For the text used here, there were 577 math terms.
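To make the representation concrete, the following sketch shows one plausible way to reduce an exercise's text to a boolean vector over the math-term vocabulary. It is a minimal illustration: the short term list, the tokenizer, and the function name are stand-ins rather than the annotated 577-term vocabulary or the classifier of [7].

import re

# Illustrative subset of a math-term vocabulary; the actual list comes from
# the annotation and supervised classification described in Section 3.2.
MATH_TERMS = ["fraction", "decimal", "percent", "number", "simplified"]

def math_term_vector(exercise_text, vocabulary=MATH_TERMS):
    # Boolean bag-of-words over math terms: 1 if the term occurs, else 0.
    tokens = set(re.findall(r"[a-z]+", exercise_text.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]

example = ("Write each fraction as a decimal and a percent. "
           "Find the number of damaged widgets.")
print(math_term_vector(example))  # [1, 1, 1, 1, 0]

Each exercise thus becomes a sparse boolean vector (of length 577 for our data), which is the input to the similarity measures described next.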
3.3 Random Walks in Graphs
A random walk on a graph starts at a given node and steps with random probability to a neighboring node. The same random decision process is employed at this and every subsequent node until a termination criterion is met. Each time a node is visited, it is counted. Open random walks require that the start node and end node differ. Traversal methods may employ a bias to navigate toward or away from certain neighbors through edge weights or other graph attributes.

In a graph G = (V, E) with nodes V and edges E, a random walk that begins at v_x and ends at v_y can be denoted as (v_x, ..., v_y). By performing several random walks, the fraction of times the node v_y is visited converges to the probability of target v_y being visited given the start node v_x, which can be expressed as P(v_y | v_x) under the conditions of the walk. In the case of a random walk of length 1, P(v_y | v_x) simply measures the probability of v_y being selected as a node adjacent to v_x.

3.4 Cosine Similarity Graph
Math exercises are represented as bag-of-words vectors with boolean values to indicate whether a given math term is present. Cosine similarity quantifies the angle between two vectors, and is given by their dot product normalized by their norms:

cos(t, e) = \frac{t \cdot e}{\|t\| \, \|e\|} = \frac{\sum_{i=1}^{n} t_i e_i}{\sqrt{\sum_{i=1}^{n} (t_i)^2} \, \sqrt{\sum_{i=1}^{n} (e_i)^2}}    (1)

A similarity value of 1 indicates that the two vectors are the same, whereas a value of zero indicates orthogonality between the two vectors. Pairwise cosine similarities for all 1800 exercises were computed, yielding a cosine similarity matrix Mcos. The matrix corresponds to a graph where non-zero cosine similarities are edge weights between exercises.

For two math vectors considered as the sets A and B, cosine similarity can be conceptualized in terms of the intersection set C = A ∩ B and the set differences A \ B and B \ A. Cosine similarity is high when |C| ≫ |A \ B| and |C| ≫ |B \ A|.

In a graph walk, the probability that a node v_y will be reached in one step from a node v_x is given by the product of the degree centrality of v_x and the normalized edge weight (v_x, v_y). With each exercise as a starting node, we performed 100,000 random walks on the cosine-similarity graph, stepping with probability proportional to the outgoing cosine similarity weights. To measure second degrees of separation, with each walk we made two steps.

The degree of a node affects the probability of traversing any edge from that node. The two factors that affect the degree centrality of a start node are the document frequencies of its math words and the total number of math words. Here, document frequency (df) is the normalized number of exercises a word occurs in. A high-df math word in a problem increases its degree centrality because there will be more problems it can share words with, resulting in non-zero cosine values and therefore edges. The number of math words in a problem also increases its degree centrality.
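As a rough sketch of this construction, and not the exact code behind our experiments, the following computes Mcos from the boolean term vectors (Equation 1) and estimates the cosine-weighted walk matrix Mcosrw by sampling fixed-length walks; the argument names and the default number of walks are placeholders.

import numpy as np

def cosine_matrix(vectors):
    # Pairwise cosine similarity (Equation 1) over boolean term vectors.
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against exercises with no math terms
    M = (X / norms) @ (X / norms).T
    np.fill_diagonal(M, 0.0)                # no self-edges
    return M

def cosine_walk_matrix(M_cos, n_walks=1000, steps=2, seed=0):
    # Estimate M_cosrw: node visits accumulated over two-step walks that move
    # with probability proportional to the outgoing cosine edge weights.
    rng = np.random.default_rng(seed)
    n = M_cos.shape[0]
    counts = np.zeros((n, n))
    for start in range(n):
        for _ in range(n_walks):
            node = start
            for _ in range(steps):
                weights = M_cos[node]
                total = weights.sum()
                if total == 0:              # isolated node: the walk cannot move
                    break
                node = rng.choice(n, p=weights / total)
                counts[start, node] += 1    # count every node visited
    np.fill_diagonal(counts, 0.0)           # factor out returns to the source
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return counts / row_sums                # rows as probability distributions

In our experiments we used 100,000 walks per start node, so the sampled visit fractions approximate the two-step transition probabilities of the weighted graph.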
3.5 Bipartite Exercise and Word Graph
The set of exercises Ve are the left-side nodes and the math words Vw are the right-side nodes in the undirected bipartite graph G = (Ve, Vw, E), where an edge exists between v_ex and v_wi if exercise x contains the math word i. We performed open random walks on this graph to measure similarity between nodes. To measure the similarity of exercises, we walk in even steps: a step to a connected word followed by a step back to one of the exercises that shares that word. The degree of separation between vertices on the same side of the graph (e.g., exercise-to-exercise) will be l/2, where l is the length of the walk. In this paper, we explored first and second degrees of separation, so our bipartite graphs had a walk length of 4.

Because exercise nodes are connected via word nodes, we interpret the fraction of node visits as a similarity measure between the source node and any node visited. We performed 100,000 random walks from each node. Exercise-to-exercise similarity can be visualized as square matrices with source nodes in the rows and target nodes in the columns. To factor out the times a source may have been selected as one of the targets, we set the diagonal of the matrix to zero. We then normalized across the rows so that we could interpret the distribution across each row as a probability distribution over all other nodes for that source node.
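The even-step bipartite walk can be sketched as follows. This is a simplified illustration under assumed data structures (a mapping from exercise identifiers to their lists of math words), not the exact implementation used here.

import random
from collections import defaultdict

def bipartite_walk_counts(exercise_words, n_walks=1000, walk_length=4, seed=0):
    # Open random walks on the exercise-word bipartite graph; exercises are
    # visited on every even step (word -> exercise) and each visit is counted.
    rng = random.Random(seed)
    word_exercises = defaultdict(list)      # invert: word -> exercises containing it
    for ex, words in exercise_words.items():
        for w in words:
            word_exercises[w].append(ex)

    counts = defaultdict(lambda: defaultdict(int))
    for start in exercise_words:
        for _ in range(n_walks):
            ex = start
            for _ in range(walk_length // 2):
                if not exercise_words[ex]:
                    break
                word = rng.choice(exercise_words[ex])   # exercise -> word
                ex = rng.choice(word_exercises[word])   # word -> exercise
                if ex != start:                          # open walk: ignore the source
                    counts[start][ex] += 1
    return counts

# Toy usage with hypothetical exercise identifiers and word lists.
exercises = {"6.5.99": ["fraction", "percent", "decimal"],
             "6.5.85": ["fraction", "percent"],
             "6.6.102": ["ratio", "percent"]}
print(dict(bipartite_walk_counts(exercises)["6.5.99"]))

Normalizing each source's counts by its row sum, as for the cosine walk, yields the entries of Mbp.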
4. RESULTS
We compare three measures of similarity between exercises: 1) cosine similarity, 2) random walks using cosine similarity as edge weights, and 3) random walks along a bipartite graph of exercises and words.

4.1 Exercise-to-Exercise Similarity
We describe exercise-to-exercise similarity with square matrices where each exercise is represented as a row-column. A number of features of the measures are embedded in Figure 2, which shows heatmaps of color values for pairs of exercises in Chapter 6 for each matrix. We find that within chapters, and especially within sections of those chapters, there is a high degree of similarity between exercises regardless of the measure. This demonstrates that words within sections and chapters share a common vocabulary. We can see that Mcos has more extreme values than Mcosrw; as explained below, it has both more zero cosine values and more very high cosine values. This is most likely because Mcosrw, from doing the walk, picks up exercises that are another degree of separation away. When each row of the matrix is normalized to capture the distribution of the source node, the otherwise high values from Mcos are tempered in the Mcosrw matrix. This shift to a large number of lower scores is shown in the bottom panel of Figure 3. Mbp and Mcosrw are very similar, but Mbp generally has a wider dynamic range.

Figure 2: Exercise-to-Exercise similarity in Chapter 6. The exercises of Chapter 6 are displayed as row-columns in a square matrix. The rows represent source nodes and the columns represent targets. Each row has been normalized across the book, even though only Chapter 6 is shown. The axes demarcate the sections of the chapter. Mcos is the cosine similarity, Mcosrw is the output from the random walk using cosine similarity edge weights, and Mbp is the output from the random bipartite walk. Raw values are displayed between 0 and 0.005, corresponding to light and dark pixels, respectively.

4.2 Comparison of the Graph Walks
Table 1 provides summary statistics for cosine similarity and the two random walks for all pairs of problems (N = 3,250,809). The cosine matrix is very sparse, as shown by the median value of 0. Of the two random walk similarities, rwcos has a lower standard deviation around the mean, but otherwise the two random walks produce similar distributions.

Table 1: Summary statistics of the similarity distributions

            cosine        rwcos         rwbp
minimum     0             0             0
maximum     6.3 ×10−2     0.50          0.11
mean        5.55 ×10−4    5.55 ×10−4    5.55 ×10−4
median      0             2.41 ×10−4    2.06 ×10−4
std. dev.   1.19 ×10−3    8.57 ×10−4    1.24 ×10−3

The similarity values given by cosine and the cosine random walk increasingly differ the more the start problem has relatively higher degree centrality, due either to more words or to higher frequency of its words in exercises (df). For reference, the word that occurs most frequently, number, has a df of 0.42, and the second most frequent word occurs in only 15% of the exercises. Fifty-eight nodes have no edges (degree 0), the most frequent number of edges is 170, and the maximum is 1,706. Table 2 gives the summary statistics for df, number of math words, and degree centrality.

Table 2: Summary statistics of document frequency (df) of math words, number of math words in problems, and degree centrality in the rwcos graph

            df            math words    degree ctr. (rwcos)
minimum     5.54 ×10−4    1             0
maximum     0.424         24            1,706
mean        0.183         8.35          418.4
median      6.66 ×10−3    8.00          340
std. dev.   0.314         3.64          317.1
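The per-word and per-node quantities summarized in Table 2 can be derived directly from the boolean vector matrix. A minimal sketch, assuming a numpy array of term vectors as input, is:

import numpy as np

def graph_statistics(vectors):
    # Document frequency of each term, math-word count per exercise, and the
    # degree of each exercise node in the cosine graph.
    X = np.asarray(vectors, dtype=float)
    df = X.sum(axis=0) / X.shape[0]         # fraction of exercises containing each term
    n_math_words = X.sum(axis=1)            # math words per exercise
    shared = X @ X.T                        # counts of shared terms between exercise pairs
    np.fill_diagonal(shared, 0)
    degree = (shared > 0).sum(axis=1)       # exercises sharing at least one term
    return df, n_math_words, degree

Two exercises are connected in the cosine graph exactly when they share at least one math term, so the degree computed this way corresponds to the edge counts reported above.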
Inspection of the data shows that, for pairs of problems in the two chapters of our case study, if the cosine similarity between a pair is high (≥ 0.75), the similarity values for rwcos tend to go down as the number of shared words increases from 3 to between 5 and 7. For rwbp, the opposite trend occurs: the similarity goes up as the number of words increases. This difference helps account for an observed divergence in the two graph walks for sections 5 and 6 of Chapter 6.

Table 3: Two pairs of problems with high cosine similarity and reverse patterns of graph walk similarity. The first pair, from section 5, have lower than average rwcos and higher than average rwbp due to relatively many words in common (12 and 14). The second pair, from section 6, have higher than average rwcos and lower than average rwbp due to relatively few words in common.

Prob 1    Prob 2     cosine    rwcos     rwbp      N     max df
6.5.99    6.5.85     1.0000    0.0032    0.0102    12    0.42
6.5.94    6.5.83     0.8819    0.0026    0.0064    14    0.13
6.6.109   6.6.102    0.8660    0.0068    0.0037    3     0.11
6.6.104   6.6.102    0.7746    0.0068    0.0029    3     0.11

Table 3 illustrates two pairs of problems from section 5 that have high cosine similarities, relatively higher rwbp similarities (greater than the rw means of 0.0055), and relatively lower rwcos (lower than the rw means). The reverse pattern is seen for two pairs of problems from section 6 that also have high cosine similarities. These problems have higher than average rwcos and lower than average rwbp. What differentiates the two pairs of problems is that the section 5 problems have a relatively large number of words in common: 12 for the first pair, 14 for the second pair. In both pairs, some of the words have relatively high document frequency. As discussed above, these two properties increase the degree centrality of the start node of a step in the rwcos graph, and thus lower the probability of hitting each of the start node's one-degree neighbors. This effect propagates along the two steps of the walk. For the rwbp graph, however, as the number of shared math words for a pair of problems increases, the number of paths from one to the other also increases, thus raising the probability of the traversal. This effect also propagates through a two-step walk. In contrast to the section 5 problems, the two section 6 problems have relatively few words in common: 3 for both pairs.

For problem pairs where the cosine similarity is between 0.40 and 0.60, the mean similarity from rwbp is 30% higher than for rwcos when the number of math words in common is 3 (0.0033 vs. 0.0043), 80% higher when the number of math words in common is 6 (0.0024 versus 0.0045), and three times as high when the number of math words in common is 9 (0.0023 versus 0.0068). For problem pairs where the cosine similarity is less than 0.20, the two walks produce very similar results. The average similarity values for the bipartite walk are about 20% higher, and the maximum values are higher, but the two walks produce similar means, independent of the lengths of the common word vectors or the total number of math words.

Figure 3: Tail distribution of similarity values in Mcos, Mcosrw, and Mbp. Because 62% of the values in Mcos are 0, the plot shows only non-zero values.

Since we normalized the matrices across rows, which are the source nodes, differences between the bipartite matrix, Mbp, and the cosine matrices implied that the degree of the target node had a greater impact on the variability in the bipartite matrix. To measure the impact of the edge degree of the target nodes, we considered the column sums for those targets that had 1 edge, those that had 2, and so on, up to 20 edges. The results are summarized in Figure 4. As can be seen, the column sum varies linearly with the number of target edges in the bipartite matrix, whereas the cosine matrices do not show this pattern. We found that the cube root of the column sum in Mbp approaches the distribution of column sums of the cosine matrices, which is also shown in Figure 4.

Figure 4: Distribution of column sums by number of edges in the target node represented by the column. Error plots show the mean and standard error for each type. The black line is the cube root of the mean of the column sums of Mbp.
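A sketch of the column-sum analysis behind Figure 4, again under assumed inputs (a row-normalized similarity matrix and a numpy array of target-node degrees), is:

import numpy as np

def column_sums_by_degree(M, degrees, max_degree=20):
    # Group column sums of a row-normalized similarity matrix by the degree
    # of the target node, returning mean and standard error per degree.
    col_sums = M.sum(axis=0)
    stats = {}
    for d in range(1, max_degree + 1):
        vals = col_sums[degrees == d]
        if len(vals) > 1:
            stats[d] = (vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals)))
    return stats

# Plotting these statistics for M_bp and for the cosine matrices, together
# with np.cbrt of the M_bp means, reproduces the comparison shown in Figure 4.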
5. CONCLUSION
Visualization of the three similarity matrices shows that they reveal the same overall patterns; thus each is confirmed by the others. However, the bipartite walk was the most sensitive to word frequency across exercises and to the number of words in problems. With our goal of automatically discovering knowledge components and identifying their relationships, the random walk that stepped in proportion to cosine similarity performed best. It was able to discover second-degree relationships that seem reasonable as we explore those matches by eye.

Future work will test these relationships with student performance data. We should find, for example, that if two exercises are conceptually similar, then student outcomes should also be similar, and learning curves should reveal shared knowledge components. In this respect, such automatically constructed knowledge graphs can create more refined domain models that intelligent tutoring systems and robust assessments can be built upon.

6. REFERENCES
[1] Y. An, J. Janssen, and E. E. Milios. Characterizing and mining the citation graph of the computer science literature. Knowledge and Information Systems, 6(6):664–678, 2004.
[2] J. Berant, V. Srikumar, P.-C. Chen, A. Vander Linden, B. Harding, B. Huang, P. Clark, and C. D. Manning. Modeling biological processes for reading comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1499–1510, Doha, Qatar, October 2014. Association for Computational Linguistics.
[3] P. Buitelaar, A. Frank, M. Hartung, and S. Racioppa. Ontology-based information extraction and integration from heterogeneous data sources, 2008.
[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI), volume 2, pages 1306–1313, 2010.
[5] G. Erkan and D. R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479, Dec. 2004.
[6] B. Hixon, P. Clark, and H. Hajishirzi. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, May–June 2015.
[7] R. J. L. John, R. J. Passonneau, and T. S. McTavish. Semantic similarity graphs of mathematics word problems: Can terminology detection help? In Proceedings of the Eighth International Conference on Educational Data Mining, 2015.
[8] K. R. Koedinger, A. T. Corbett, and C. Perfetti. The knowledge-learning-instruction (KLI) framework: Toward bridging the science-practice chasm to enhance robust student learning. Cognitive Science, 36(5):757–798, 2012.
[9] K. Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, CA, 1980.
[10] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[11] N. Li, W. W. Cohen, and K. R. Koedinger. Discovering student models with a clustering algorithm using problem content. In Proceedings of the 6th International Conference on Educational Data Mining, 2013.
[12] E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 27–34, New York, NY, USA, 2006. ACM.
[13] I. Niles and A. Pease. Mapping WordNet to the SUMO ontology. In Proceedings of the IEEE International Knowledge Engineering Conference, pages 23–26, 2003.
[14] J. Otterbacher, G. Erkan, and D. R. Radev. Biased LexRank: Passage retrieval using random walks with question-based priors. Information Processing and Management, 45(1):42–54, 2009.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
[16] T. M. Pilehvar, D. Jurgens, and R. Navigli. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341–1351. Association for Computational Linguistics, 2013.
[17] J. Scheffczyk, A. Pease, and M. Ellsworth. Linking FrameNet to the Suggested Upper Merged Ontology. In B. Bennett and C. Fellbaum, editors, Formal Ontology in Information Systems, pages 289–. IOS Press, 2006.
[18] M. J. Seo, H. Hajishirzi, A. Farhadi, and O. Etzioni. Diagram understanding in geometry questions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.
[19] M. Šimko and M. Bieliková. Automatic concept relationships discovery for an adaptive e-course. In Proceedings of the Second International Conference on Educational Data Mining (EDM), pages 171–179, 2009.