Semantic Graphs for Mathematics Word Problems based on Mathematics Terminology

Rogers Jeffrey Leo John, Center for Computational Learning Systems, Columbia University, New York, NY, USA, rl2689@columbia.edu
Thomas S. McTavish, Center for Digital Data, Analytics & Adaptive Learning, Pearson, Austin, TX, USA, tom.mctavish@pearson.com
Rebecca J. Passonneau, Center for Computational Learning Systems, Columbia University, New York, NY, USA, becky@ccls.columbia.edu

ABSTRACT
We present a graph-based approach to discover and extend semantic relationships found in a mathematics curriculum to more general network structures that can illuminate relationships within the instructional material. Using words representative of a secondary level mathematics curriculum we identified in separate work, we constructed two similarity networks of word problems in a mathematics textbook, and used analogous random walks over the two networks to discover patterns. The two graph walks provide similar global views of problem similarity within and across chapters, but are affected differently by the number of math words in a problem and by math word frequency.

1. INTRODUCTION
Curricula are compiled learning objects, typically presented in sequential order and arranged hierarchically, as in a book's Table of Contents. Ideally, a domain model captures relationships between the learning objects and the knowledge components or skills they exercise. Unfortunately, domain models are not often granular enough for optimal learning experiences. For example, prerequisite relationships may be lacking, or the knowledge components associated with an exercise may be unknown. In such cases, assessments on those learning objects will be insufficient to enable appropriate redirection unless expert (i.e., teacher) intervention is explicitly given. Domain models remain coarse because using experts to enumerate and relate the knowledge components is costly.

As a means to automatically discover relationships among learning objects and to reveal their knowledge components, we demonstrate the use of direct similarity metrics and random graph walks to relate exercises in a mathematics curriculum. We first apply a standard cosine similarity measure between pairs of exercises, based on bag-of-words vectors consisting of math terms that we identified in separate work [7]. Then, to extract less explicit relationships between exercises, we randomly walk a graph using the cosine distance as edge weights. We also recast the problem as a bipartite graph with exercises on one side and words on the other, providing an edge when an exercise contains the math word. We contrast these two different types of random walks and find somewhat similar results, which lends confidence to the analysis. The bipartite graph walks, however, are more sensitive to differences in word frequency. Casting measures of similarity as graphs and performing random walks on them affords more nuanced ways of relating objects, which can be used to build more granular domain models for analysis of prerequisites, instructional design, and adaptive learning.
2. RELATED WORK
Random walks over graphs have been used extensively to measure text similarity. Applications include similarity of web pages [15] and other documents [5], citations [1], passages [14], person names in email [12], and so on. More recently, general methods that link graph walks with external resources like WordNet have been developed to produce a single system that handles semantic similarity for words, sentences, or text [16]. Very little work compares walks over graphs of the same content, where the graphs have different structure. We create two different kinds of graphs for mathematics word problems and compare the results. We find that the global results are very similar, which is good evidence for the general approach, and we find differences in detail that suggest further investigation could lead to customizable methods, depending on needs.

An initiative where elementary science and math tests are a driver for artificial intelligence has led to work on knowledge extraction from textbooks. Berant et al. [2] create a system to perform domain-specific deep semantic analysis of 48 paragraphs from a biology textbook for question answering. Extracted relations serve as a knowledge base against which to answer questions, and answering a question is treated as finding a proof. A shallow approach to knowledge extraction from a fourth grade science curriculum is taken in [6], and the knowledge base is extended through dialog with users until a path in the knowledge network can be found that supports a known answer. In the math domain, Kushman et al. [10] generate a global representation of algebra problems in order to solve them by extracting relations from sentences and aligning them. Seo et al. [18] study text and diagrams together in order to understand the diagrams better through textual cues. We are concerned with alignment of content across rather than within problems, and our objective is finer-grained analysis of curricula.

Other work that addresses knowledge representation from text includes ontology learning [3], which often focuses on the acquisition of sets of facts from text [4]. There has been some work on linking lexical resources like WordNet or FrameNet to formal ontologies [17, 13], which could provide a foundation for reasoning over facts extracted from text. We find one work that applies relation mining to e-learning: Šimko and Bieliková [19] apply automated relation mining to extract relations that support e-course authoring in the domain of teaching functional programming. Li et al. [11] apply k-means clustering to a combination of problem features and student performance features, and propose that the clusters correspond to Knowledge Components [8].
3. METHODS
3.1 Data
We used 1800 exercises from 17 chapters of a Grade 7 mathematics curriculum. Most are word problems, as illustrated in Figure 1. They can incorporate images, tables, and graphs, but for our analysis we use only the text. The vocabulary of the resulting text consists of 3,500 distinct words. We construct graphs where math exercises are the nodes, or, in a bipartite graph, math exercises are the left-side nodes and words are the right-side nodes. Our initial focus is on exercise similarity due to similarity of the math skills that exercises tap into, and we use mathematics terminology as an indirect proxy for the skills a problem draws upon.

Two machines produce the same type of widget. Machine A produces W widgets, X of which are damaged. Machine B produces Y widgets, Z of which are damaged. The fraction of damaged widgets for Machine A is X/W or (simplified fraction). The fraction of damaged widgets for Machine B is Z/Y or (simplified fraction). Write each fraction as a decimal and a percent. Use pencil and paper. Select a small percent that would allow for a small number of damaged widgets. Find the number of widgets by which each machine exceeded the acceptable number of widgets.

Figure 1: Sample problem; math terms are in boldface.

3.2 Math Terminology
The text of the word problems includes ordinary language expressions unrelated to the mathematics curriculum, such as the nouns machines and widgets in the problem shown in Figure 1, or the verbs produces and damaged. For our purposes, mathematics terminology consists of words that express concepts needed for the mathematical competence the curriculum addresses. To identify these terms, we developed annotation guidelines for human annotators who label words in their contexts of use, and assessed the reliability of annotation by these guidelines. Words can be used in the math texts sometimes in a math sense and sometimes in a non-math sense. Annotators were instructed to label terms based on the most frequent usage.

Using a chance-adjusted agreement coefficient in [-1,1] [9], reliability among three annotators was 0.81, representing high agreement. All the non-stop words were then labeled by a trained annotator. We developed a supervised machine learning approach to classify vocabulary into math and non-math words [7] that can be applied to new mathematics curricula. For the text used here, there were 577 math terms.
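To make the representation concrete, the following sketch shows one plausible way to reduce an exercise's text to a boolean vector over the math-term vocabulary. It is a minimal illustration: the short term list, the tokenizer, and the function name are stand-ins rather than the annotated 577-term vocabulary or the classifier of [7].

import re

# Illustrative subset of a math-term vocabulary; the actual list comes from
# the annotation and supervised classification described in Section 3.2.
MATH_TERMS = ["fraction", "decimal", "percent", "number", "simplified"]

def math_term_vector(exercise_text, vocabulary=MATH_TERMS):
    # Boolean bag-of-words over math terms: 1 if the term occurs, else 0.
    tokens = set(re.findall(r"[a-z]+", exercise_text.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]

example = ("Write each fraction as a decimal and a percent. "
           "Find the number of damaged widgets.")
print(math_term_vector(example))  # [1, 1, 1, 1, 0]

Each exercise thus becomes a sparse boolean vector (of length 577 for our data), which is the input to the similarity measures described next.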
3.3 Random Walks in Graphs
A random walk on a graph starts at a given node and steps with random probability to a neighboring node. The same random decision process is employed at this and every subsequent node until a termination criterion is met. Each time a node is visited, it is counted. Open random walks require that the start node and end node differ. Traversal methods may employ a bias to navigate toward or away from certain neighbors through edge weights or other graph attributes.

In a graph G = (V, E) with nodes V and edges E, a random walk that begins at v_x and ends at v_y can be denoted as (v_x, ..., v_y). By performing several random walks, the fraction of times the node v_y is visited converges to the probability of target v_y being visited given the start node v_x, which can be expressed as P(v_y | v_x) under the conditions of the walk. In the case of a random walk of length 1, P(v_y | v_x) simply measures the probability of v_y being selected as a node adjacent to v_x.

3.4 Cosine Similarity Graph
Math exercises are represented as bag-of-words vectors with boolean values to indicate whether a given math term is present. Cosine similarity quantifies the angle between two vectors, and is given by their dot product normalized by their norms:

cos(t, e) = \frac{t \cdot e}{\|t\| \, \|e\|} = \frac{\sum_{i=1}^{n} t_i e_i}{\sqrt{\sum_{i=1}^{n} (t_i)^2} \, \sqrt{\sum_{i=1}^{n} (e_i)^2}}    (1)

A similarity value of 1 indicates that the two vectors are the same, whereas a value of zero indicates orthogonality between the two vectors. Pairwise cosine similarities for all 1800 exercises were computed, yielding a cosine similarity matrix Mcos. The matrix corresponds to a graph where non-zero cosine similarities are edge weights between exercises.

For two math vectors considered as the sets A and B, cosine similarity can be conceptualized in terms of the intersection set C = A ∩ B and the set differences A \ B and B \ A. Cosine similarity is high when |C| ≫ |A \ B| and |C| ≫ |B \ A|.

In a graph walk, the probability that a node v_y will be reached in one step from a node v_x is given by the product of the degree centrality of v_x and the normalized edge weight (v_x, v_y). With each exercise as a starting node, we performed 100,000 random walks on the cosine-similarity graph, stepping with probability proportional to the outgoing cosine similarity weights. To measure second degrees of separation, with each walk we made two steps.

The degree of a node affects the probability of traversing any edge from that node. The two factors that affect the degree centrality of a start node are the document frequencies of its math words and the total number of math words. Here, document frequency (df) is the normalized number of exercises a word occurs in. A high-df math word in a problem increases its degree centrality because there will be more problems it can share words with, resulting in non-zero cosine values and therefore edges. The number of math words in a problem also increases its degree centrality.
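As a rough sketch of this construction, and not the exact code behind our experiments, the following computes Mcos from the boolean term vectors (Equation 1) and estimates the cosine-weighted walk matrix Mcosrw by sampling fixed-length walks; the argument names and the default number of walks are placeholders.

import numpy as np

def cosine_matrix(vectors):
    # Pairwise cosine similarity (Equation 1) over boolean term vectors.
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against exercises with no math terms
    M = (X / norms) @ (X / norms).T
    np.fill_diagonal(M, 0.0)                # no self-edges
    return M

def cosine_walk_matrix(M_cos, n_walks=1000, steps=2, seed=0):
    # Estimate M_cosrw: node visits accumulated over two-step walks that move
    # with probability proportional to the outgoing cosine edge weights.
    rng = np.random.default_rng(seed)
    n = M_cos.shape[0]
    counts = np.zeros((n, n))
    for start in range(n):
        for _ in range(n_walks):
            node = start
            for _ in range(steps):
                weights = M_cos[node]
                total = weights.sum()
                if total == 0:              # isolated node: the walk cannot move
                    break
                node = rng.choice(n, p=weights / total)
                counts[start, node] += 1    # count every node visited
    np.fill_diagonal(counts, 0.0)           # factor out returns to the source
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return counts / row_sums                # rows as probability distributions

In our experiments we used 100,000 walks per start node, so the sampled visit fractions approximate the two-step transition probabilities of the weighted graph.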
3.5 Bipartite Exercise and Word Graph
The set of exercises Ve are the left-side nodes and the math words Vw are the right-side nodes in the undirected bipartite graph G = (Ve, Vw, E), where an edge exists between v_ex and v_wi if exercise x contains the math word i. We performed open random walks on this graph to measure similarity between nodes. To measure the similarity of exercises, we walk in even steps: a step to a connected word followed by a step back to one of the exercises that shares that word. The degree of separation between vertices on the same side of the graph (e.g., exercise-to-exercise) will be l/2, where l is the length of the walk. In this paper, we explored first and second degrees of separation, so our bipartite graphs had a walk length of 4.

Because exercise nodes are connected via word nodes, we interpret the fraction of node visits as a similarity measure between the source node and any node visited. We performed 100,000 random walks from each node. Exercise-to-exercise similarity can be visualized as square matrices with source nodes in the rows and target nodes in the columns. To factor out the times a source may have been selected as one of the targets, we set the diagonal of the matrix to zero. We then normalized across the rows so that we could interpret the distribution across each row as a probability distribution over all other nodes for that source node.
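The even-step bipartite walk can be sketched as follows. This is a simplified illustration under assumed data structures (a mapping from exercise identifiers to their lists of math words), not the exact implementation used here.

import random
from collections import defaultdict

def bipartite_walk_counts(exercise_words, n_walks=1000, walk_length=4, seed=0):
    # Open random walks on the exercise-word bipartite graph; exercises are
    # visited on every even step (word -> exercise) and each visit is counted.
    rng = random.Random(seed)
    word_exercises = defaultdict(list)      # invert: word -> exercises containing it
    for ex, words in exercise_words.items():
        for w in words:
            word_exercises[w].append(ex)

    counts = defaultdict(lambda: defaultdict(int))
    for start in exercise_words:
        for _ in range(n_walks):
            ex = start
            for _ in range(walk_length // 2):
                if not exercise_words[ex]:
                    break
                word = rng.choice(exercise_words[ex])   # exercise -> word
                ex = rng.choice(word_exercises[word])   # word -> exercise
                if ex != start:                          # open walk: ignore the source
                    counts[start][ex] += 1
    return counts

# Toy usage with hypothetical exercise identifiers and word lists.
exercises = {"6.5.99": ["fraction", "percent", "decimal"],
             "6.5.85": ["fraction", "percent"],
             "6.6.102": ["ratio", "percent"]}
print(dict(bipartite_walk_counts(exercises)["6.5.99"]))

Normalizing each source's counts by its row sum, as for the cosine walk, yields the entries of Mbp.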
4. RESULTS
We compare three measures of similarity between exercises: 1) cosine similarity, 2) random walks using cosine similarity as edge weights, and 3) random walks along a bipartite graph of exercises and words.

4.1 Exercise-to-Exercise Similarity
We describe exercise-to-exercise similarity with square matrices where each exercise is represented as a row-column. A number of features of the measures are embedded in Figure 2, which shows heatmaps of color values for pairs of exercises in Chapter 6 for each matrix. We find that within chapters, and especially within sections of those chapters, there is a high degree of similarity between exercises regardless of the measure. This demonstrates that words within sections and chapters share a common vocabulary. We can see that Mcos has more extreme values than Mcosrw; as explained below, it has both more zero cosine values and more very high cosine values. This is most likely because Mcosrw, from doing the walk, picks up exercises that are another degree of separation away. When each row of the matrix is normalized to capture the distribution of the source node, the otherwise high values from Mcos are tempered in the Mcosrw matrix. This shift to a large number of lower scores is shown in the bottom panel of Figure 3. Mbp and Mcosrw are very similar, but Mbp generally has a wider dynamic range.

Figure 2: Exercise-to-Exercise similarity in Chapter 6. The exercises of Chapter 6 are displayed as row-columns in a square matrix. The rows represent source nodes and the columns represent targets. Each row has been normalized across the book, even though only Chapter 6 is shown. The axes demarcate the sections of the chapter. Mcos is the cosine similarity, Mcosrw is the output from the random walk using cosine similarity edge weights, and Mbp is the output from the random bipartite walk. Raw values are displayed between 0 and 0.005, corresponding to light and dark pixels, respectively.

4.2 Comparison of the Graph Walks
Table 1 provides summary statistics for cosine similarity and the two random walks for all pairs of problems (N = 3,250,809). The cosine matrix is very sparse, as shown by the median value of 0. Of the two random walk similarities, rwcos has a lower standard deviation around the mean, but otherwise the two random walks produce similar distributions.

Table 1: Summary statistics of the similarity distributions

            cosine        rwcos         rwbp
minimum     0             0             0
maximum     6.3 ×10−2     0.50          0.11
mean        5.55 ×10−4    5.55 ×10−4    5.55 ×10−4
median      0             2.41 ×10−4    2.06 ×10−4
std. dev.   1.19 ×10−3    8.57 ×10−4    1.24 ×10−3

The similarity values given by cosine and the cosine random walk increasingly differ the more the start problem has relatively higher degree centrality, due either to more words or to higher frequency of its words in exercises (df). For reference, the word that occurs most frequently, number, has a df of 0.42, and the second most frequent word occurs in only 15% of the exercises. Fifty-eight nodes have no edges (degree 0), the most frequent number of edges is 170, and the maximum is 1,706. Table 2 gives the summary statistics for df, number of math words, and degree centrality.

Table 2: Summary statistics of document frequency (df) of math words, number of math words in problems, and degree centrality in the rwcos graph

            df            math words    degree ctr. (rwcos)
minimum     5.54 ×10−4    1             0
maximum     0.424         24            1,706
mean        0.183         8.35          418.4
median      6.66 ×10−3    8.00          340
std. dev.   0.314         3.64          317.1
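The per-word and per-node quantities summarized in Table 2 can be derived directly from the boolean vector matrix. A minimal sketch, assuming a numpy array of term vectors as input, is:

import numpy as np

def graph_statistics(vectors):
    # Document frequency of each term, math-word count per exercise, and the
    # degree of each exercise node in the cosine graph.
    X = np.asarray(vectors, dtype=float)
    df = X.sum(axis=0) / X.shape[0]         # fraction of exercises containing each term
    n_math_words = X.sum(axis=1)            # math words per exercise
    shared = X @ X.T                        # counts of shared terms between exercise pairs
    np.fill_diagonal(shared, 0)
    degree = (shared > 0).sum(axis=1)       # exercises sharing at least one term
    return df, n_math_words, degree

Two exercises are connected in the cosine graph exactly when they share at least one math term, so the degree computed this way corresponds to the edge counts reported above.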
Inspection of the data shows that, for pairs of problems in the two chapters of our case study, if the cosine similarity between a pair is high (≥ 0.75), the similarity values for rwcos tend to go down as the number of shared words increases from 3 to between 5 and 7. For rwbp, the opposite trend occurs: the similarity goes up as the number of words increases. This difference helps account for an observed divergence in the two graph walks for sections 5 and 6 of Chapter 6.

Table 3: Two pairs of problems with high cosine similarity and reverse patterns of graph walk similarity. The first pair, from section 5, have lower than average rwcos and higher than average rwbp due to relatively many words in common (12 and 14). The second pair, from section 6, have higher than average rwcos and lower than average rwbp due to relatively few words in common.

Prob 1    Prob 2     cosine    rwcos     rwbp      N     max df
6.5.99    6.5.85     1.0000    0.0032    0.0102    12    0.42
6.5.94    6.5.83     0.8819    0.0026    0.0064    14    0.13
6.6.109   6.6.102    0.8660    0.0068    0.0037    3     0.11
6.6.104   6.6.102    0.7746    0.0068    0.0029    3     0.11

Table 3 illustrates two pairs of problems from section 5 that have high cosine similarities, relatively higher rwbp similarities (greater than the rw means of 0.0055), and relatively lower rwcos (lower than the rw means). The reverse pattern is seen for two pairs of problems from section 6 that also have high cosine similarities. These problems have higher than average rwcos and lower than average rwbp. What differentiates the two pairs of problems is that the section 5 problems have a relatively large number of words in common: 12 for the first pair, 14 for the second pair. In both pairs, some of the words have relatively high document frequency. As discussed above, these two properties increase the degree centrality of the start node of a step in the rwcos graph, and thus lower the probability of hitting each of the start node's one-degree neighbors. This effect propagates along the two steps of the walk. For the rwbp graph, however, as the number of shared math words for a pair of problems increases, the number of paths from one to the other also increases, thus raising the probability of the traversal. This effect also propagates through a two-step walk. In contrast to the section 5 problems, the two section 6 problems have relatively few words in common: 3 for both pairs.

For problem pairs where the cosine similarity is between 0.40 and 0.60, the mean similarity from rwbp is 30% higher than for rwcos when the number of math words in common is 3 (0.0033 vs. 0.0043), 80% higher when the number of math words in common is 6 (0.0024 versus 0.0045), and three times as high when the number of math words in common is 9 (0.0023 versus 0.0068). For problem pairs where the cosine similarity is less than 0.20, the two walks produce very similar results. The average similarity values for the bipartite walk are about 20% higher, and the maximum values are higher, but the two walks produce similar means, independent of the lengths of the common word vectors or the total number of math words.

Figure 3: Tail distribution of similarity values in Mcos, Mcosrw, and Mbp. Because 62% of the values in Mcos are 0, the plot shows only non-zero values.

Since we normalized the matrices across rows, which are the source nodes, differences between the bipartite matrix, Mbp, and the cosine matrices implied that the degree of the target node had a greater impact on the variability in the bipartite matrix. To measure the impact of the edge degree of the target nodes, we considered the column sums for those targets that had 1 edge, those that had 2, and so on, up to 20 edges. The results are summarized in Figure 4. As can be seen, the column sum varies linearly with the number of target edges in the bipartite matrix, whereas the cosine matrices do not show this pattern. We found that the cube root of the column sum in Mbp approaches the distribution of column sums of the cosine matrices, which is also shown in Figure 4.

Figure 4: Distribution of column sums by number of edges in the target node represented by the column. Error plots show the mean and standard error for each type. The black line is the cube root of the mean of the column sums of Mbp.
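A sketch of the column-sum analysis behind Figure 4, again under assumed inputs (a row-normalized similarity matrix and a numpy array of target-node degrees), is:

import numpy as np

def column_sums_by_degree(M, degrees, max_degree=20):
    # Group column sums of a row-normalized similarity matrix by the degree
    # of the target node, returning mean and standard error per degree.
    col_sums = M.sum(axis=0)
    stats = {}
    for d in range(1, max_degree + 1):
        vals = col_sums[degrees == d]
        if len(vals) > 1:
            stats[d] = (vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals)))
    return stats

# Plotting these statistics for M_bp and for the cosine matrices, together
# with np.cbrt of the M_bp means, reproduces the comparison shown in Figure 4.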
5. CONCLUSION
Visualization of the three similarity matrices shows that they reveal the same overall patterns; thus each is confirmed by the others. However, the bipartite walk was the most sensitive to word frequency across exercises and to the number of words in problems. With our goal of automatically discovering knowledge components and identifying their relationships, the random walk that stepped in proportion to cosine similarity performed best. It was able to discover second-degree relationships that seem reasonable as we explore those matches by eye.

Future work will test these relationships with student performance data. We should find, for example, that if two exercises are conceptually similar, then student outcomes should also be similar, and learning curves should reveal shared knowledge components. In this respect, such automatically constructed knowledge graphs can create more refined domain models that intelligent tutoring systems and robust assessments can be built upon.

6. REFERENCES
[1] Y. An, J. Janssen, and E. E. Milios. Characterizing and mining the citation graph of the computer science literature. Knowledge and Information Systems, 6(6):664–678, 2004.
[2] J. Berant, V. Srikumar, P.-C. Chen, A. Vander Linden, B. Harding, B. Huang, P. Clark, and C. D. Manning. Modeling biological processes for reading comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1499–1510, Doha, Qatar, October 2014. Association for Computational Linguistics.
[3] P. Buitelaar, A. Frank, M. Hartung, and S. Racioppa. Ontology-based information extraction and integration from heterogeneous data sources, 2008.
[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI), volume 2, pages 1306–1313, 2010.
[5] G. Erkan and D. R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479, Dec. 2004.
[6] B. Hixon, P. Clark, and H. Hajishirzi. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, May–June 2015.
[7] R. J. L. John, R. J. Passonneau, and T. S. McTavish. Semantic similarity graphs of mathematics word problems: Can terminology detection help? In Proceedings of the Eighth International Conference on Educational Data Mining, 2015.
[8] K. R. Koedinger, A. T. Corbett, and C. Perfetti. The knowledge-learning-instruction (KLI) framework: Toward bridging the science-practice chasm to enhance robust student learning. Cognitive Science, 36(5):757–798, 2012.
[9] K. Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, CA, 1980.
[10] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[11] N. Li, W. W. Cohen, and K. R. Koedinger. Discovering student models with a clustering algorithm using problem content. In Proceedings of the 6th International Conference on Educational Data Mining, 2013.
[12] E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 27–34, New York, NY, USA, 2006. ACM.
[13] I. Niles and A. Pease. Mapping WordNet to the SUMO ontology. In Proceedings of the IEEE International Knowledge Engineering Conference, pages 23–26, 2003.
[14] J. Otterbacher, G. Erkan, and D. R. Radev. Biased LexRank: Passage retrieval using random walks with question-based priors. Information Processing and Management, 45(1):42–54, 2009.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
[16] T. M. Pilehvar, D. Jurgens, and R. Navigli. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341–1351. Association for Computational Linguistics, 2013.
[17] J. Scheffczyk, A. Pease, and M. Ellsworth. Linking FrameNet to the Suggested Upper Merged Ontology. In B. Bennett and C. Fellbaum, editors, Formal Ontology in Information Systems, pages 289–. IOS Press, 2006.
[18] M. J. Seo, H. Hajishirzi, A. Farhadi, and O. Etzioni. Diagram understanding in geometry questions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.
[19] M. Šimko and M. Bieliková. Automatic concept relationships discovery for an adaptive e-course. In Proceedings of the Second International Conference on Educational Data Mining (EDM), pages 171–179, 2009.