Graph Grammar Induction via Evolutionary Computation Linting Xue Collin F.Lynch Min Chi Electrical Engineering Computer Science Computer Science Department Department Department North Carolina State North Carolina State North Carolina State University University University Raleigh, North Carolina, Raleigh, North Carolina, Raleigh, North Carolina, U.S.A. U.S.A. U.S.A. lxue3@ncsu.edu cflynch@ncsu.edu mchi@ncsu.edu ABSTRACT In this paper we describe our ongoing work on the automatic Augmented Graph Grammars provide a robust formalism induction of Augmented Graph Grammars via Evolutionary for representing and evaluating graph structures. With the Computation (EC). Our long-term goal in this work is to advent of robust graph libraries such as AGG, it has be- develop automated techniques that can extract empirically- come possible to use graph grammars to analyze realistic valid graph rules which can, in turn, be used to classify data. Prior studies have shown that graph rules can be used student-produced argument diagrams and to provide the ba- to evaluate student work and to identify empirically-valid sis for automated student guidance and evaluation. This will substructures using hand-authored rules. In this paper we build upon our prior on the evaluation of a-priori rules for describe proposed work on the automatic induction of graph student arguments. We will begin with background material grammars for student data using evolutionary computation on Augmented Graph Grammars and discuss prior work on via the pyEC system. grammar induction. We will then present an overview of our planned work. Keywords Augmented Graph Grammars, Graph Data, Evolutionary 2. AUGMENTED GRAPH GRAMMARS Computation & ARGUMENT DIAGRAMS Classical graph grammars are designed to deal with fixed 1. INTRODUCTION graphs that are composed from a finite set of static node and Graph Grammars are logical rule representations for graph arc types. Augmented graph grammars are an extension of structures. They can be designed to encode classes of suit- simple graph grammars that allow for complex node and arc able graphs to recognize complex sub-features. They were types, optional substructures, and complex rule expressions introduced by Rosenfeld and Pfaltz in 1969 as “Context-free [12]. Rather than using a fixed alphabet of graph compo- web grammars” [1]. Since then Graph grammars have been nents they are defined by a complex ontology that allows for applied to a wide range of areas, including pattern recogni- subsidiary types such as textual fields, access functions, and tion [2, 3, 4]; visual programming languages [5]; biological directional information. They can also be used to evaluate development [6]; classification of chemical compounds [7, 8]; negated elements as well as quantified expressions. As such and social network analysis [9, 10, 11]. Simple graph gram- they are better suited to rich graph data such as user-system mars are, like string grammars, composed of a set of pro- interaction logs and student-produced argument diagrams. duction rules that map from one structure to another. In this case the rules map from a simple subgraph, typically a Augmented Graph Grammars have previously been used for single node or arc, to a more complex structure. As with the detection of empirically-valid substructures in student- string grammars the node and arc types are drawn from produced argument diagrams [13, 14]. In that work a-priori finite alphabets. Despite their utility, however, graph gram- rules were used to represent key discussion features and ar- mars are very difficult to construct. The data structures gumentative flaws. Argument diagrams are graphical ar- required are complex [5]. Moreover, development of suitable gument representations that reify key features of arguments graph grammars generally requires considerable domain ex- such as hypothesis statements, claims, and citations as nodes pertise. Most existing uses of graph grammars have relied and the supporting, opposing, and informational relation- on hand-authored rules. ships as arcs between them. A sample diagram collected in that work is shown in Figure 1 The diagram includes a central research claim node, which has a single text field indicating the content of the research claim. A set of citation nodes are connected to the claim node via a set of supporting, opposing and undefined arcs colored with green, red and blue respectively. Each citation node contains two fields: one for the citation information, and the other for a summary of the cited work; each arc has a single text field explaining why the relationship holds. At Figure 1: A student-produced LASAD argument diagram representing an introductory argument. the bottom of the diagram, there is a single isolated hypoth- t esis node that contains two text fields, one for a conditional or IF field, and the other for conditional or THEN field. We expect the induced graph grammars from a set of argument (P aredW comp) O S diagrams can be used to evaluate the student thesis work. a ¬c b Figure 2 shows an a-priori rule that was defined as part of that work. This rule is designed to identify a subgraph  t.T pye = “claim00 or“hypothesis00   where a single target node t is connected to two separate  a.T ype = “Citation00   citation nodes a and b such that: a is connected to t via an   00 opposing path; b is connected via a supporting path; and  b.T ype = “Citation  c.T ype = “Comparison00   there exists no comparison arc between a and b. The rules   in that study were implemented using AGG an augmented graph grammar library built in Python [12]. AGG matches Figure 2: A simple augmented graph grammar rule the graphs using recursive stack-based algorithm. The code that detects compared counterarguments. The rule first matches the ground nodes at the top-level of the class shows a two citation nodes (a, & b) that have oppos- (t, a, & b). It then tests for the recursive productions O, ing relationships with a shared claim node (t) and do and S, before finally testing for the negated comparison arc not have a comparison arc (c) drawn between them. c. This rule does not make use of the full range of potential The arcs S and O represent recursive supporting capacity for Augmented Graph Grammars. However it is and opposing path. illustrative of the type of rules we plan to induce here, rules that generalize beyond basic types and draw on existing pro- duction classes but not, at least in the immediate term, use complex textual elements or functional features. algorithm [19]. They are based upon controlled graph walks coupled with indexing. While these algorithms are effective, 3. GRAMMAR INDUCTION particularly on smaller graphs, with low vertex degree they Graph and relational data has grown increasingly prevalent can also overfit simpler graph structures and they do not and graph analysis algorithms have been applied in a wide scale well to larger, denser graph data [20]. range of domains from social network analysis [15] to bioin- formatics [16]. Most of this work falls into one of two cate- The SUBDUE system takes a greedy-compression approach gories of algorithms: frequent subgraph matching, and graph to graph mining. SUBDUE searches for candidate sub- compression. graphs that can best compress the input graphs by replacing a candidate subgraph with a single vertex. Then nodes and A number of algorithms have been developed to discover fre- arcs are added to the vertices to form new candidate sub- quent subgraphs. These include the gSpan algorithm devel- graphs. The process is recursive and relies on the Minimum- oped by Yan and Han [17]; the AGM algorithm developed by Description-Length (MDL) principle to evaluate the candi- Inokuchi et al [18]; and the FSG algorithm developed by Ku- dates. SUBDUE has been applied successfully to extract ramochi and Karypis which is based on the previous Apriori structure from visual programming [5], web search [21], and analyzing user behaviors in games [22]. lution space. Therefore it is well suited to our present needs. This flexibility is costly, however, as EC is far more compu- While these methods are successful they have practical and tationally expensive than more specialized algorithms, and theoretical limitations. Both classes of approaches are lim- applications of EC can require a great deal of tuning for ited to static graphs composed from a finite alphabet of node each use. In the subsections below we will describe the fit- and arc types. The frequent subgraph approaches are based ness function and the operators that we will use in this work. upon iterative graph walks and can be computationally ex- For this work we will rely on pyEC a general purpose evo- pensive and are limited to finding exact matches. They do lutionary computation engine that we have developed [26]. not generalize beyond the exact graphs matched nor do they allow for recursive typing. SUBDUE, by contrast is a greedy 4.2 Dataset algorithm that finds the single most descriptive grammar Our initial analysis will be based upon a corpus of expert and does not allow for weighted matches. graded student produced argument diagrams and essays pre- viously described in [13, 14]. That dataset was collected as For our present purposes, however, our goal is to identify part of a study on students’ use of argument diagrams for multiple heirarchical classes of the type shown in Figure 2 writing that was conducted at the University of Pittsburgh that can: generalize beyond exact node and arc types; can in 2011. For that study we selected a set of students in an draw on recursive rule productions; and can be weighted undergraduate-level course on Psychological Research Meth- based upon the graph quality. Moreover our long-term goal ods. As part of the course the students were tasked with with this work is to explore graph rule induction mechanisms planning and executing an empirical research study and then that can be expanded to include textual rules and complex drafting a written report. The students were permitted to constraints. For that reason we have elected to apply evo- work individually or in teams. This report was structured as lutionary computation. This is a general-purpose machine a standard empirical workshop paper. Prior to drafting the learning mechanism that can be tunes to explore a range of report the students were tasked with diagramming the ar- possible induction mechanisms. gument that they planned to make using LASAD an online tool for argument diagramming and annotation. 4. METHODS Subsequent to this data collection process the diagrams and 4.1 Evolutionary Computation essays were graded by an experienced TA using a set of par- Evolutionary Computation (EC) is a general class of ma- allel grading rubrics. These rubrics focused on the quality of chine learning and optimization methods that are inspired the arguments in the diagrams and essays and were used to by the process of Darwinian evolution through natural se- demonstrate that the structure and quality of the diagrams lection [23] such as Genetic Algorithms [24] or Genetic Pro- can be used to predict the students’ subsequent essay per- gramming [25]. EC algorithms begin with a population of formance. These grades will be used as the weighting metric randomly generated candidate solutions such as snippets of for the diagrams and will be correlated with performance as random code, strings representing a target function, or for- part of the fitness function we describe below. After comple- mal rules. Each of these solutions is ranked by a fitness tion of the data collection, grading, and testing phases and function that is used to evaluate the quality of the individu- accounting for student dropout and incomplete assignments als. These functions can be defined by absolute measures of we collected 105 graded diagram-essay pairs 74 of which were success such as a suite of test cases, or by relative measures authored by teams. such as a competition between chess-playing systems. 4.3 Solution Representation Once the individuals have been ranked a new generation of For the purposes of our present experiments we will use a individuals is produced through a combination of crossover restricted solution representation that relies on a subset of and mutation operations. Crossover operations combine two the augmented graph grammar formalism exemplified by the or more parents to produce one or more candidate children. rule shown in Figure 2. This will include only element types In Genetic Algorithms where the candidate solutions are rep- and recursive productions. In future work we plan to sup- resented as strings this can be accomplished by splitting two port the induction of more complex rules defined by multiple parents at a given index and then exchanging the substrings graph classes, novel productions, and expressions. However to produce novel children. In Genetic Programming the par- for the present study we will focus on the simple case of ents exchange blocks of code, functions, or subtrees. Muta- individual classes coupled with predefined productions. tion operations alter randomly-selected parts of a candidate solution by swapping out one symbol or instruction for an- 4.4 Fitness Function other, adding new sub-solutions, or deleting components. We plan to use the frequency correlation metric previously This process of ranking and regeneration will iterate until employed in [13, 14]. In that study the authors assessed the a target performance threshold is reached or a maximum empirical validity of a set of a-priori diagram rules. The number of generations has passed. validity of each individual rule was assessed by testing the correlation between the frequency of the class in the existing EC methods are highly general algorithms that can be read- graph and the graph grade. The strength of that correlation ily adapted to novel domains by selecting an appropriate was estimated using Spearman’s ρ a non-parametic measure solution representation and modification operations. Thus, of correlation [27]. In that work the authors demonstrated in contrast to more specific methods such as SUBDUE, the that the a-priori rule frequency was correlated with stu- EC algorithm allows us to tune the inductive bias of our dents’ subsequent essay grades and showed that the frequen- search and to explore alternative ways of traversing the so- cies could be used to predict students’ future performance. 4.5 Mutation Our mutation operator will draw on the predefined graph ontology to make atomic changes to an existing graph class. B C D F G The change will be one of the following operations: 1 ∅ 3 A 7 ∅ E Change Node change an existing node’s type. Change Arc Change an existing arc’s type or orientation. 4 5 B ∅ F Delete Node Delete a node and its associated arcs. ∅ C Delete Arc Delete an existing arc. Add Node Add a novel node with a specified type. Figure 3: Canonical matricies for crossover parents. Add Arc Add an arc between existing nodes or add with new nodes. 4.6 Crossover B G D F C By design the crossover operation should, like genetic crossover, be conservative. Two very similar parents should produce 1 ∅ 3 E 7 ∅ A similar offspring. Crossover operations should therefore pre- serve good building blocks and sub-solutions or introns through random behavior [25]. Arbitrary graph alignment and crossover ∅ 5 B 4 F is a challenging problem that risks causing unsustainable changes on each iteration. We therefore treat graph crossover as a matrix problem. ∅ G For each pair of parent classes we will define a pair of di- agonal matricies of the type illustrated in Figure 3. The Figure 4: Canonical matricies for crossover children. letter indicies on the top and right indicate nodes while the numerical indicies internally indicate arcs, and the ∅ symbol indicates that no arc is present. The matricies are gener- grading and feedback. This work represents an improve- ated in a canonical order based upon the order in which the ment over prior graph grammar induction algorithms which nodes were added to the class. Thus on each iteration of the are limited to classical graph grammars and greedy extrac- crossover process the corresponding elements will obtain the tion. This work also represents an extension for evolution- same index. As a consequence good subsolutions will obtain ary computation by shifting it into a new domain. As part the same location and will tend to be preserved over time. of this work we also plan to explore additional extensions to the standard evolutionary computation algorithm to address Once a set of parent matricies has been generated we then problems of over-fitting such as χ2 reduction. generate two child matricies of the same size as the parents and then randomly select the node and arc members. In the example shown in figures 3 and 4 the parents have nodes 6. REFERENCES {A,B,C,D} and {E,F,G} while the children have {E,B,G,D} [1] John L Pfaltz and Azriel Rosenfeld. Web grammars. and {A,F,C}. Thus we align the nodes in canonical order In Proceedings of the 1st international joint conference and, for each node pair, we flip a coin to decide where they on Artificial intelligence, pages 609–619. Morgan are copied. If one parent is larger than the other than any Kaufmann Publishers Inc., 1969. additional nodes, in this case D, will be copied to the larger [2] John L Pfaltz. Web grammars and picture description. child. We then perform a comparable exchange process for Computer Graphics and Image Processing, the arcs. Each arc or potential arc is defined uniquely in the 1(2):193–220, 1972. matricies by its endpoints. We thus align the lists of arcs [3] Horst Bunke. Attributed programmed graph in a comparable manner and then decide randomly which grammars and their application to schematic diagram arc, or empty arc, to copy. As with the nodes, extra arcs interpretation. IEEE Transactions on Pattern from the larger parent, in this case 3,5, and one ∅ are copied Analysis and Machine Intelligence, 4(6):574–582, 1982. directly into the larger of the two children. [4] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph*. Data 5. FUTURE WORK mining and knowledge discovery, 11(3):243–271, 2005. In this paper we presented a method for the induction of [5] Keven Ates, Jacek Kukluk, Lawrence Holder, Diane augmented graph grammars through evolutionary computa- Cook, and Kang Zhang. Graph grammar induction on tion. We are presently applying this work to the automatic structural data for visual programming. In Tools with induction of empirically-valid rules for student-produced ar- Artificial Intelligence, 2006. ICTAI’06. 18th IEEE gument diagrams. This work will serve to extend our prior International Conference on, pages 232–242. IEEE, efforts on the use of augmented graph grammars for student 2006. [6] Francesc Rosselló and Gabriel Valiente. Graph CSREA Press, 2008. transformation in molecular biology. In Formal [17] Xifeng Yan and Jiawei Han. gspan: Graph-based Methods in Software and Systems Modeling, pages substructure pattern mining. In Proceedings of the 116–133. Springer, 2005. IEEE International Conference on Data Mining [7] Luc Dehaspe, Hannu Toivonen, and Ross D King. (ICDM 2002), pages 721–724. IEEE, 2002. Finding frequent substructures in chemical [18] Akihiro Inokuchi, Takashi Washio, and Hiroshi compounds. In KDD, volume 98, page 1998, 1998. Motoda. An apriori-based algorithm for mining [8] Stefan Kramer, Luc De Raedt, and Christoph Helma. frequent substructures from graph data. In Principles Molecular feature mining in hiv data. In Proceedings of Data Mining and Knowledge Discovery, pages of the seventh ACM SIGKDD international conference 13–23. Springer, 2000. on Knowledge discovery and data mining, pages [19] Michihiro Kuramochi and George Karypis. Frequent 136–143. ACM, 2001. subgraph discovery. In Proceedings IEEE International [9] Wenke Lee and Salvatore J Stolfo. A framework for Conference on Data Mining. (ICDM 2001), pages constructing features and models for intrusion 313–320. IEEE, 2001. detection systems. ACM transactions on Information [20] MICHIHIRO Kuramochi and George Karypis. Finding and system security (TiSSEC), 3(4):227–261, 2000. topological frequent patterns from graph datasets. [10] Calvin Ko. Logic induction of valid behavior Mining Graph Data, pages 117–158, 2006. specifications for intrusion detection. In Proceedings of [21] Nitish Manocha, Diane J Cook, and Lawrence B the IEEE Symposium on Security and Privacy. (S&P Holder. Cover story: structural web search using a 2000), pages 142–153. IEEE, 2000. graph-based discovery system. Intelligence, [11] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast 12(1):20–29, 2001. algorithms for mining association rules. In Proceedings [22] Diane J. Cook, Lawrence B. Holder, and G. Michael of the 20th International Conference on very large Youngblood. Graph-based analysis of human transfer data bases, VLDB, volume 1215, pages 487–499, 1994. learning using a game testbed. IEEE Trans. on [12] Collin F Lynch. Agg: Augmented graph grammars for Knowl. and Data Eng., 19:1465–1478, November 2007. complex heterogeneous data. In Proceedings of the [23] Charles Darwin. On the Origin of Species by Means of first international workshop on Graph-Based Natural Selection, or the Preservation of Favoured Educational Data Mining (GEDM 2014). Races in the Struggle for Life. John Murray: [13] Collin F. Lynch and Kevin D. Ashley. Empirically Albermarle Street: London, United Kingdom, 6 valid rules for ill-defined domains. In John Stamper edition, 1872. and Zachary Pardos, editors, Proceedings of The 7th [24] Melanie Mitchell. An Introduction to Genetic International Conference on Educational Data Mining Algorithms. MIT Press: Cambridge, Massachusetts, (EDM 2014). International Educational Datamining 1999. Society IEDMS, 2014. [25] Wolfgang Banzhaf. Genetic programming: an [14] Collin F. Lynch, Kevin D. Ashley, and Min Chi. Can introduction on the automatic evolution of computer diagrams predict essay grades? In Stefan programs and its applications. Morgan Kaufmann Trausan-Matu, Kristy Elizabeth Boyer, Martha E. Publishers ; Heidelburg : Dpunkt-verlag; San Crosby, and Kitty Panourgia, editors, Intelligent Francisco, California, 1998. Tutoring Systems, Lecture Notes in Computer Science, [26] Collin F. Lynch, Kevin D. Ashley, Niels Pinkwart, and pages 260–265. Springer, 2014. Vincent Aleven. Argument graph classification with [15] Sherry E. Marcus, Melanie Moy, and Thayne Coffman. genetic programming and c4.5. In Ryan Social network analysis. In Diane J. Cook and Shaun Joazeiro de Baker, Tiffany Barnes, and Lawrence B. Holder, editors, Mining Graph Data, Joseph E. Beck, editors, The 1st International chapter 17, pages 443–468. John Wiley & Sons, 2006. Conference on Educational Data Mining, Montreal, [16] Chang Hun You, Lawrence B. Holder, and Diane J. Québec, Canada, June 20-21, 2008. Proceedings, pages Cook. Dynamic graph-based relational learning of 137–146, 2008. temporal patterns in biological networks changing over [27] Wikipedia. Spearman’s rank correlation coefficient — time. In Hamid R. Arabnia, Mary Qu Yang, and wikipedia, the free encyclopedia, 2013. [Online; Jack Y. Yang, editors, BIOCOMP, pages 984–990. accessed 27-February-2013].