=Paper=
{{Paper
|id=Vol-1446/GEDM_2015_Submission_1
|storemode=property
|title=Graph Grammar Induction via Evolutionary Computation
|pdfUrl=https://ceur-ws.org/Vol-1446/GEDM_2015_Submission_1.pdf
|volume=Vol-1446
}}
==Graph Grammar Induction via Evolutionary Computation==
<pdf width="1500px">https://ceur-ws.org/Vol-1446/GEDM_2015_Submission_1.pdf</pdf>
<pre>
     Graph Grammar Induction via Evolutionary Computation

                   Linting Xue                        Collin F.Lynch                         Min Chi
              Electrical Engineering                Computer Science                  Computer Science
                   Department                           Department                        Department
              North Carolina State                 North Carolina State              North Carolina State
                    University                           University                        University
             Raleigh, North Carolina,             Raleigh, North Carolina,          Raleigh, North Carolina,
                      U.S.A.                               U.S.A.                            U.S.A.
               lxue3@ncsu.edu                      cflynch@ncsu.edu                    mchi@ncsu.edu

ABSTRACT                                                        In this paper we describe our ongoing work on the automatic
Augmented Graph Grammars provide a robust formalism             induction of Augmented Graph Grammars via Evolutionary
for representing and evaluating graph structures. With the      Computation (EC). Our long-term goal in this work is to
advent of robust graph libraries such as AGG, it has be-        develop automated techniques that can extract empirically-
come possible to use graph grammars to analyze realistic        valid graph rules which can, in turn, be used to classify
data. Prior studies have shown that graph rules can be used     student-produced argument diagrams and to provide the ba-
to evaluate student work and to identify empirically-valid      sis for automated student guidance and evaluation. This will
substructures using hand-authored rules. In this paper we       build upon our prior on the evaluation of a-priori rules for
describe proposed work on the automatic induction of graph      student arguments. We will begin with background material
grammars for student data using evolutionary computation        on Augmented Graph Grammars and discuss prior work on
via the pyEC system.                                            grammar induction. We will then present an overview of our
                                                                planned work.
Keywords
Augmented Graph Grammars, Graph Data, Evolutionary              2.   AUGMENTED GRAPH GRAMMARS
Computation                                                          & ARGUMENT DIAGRAMS
                                                                Classical graph grammars are designed to deal with fixed
1.    INTRODUCTION                                              graphs that are composed from a finite set of static node and
Graph Grammars are logical rule representations for graph       arc types. Augmented graph grammars are an extension of
structures. They can be designed to encode classes of suit-     simple graph grammars that allow for complex node and arc
able graphs to recognize complex sub-features. They were        types, optional substructures, and complex rule expressions
introduced by Rosenfeld and Pfaltz in 1969 as “Context-free     [12]. Rather than using a fixed alphabet of graph compo-
web grammars” [1]. Since then Graph grammars have been          nents they are defined by a complex ontology that allows for
applied to a wide range of areas, including pattern recogni-    subsidiary types such as textual fields, access functions, and
tion [2, 3, 4]; visual programming languages [5]; biological    directional information. They can also be used to evaluate
development [6]; classification of chemical compounds [7, 8];   negated elements as well as quantified expressions. As such
and social network analysis [9, 10, 11]. Simple graph gram-     they are better suited to rich graph data such as user-system
mars are, like string grammars, composed of a set of pro-       interaction logs and student-produced argument diagrams.
duction rules that map from one structure to another. In
this case the rules map from a simple subgraph, typically a     Augmented Graph Grammars have previously been used for
single node or arc, to a more complex structure. As with        the detection of empirically-valid substructures in student-
string grammars the node and arc types are drawn from           produced argument diagrams [13, 14]. In that work a-priori
finite alphabets. Despite their utility, however, graph gram-   rules were used to represent key discussion features and ar-
mars are very difficult to construct. The data structures       gumentative flaws. Argument diagrams are graphical ar-
required are complex [5]. Moreover, development of suitable     gument representations that reify key features of arguments
graph grammars generally requires considerable domain ex-       such as hypothesis statements, claims, and citations as nodes
pertise. Most existing uses of graph grammars have relied       and the supporting, opposing, and informational relation-
on hand-authored rules.                                         ships as arcs between them.

                                                                A sample diagram collected in that work is shown in Figure
                                                                1 The diagram includes a central research claim node, which
                                                                has a single text field indicating the content of the research
                                                                claim. A set of citation nodes are connected to the claim
                                                                node via a set of supporting, opposing and undefined arcs
                                                                colored with green, red and blue respectively. Each citation
                                                                node contains two fields: one for the citation information,
                                                                and the other for a summary of the cited work; each arc has
                                                                a single text field explaining why the relationship holds. At
      Figure 1: A student-produced LASAD argument diagram representing an introductory argument.


the bottom of the diagram, there is a single isolated hypoth-                                            t
esis node that contains two text fields, one for a conditional
or IF field, and the other for conditional or THEN field. We
expect the induced graph grammars from a set of argument           (P aredW comp)                    O        S
diagrams can be used to evaluate the student thesis work.
                                                                                                 a       ¬c       b
Figure 2 shows an a-priori rule that was defined as part
of that work. This rule is designed to identify a subgraph                            
                                                                                        t.T pye = “claim00 or“hypothesis00 
                                                                                                                           
where a single target node t is connected to two separate                             
                                                                                               a.T ype = “Citation00
                                                                                                                          
citation nodes a and b such that: a is connected to t via an
                                                                                                                          
                                                                                                                   00
opposing path; b is connected via a supporting path; and                                      b.T ype = “Citation         
                                                                                             c.T ype = “Comparison00
                                                                                                                          
there exists no comparison arc between a and b. The rules
                                                                                                                          
in that study were implemented using AGG an augmented
graph grammar library built in Python [12]. AGG matches
                                                                  Figure 2: A simple augmented graph grammar rule
the graphs using recursive stack-based algorithm. The code
                                                                  that detects compared counterarguments. The rule
first matches the ground nodes at the top-level of the class
                                                                  shows a two citation nodes (a, & b) that have oppos-
(t, a, & b). It then tests for the recursive productions O,
                                                                  ing relationships with a shared claim node (t) and do
and S, before finally testing for the negated comparison arc
                                                                  not have a comparison arc (c) drawn between them.
c. This rule does not make use of the full range of potential
                                                                  The arcs S and O represent recursive supporting
capacity for Augmented Graph Grammars. However it is
                                                                  and opposing path.
illustrative of the type of rules we plan to induce here, rules
that generalize beyond basic types and draw on existing pro-
duction classes but not, at least in the immediate term, use
complex textual elements or functional features.
                                                                  algorithm [19]. They are based upon controlled graph walks
                                                                  coupled with indexing. While these algorithms are effective,
3.   GRAMMAR INDUCTION                                            particularly on smaller graphs, with low vertex degree they
Graph and relational data has grown increasingly prevalent        can also overfit simpler graph structures and they do not
and graph analysis algorithms have been applied in a wide         scale well to larger, denser graph data [20].
range of domains from social network analysis [15] to bioin-
formatics [16]. Most of this work falls into one of two cate-     The SUBDUE system takes a greedy-compression approach
gories of algorithms: frequent subgraph matching, and graph       to graph mining. SUBDUE searches for candidate sub-
compression.                                                      graphs that can best compress the input graphs by replacing
                                                                  a candidate subgraph with a single vertex. Then nodes and
A number of algorithms have been developed to discover fre-       arcs are added to the vertices to form new candidate sub-
quent subgraphs. These include the gSpan algorithm devel-         graphs. The process is recursive and relies on the Minimum-
oped by Yan and Han [17]; the AGM algorithm developed by          Description-Length (MDL) principle to evaluate the candi-
Inokuchi et al [18]; and the FSG algorithm developed by Ku-       dates. SUBDUE has been applied successfully to extract
ramochi and Karypis which is based on the previous Apriori        structure from visual programming [5], web search [21], and
analyzing user behaviors in games [22].                          lution space. Therefore it is well suited to our present needs.
                                                                 This flexibility is costly, however, as EC is far more compu-
While these methods are successful they have practical and       tationally expensive than more specialized algorithms, and
theoretical limitations. Both classes of approaches are lim-     applications of EC can require a great deal of tuning for
ited to static graphs composed from a finite alphabet of node    each use. In the subsections below we will describe the fit-
and arc types. The frequent subgraph approaches are based        ness function and the operators that we will use in this work.
upon iterative graph walks and can be computationally ex-        For this work we will rely on pyEC a general purpose evo-
pensive and are limited to finding exact matches. They do        lutionary computation engine that we have developed [26].
not generalize beyond the exact graphs matched nor do they
allow for recursive typing. SUBDUE, by contrast is a greedy      4.2    Dataset
algorithm that finds the single most descriptive grammar         Our initial analysis will be based upon a corpus of expert
and does not allow for weighted matches.                         graded student produced argument diagrams and essays pre-
                                                                 viously described in [13, 14]. That dataset was collected as
For our present purposes, however, our goal is to identify       part of a study on students’ use of argument diagrams for
multiple heirarchical classes of the type shown in Figure 2      writing that was conducted at the University of Pittsburgh
that can: generalize beyond exact node and arc types; can        in 2011. For that study we selected a set of students in an
draw on recursive rule productions; and can be weighted          undergraduate-level course on Psychological Research Meth-
based upon the graph quality. Moreover our long-term goal        ods. As part of the course the students were tasked with
with this work is to explore graph rule induction mechanisms     planning and executing an empirical research study and then
that can be expanded to include textual rules and complex        drafting a written report. The students were permitted to
constraints. For that reason we have elected to apply evo-       work individually or in teams. This report was structured as
lutionary computation. This is a general-purpose machine         a standard empirical workshop paper. Prior to drafting the
learning mechanism that can be tunes to explore a range of       report the students were tasked with diagramming the ar-
possible induction mechanisms.                                   gument that they planned to make using LASAD an online
                                                                 tool for argument diagramming and annotation.
4.    METHODS
                                                                 Subsequent to this data collection process the diagrams and
4.1   Evolutionary Computation                                   essays were graded by an experienced TA using a set of par-
Evolutionary Computation (EC) is a general class of ma-          allel grading rubrics. These rubrics focused on the quality of
chine learning and optimization methods that are inspired        the arguments in the diagrams and essays and were used to
by the process of Darwinian evolution through natural se-        demonstrate that the structure and quality of the diagrams
lection [23] such as Genetic Algorithms [24] or Genetic Pro-     can be used to predict the students’ subsequent essay per-
gramming [25]. EC algorithms begin with a population of          formance. These grades will be used as the weighting metric
randomly generated candidate solutions such as snippets of       for the diagrams and will be correlated with performance as
random code, strings representing a target function, or for-     part of the fitness function we describe below. After comple-
mal rules. Each of these solutions is ranked by a fitness        tion of the data collection, grading, and testing phases and
function that is used to evaluate the quality of the individu-   accounting for student dropout and incomplete assignments
als. These functions can be defined by absolute measures of      we collected 105 graded diagram-essay pairs 74 of which were
success such as a suite of test cases, or by relative measures   authored by teams.
such as a competition between chess-playing systems.
                                                                 4.3    Solution Representation
Once the individuals have been ranked a new generation of        For the purposes of our present experiments we will use a
individuals is produced through a combination of crossover       restricted solution representation that relies on a subset of
and mutation operations. Crossover operations combine two        the augmented graph grammar formalism exemplified by the
or more parents to produce one or more candidate children.       rule shown in Figure 2. This will include only element types
In Genetic Algorithms where the candidate solutions are rep-     and recursive productions. In future work we plan to sup-
resented as strings this can be accomplished by splitting two    port the induction of more complex rules defined by multiple
parents at a given index and then exchanging the substrings      graph classes, novel productions, and expressions. However
to produce novel children. In Genetic Programming the par-       for the present study we will focus on the simple case of
ents exchange blocks of code, functions, or subtrees. Muta-      individual classes coupled with predefined productions.
tion operations alter randomly-selected parts of a candidate
solution by swapping out one symbol or instruction for an-       4.4    Fitness Function
other, adding new sub-solutions, or deleting components.         We plan to use the frequency correlation metric previously
This process of ranking and regeneration will iterate until      employed in [13, 14]. In that study the authors assessed the
a target performance threshold is reached or a maximum           empirical validity of a set of a-priori diagram rules. The
number of generations has passed.                                validity of each individual rule was assessed by testing the
                                                                 correlation between the frequency of the class in the existing
EC methods are highly general algorithms that can be read-       graph and the graph grade. The strength of that correlation
ily adapted to novel domains by selecting an appropriate         was estimated using Spearman’s ρ a non-parametic measure
solution representation and modification operations. Thus,       of correlation [27]. In that work the authors demonstrated
in contrast to more specific methods such as SUBDUE, the         that the a-priori rule frequency was correlated with stu-
EC algorithm allows us to tune the inductive bias of our         dents’ subsequent essay grades and showed that the frequen-
search and to explore alternative ways of traversing the so-     cies could be used to predict students’ future performance.
4.5   Mutation
Our mutation operator will draw on the predefined graph
ontology to make atomic changes to an existing graph class.              B      C      D                  F       G
The change will be one of the following operations:
                                                                         1      ∅      3      A            7      ∅      E
Change Node change an existing node’s type.

Change Arc Change an existing arc’s type or orientation.                        4      5      B                   ∅      F

Delete Node Delete a node and its associated arcs.
                                                                                       ∅      C
Delete Arc Delete an existing arc.

Add Node Add a novel node with a specified type.
                                                                    Figure 3: Canonical matricies for crossover parents.
Add Arc Add an arc between existing nodes or add with
    new nodes.

4.6   Crossover                                                          B      G      D                   F      C
By design the crossover operation should, like genetic crossover,
be conservative. Two very similar parents should produce                 1      ∅      3      E            7      ∅      A
similar offspring. Crossover operations should therefore pre-
serve good building blocks and sub-solutions or introns through
random behavior [25]. Arbitrary graph alignment and crossover                   ∅      5      B                   4      F
is a challenging problem that risks causing unsustainable
changes on each iteration. We therefore treat graph crossover
as a matrix problem.                                                                   ∅      G

For each pair of parent classes we will define a pair of di-
agonal matricies of the type illustrated in Figure 3. The           Figure 4: Canonical matricies for crossover children.
letter indicies on the top and right indicate nodes while the
numerical indicies internally indicate arcs, and the ∅ symbol
indicates that no arc is present. The matricies are gener-          grading and feedback. This work represents an improve-
ated in a canonical order based upon the order in which the         ment over prior graph grammar induction algorithms which
nodes were added to the class. Thus on each iteration of the        are limited to classical graph grammars and greedy extrac-
crossover process the corresponding elements will obtain the        tion. This work also represents an extension for evolution-
same index. As a consequence good subsolutions will obtain          ary computation by shifting it into a new domain. As part
the same location and will tend to be preserved over time.          of this work we also plan to explore additional extensions to
                                                                    the standard evolutionary computation algorithm to address
Once a set of parent matricies has been generated we then           problems of over-fitting such as χ2 reduction.
generate two child matricies of the same size as the parents
and then randomly select the node and arc members. In the
example shown in figures 3 and 4 the parents have nodes             6.   REFERENCES
{A,B,C,D} and {E,F,G} while the children have {E,B,G,D}              [1] John L Pfaltz and Azriel Rosenfeld. Web grammars.
and {A,F,C}. Thus we align the nodes in canonical order                  In Proceedings of the 1st international joint conference
and, for each node pair, we flip a coin to decide where they             on Artificial intelligence, pages 609–619. Morgan
are copied. If one parent is larger than the other than any              Kaufmann Publishers Inc., 1969.
additional nodes, in this case D, will be copied to the larger       [2] John L Pfaltz. Web grammars and picture description.
child. We then perform a comparable exchange process for                 Computer Graphics and Image Processing,
the arcs. Each arc or potential arc is defined uniquely in the           1(2):193–220, 1972.
matricies by its endpoints. We thus align the lists of arcs          [3] Horst Bunke. Attributed programmed graph
in a comparable manner and then decide randomly which                    grammars and their application to schematic diagram
arc, or empty arc, to copy. As with the nodes, extra arcs                interpretation. IEEE Transactions on Pattern
from the larger parent, in this case 3,5, and one ∅ are copied           Analysis and Machine Intelligence, 4(6):574–582, 1982.
directly into the larger of the two children.                        [4] Michihiro Kuramochi and George Karypis. Finding
                                                                         frequent patterns in a large sparse graph*. Data
5.    FUTURE WORK                                                        mining and knowledge discovery, 11(3):243–271, 2005.
In this paper we presented a method for the induction of             [5] Keven Ates, Jacek Kukluk, Lawrence Holder, Diane
augmented graph grammars through evolutionary computa-                   Cook, and Kang Zhang. Graph grammar induction on
tion. We are presently applying this work to the automatic               structural data for visual programming. In Tools with
induction of empirically-valid rules for student-produced ar-            Artificial Intelligence, 2006. ICTAI’06. 18th IEEE
gument diagrams. This work will serve to extend our prior                International Conference on, pages 232–242. IEEE,
efforts on the use of augmented graph grammars for student               2006.
 [6] Francesc Rosselló and Gabriel Valiente. Graph                   CSREA Press, 2008.
     transformation in molecular biology. In Formal              [17] Xifeng Yan and Jiawei Han. gspan: Graph-based
     Methods in Software and Systems Modeling, pages                  substructure pattern mining. In Proceedings of the
     116–133. Springer, 2005.                                         IEEE International Conference on Data Mining
 [7] Luc Dehaspe, Hannu Toivonen, and Ross D King.                    (ICDM 2002), pages 721–724. IEEE, 2002.
     Finding frequent substructures in chemical                  [18] Akihiro Inokuchi, Takashi Washio, and Hiroshi
     compounds. In KDD, volume 98, page 1998, 1998.                   Motoda. An apriori-based algorithm for mining
 [8] Stefan Kramer, Luc De Raedt, and Christoph Helma.                frequent substructures from graph data. In Principles
     Molecular feature mining in hiv data. In Proceedings             of Data Mining and Knowledge Discovery, pages
     of the seventh ACM SIGKDD international conference               13–23. Springer, 2000.
     on Knowledge discovery and data mining, pages               [19] Michihiro Kuramochi and George Karypis. Frequent
     136–143. ACM, 2001.                                              subgraph discovery. In Proceedings IEEE International
 [9] Wenke Lee and Salvatore J Stolfo. A framework for                Conference on Data Mining. (ICDM 2001), pages
     constructing features and models for intrusion                   313–320. IEEE, 2001.
     detection systems. ACM transactions on Information          [20] MICHIHIRO Kuramochi and George Karypis. Finding
     and system security (TiSSEC), 3(4):227–261, 2000.                topological frequent patterns from graph datasets.
[10] Calvin Ko. Logic induction of valid behavior                     Mining Graph Data, pages 117–158, 2006.
     specifications for intrusion detection. In Proceedings of   [21] Nitish Manocha, Diane J Cook, and Lawrence B
     the IEEE Symposium on Security and Privacy. (S&P                 Holder. Cover story: structural web search using a
     2000), pages 142–153. IEEE, 2000.                                graph-based discovery system. Intelligence,
[11] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast                12(1):20–29, 2001.
     algorithms for mining association rules. In Proceedings     [22] Diane J. Cook, Lawrence B. Holder, and G. Michael
     of the 20th International Conference on very large               Youngblood. Graph-based analysis of human transfer
     data bases, VLDB, volume 1215, pages 487–499, 1994.              learning using a game testbed. IEEE Trans. on
[12] Collin F Lynch. Agg: Augmented graph grammars for                Knowl. and Data Eng., 19:1465–1478, November 2007.
     complex heterogeneous data. In Proceedings of the           [23] Charles Darwin. On the Origin of Species by Means of
     first international workshop on Graph-Based                      Natural Selection, or the Preservation of Favoured
     Educational Data Mining (GEDM 2014).                             Races in the Struggle for Life. John Murray:
[13] Collin F. Lynch and Kevin D. Ashley. Empirically                 Albermarle Street: London, United Kingdom, 6
     valid rules for ill-defined domains. In John Stamper             edition, 1872.
     and Zachary Pardos, editors, Proceedings of The 7th         [24] Melanie Mitchell. An Introduction to Genetic
     International Conference on Educational Data Mining              Algorithms. MIT Press: Cambridge, Massachusetts,
     (EDM 2014). International Educational Datamining                 1999.
     Society IEDMS, 2014.                                        [25] Wolfgang Banzhaf. Genetic programming: an
[14] Collin F. Lynch, Kevin D. Ashley, and Min Chi. Can               introduction on the automatic evolution of computer
     diagrams predict essay grades? In Stefan                         programs and its applications. Morgan Kaufmann
     Trausan-Matu, Kristy Elizabeth Boyer, Martha E.                  Publishers ; Heidelburg : Dpunkt-verlag; San
     Crosby, and Kitty Panourgia, editors, Intelligent                Francisco, California, 1998.
     Tutoring Systems, Lecture Notes in Computer Science,        [26] Collin F. Lynch, Kevin D. Ashley, Niels Pinkwart, and
     pages 260–265. Springer, 2014.                                   Vincent Aleven. Argument graph classification with
[15] Sherry E. Marcus, Melanie Moy, and Thayne Coffman.               genetic programming and c4.5. In Ryan
     Social network analysis. In Diane J. Cook and                    Shaun Joazeiro de Baker, Tiffany Barnes, and
     Lawrence B. Holder, editors, Mining Graph Data,                  Joseph E. Beck, editors, The 1st International
     chapter 17, pages 443–468. John Wiley & Sons, 2006.              Conference on Educational Data Mining, Montreal,
[16] Chang Hun You, Lawrence B. Holder, and Diane J.                  Québec, Canada, June 20-21, 2008. Proceedings, pages
     Cook. Dynamic graph-based relational learning of                 137–146, 2008.
     temporal patterns in biological networks changing over      [27] Wikipedia. Spearman’s rank correlation coefficient —
     time. In Hamid R. Arabnia, Mary Qu Yang, and                     wikipedia, the free encyclopedia, 2013. [Online;
     Jack Y. Yang, editors, BIOCOMP, pages 984–990.                   accessed 27-February-2013].

</pre>