AGG: Augmented Graph Grammars for Complex Heterogeneous Data Collin F. Lynch Intelligent Systems Program and Learning Research & Development Center Pittsburgh, Pennsylvania, U.S.A. collinl@cs.pitt.edu ABSTRACT Individual student assignments can also contain heteroge- The central goal of educational datamining is to derive cru- neous data. Argument diagrams, for example, have been cial pedagogical insights from student, course, and tutorial used to teach writing, argumentation, and scientific reason- data. Real-world educational datasets are complex and het- ing [10, 2, 19]. These structures reify real-world arguments erogeneous comprising relational structures, social connec- as graphs using complex node and arc types to represent tions, demographic information, and long-term assignments. argumentative components such as hypothesis statements, In this paper I describe Augmented Graph Grammars a ro- citations, and claims. These complex elements can include bust formalism for graph rules that provides a natural struc- types, text fields for short notes or free-text assertions, links ture for evaluating complex heterogeneous graph data. I also to external resources, and other data. describe AGG an Augmented Graph Grammar engine writ- ten in Python and briefly describe its use. A sample student-produced argument diagram drawn from my thesis work at the University of Pittsburgh is shown in Figure 1. This work focused on the use of argument Keywords diagrams to support students in developing written scien- Augmented Graph Grammars, Graph Analysis, Argument tific reports and in identifying pedagogically-relevant dia- Diagrams, Complex Data, Heterogeneous Data gram structures that can be used to predict students’ subse- quent performance (see [8]). The diagram contains a central claim node representing a research claim. This node has 1. INTRODUCTION a single text field in which the claim is stated. This is, in The central goal of educational datamining is to draw peda- turn, connected to a set of citation nodes representing re- gogical insights from real-world student data, insights which lated work via a set of supporting, opposing, and undefined can inform instructors, students, and other researchers. While arcs colored green, red, and grey, respectively. The citation robust analytical formalisms have been defined for categor- nodes each contain two text fields, one for the citation in- ical, numerical, and relational data most real-world educa- formation and the other for a summary of the cited work, tional data is complex and heterogeneous combining textual, while the arcs contain a single text field for the warrant or numerical, and relational features. In large course settings explanation of why the relationship holds. At the top of such as a lecture course or MOOC, for example, students the diagram there is a single disjoint hypothesis node which may form dynamic working groups and collaborate on com- contains two text fields: a conditional or IF field, and a plex assignments. They may also be given a flexible set conditional or THEN field. of reading, writing, or problem-solving tasks that they can choose to complete in any order. This process data can be This diagram contains a number of pedagogically-relevant encoded as a graph with nodes representing individual as- issues. Some of them are purely structural such as the dis- signments and reading materials and arcs representing group joint hypothesis node, and the fact that the supporting and relationships or traversal order. In order to capture impor- opposing arcs are drawn from the claim to the citations and tant features of this rich graph data and to identify key not vice-versa. It also contains more complex semantic is- relationships between teamwork, written text, and perfor- sues such as the fact that the text fields on the arcs contain mance, it is necessary to apply a rule structure that can summary information for the cites not explanations of the capture them naturally. relationship, and the fact that the opposing citations, cita- tions that disagree about the central claim node have not been distinguished from one-another via a comparison arc. Problems such as these can be detected via complex rules, and I have previously shown that the presence of such prob- lems are predictive of students’ subsequent performance [8, 10, 9]. This detection and remediation, however requires the development of rules that can incorporate complex struc- tural and textual information. Figure 1: A segment of a student-produced LASAD diagram representing an introductory argument. It contains a central claim node surrounded by citation nodes. The isolated node is a hypothesis that has not been integrated into the argument. Automatic graph analysis is central to a number of research graph elements, and whose productions transpose one graph domains including strategy transfer in games [4], automatic to another [17]. More formally, they define graph-grammars recommendations [1], cheminformatics [12], and social net- and productions as: work detection [11]. Graph analysis algorithms have been used to define educational communities [15, 16, 5]) and to automatically grade existing datasets [8, 10, 9]. Graphical Definition 3.6 A graph grammar GG is a tu- structures have also been used in tutoring contexts to repre- ple (A; P ), with A a nonempty initial graph (the sent student work via argument diagrams of the type shown axiom), and P a set of graph grammar produc- above (see [14, 7] or to provide connection representations tions. To simplify forthcoming definitions, the [19] for student guidance. initial graph A will be treated as a special case of a production with an empty left-hand side. My focus in the present work is on the development of graph The set of all potential production instances of rules that is logical graph patterns that match arbitrary GG is abbreviated with P I(GG). graph structures based upon content and structure informa- tion. While arbitrary graph matching is NP-Hard (see [18]) it is of practical importance, particularly in relational do- Definition 3.2 A (graph grammar) production mains such as argument diagrams or student groups where p := (L; R) is a tuple of graphs over the same our goal is to identify complex structures that may be evi- alphabets of vertex and edge labels LV and LE. dence of deeper pedagogical issues. To that end, I will intro- Its left-hand side lhs(p) := L and its right-hand duce Augmented Graph Grammars a robust rule formalism side rhs(p) := R may have a common (context) for complex graph rules and will describe AGG and aug- subgraph K if the following restrictions are ful- mented graph grammar engine for educational datamining. filled: Both were developed as part of my thesis work at the Uni- versity of Pittsburgh. • ∀e ∈ E(K) ⇒ s(e) ∈ V (K) ∧ t(e) ∈ E(K) with E(K) := E(L) ∩ E(R) and V (K) := V (L)∩V (R) i.e. sources and targets of com- 2. AUGMENTED GRAPH GRAMMARS mon edges are common verticies of L and R, Graph Grammars, as described by Rekers and Schürr, are too. formal grammars whose atomic components are graphs or • ∀x ∈ L∩R ⇒ lL (x) = lR (x) i.e. common el- { ements of L and R do not differ with respect Nodes:{ to their labels in L and R. Citation:{ Cite(String) Cite.Words(StringSet) Thus graph grammars are systems of production rules anal- Summary(String) ogous to context-sensitive string grammars (see [18]). For Summary.Words(StringSet) reasons of efficiency Rekers and Schürr restrict their focus } to layered graph-grammars where all productions must be expansive with the left-hand-side being a subgraph of the Hypothesis: { right. Classical graph grammars, like string grammars, as- If(String) sume a fixed alphabet of simple statically-typed node and If.Words(StringSet) arcs and can be used both to generate matching graphs pro- Then(String) grammatically or to parse matching graphs via mapping and Then.Words(StringSet) decomposition. My focus in the present work is on graph } matching which occurs via iterative mapping. } Arcs:{ Let Gi =< {no , . . .}, {e(np , nq ), . . .} > and Gj =< {mo , . . .}, Comparison: { {e(mk , ml ), . . .} > be graphs and let M = {< na , M − b > ... . . .} me a mapping from Gi to Gj that links nodes of the two. In the context of a mapping, Gi and Gk are called the Types: { String, StringSet } source and target graphs respectively. A mapping MGi ,Gj ... from Gi to Gj is valid if and only if the following holds: ∀nx ∈ Gi : ∃ < nx , my >∈ MGi ,Gj Figure 2: An illustrative subset of a sample graph ontology for scientific argument diagrams. ¬∃{< nx , my >, < nr , mk >} ⊆ MGi ,Gj : (x = r) ∨ (y = k) ∀e(nx , ny ) ∈ Gi : {< nx , my >, < nr , mk >} ⊆ MGi ,Gj : ∃e(my , mk ) ∈ Gj 2. This illustrates the field definitions for the citation and hypothesis nodes shown above. Both node types contain For the remainder of this paper all elements in a source two sub-fields of type String. For each of these fields an graph will be labeled alphabetically (e.g. a, Q) while ele- additional function is defined ’*.Words’ which returns a set ments in the target graphs will be referenced numerically of all the words found in the field. −−−−→ (e.g. 1, 2, e(2, 3), e(4, 5)). Augmented Graph Grammars are a richer formalism for graph 2.2 Graph Classes rules that treat nodes and arcs as complex components with The core component of an augmented graph grammar is the optional sub-fields including flexible text elements or other graph class. A class Ci is defined by a 2-tuple < Si , Oi > types. Augmented graph grammars have been previously de- where Si is a graph schema and Oi is a set of constraints. scribed by Pinkwart et al. in [13]. There the authors focused A class defines a space of possible graphs which satisfy both on the use of augmented graph grammars for tutoring. An the schema and the constraints. Classes are not required Augmented Graph Grammar is defined by: a graph ontol- to be unique nor are the set of matching graphs for a given ogy that specifies the complex graph elements and functions pair of classes required to be disjoint. A sample named available; a set of graph classes that define matching graphs; class R07a is shown in 3. This class is designed to detect and optional graph productions and expressions that provide instances of Related Uncompared Opposition in scientific ar- for recursive class mapping and logical scoping. I will de- gument diagrams. That is subgraphs where there exists a scribe each of these components briefly below. For a more pair of citation nodes a, and b that disagree about a shared detailed description see [8]. target node t, are not connected via a comparison arc c, and which share some relevant textual content. As I noted above, this type of structure can be found in Figure 1. 2.1 Graph Ontology In a simple graph grammar of the type used by P Rekers and Schürr the set of possible node and arc types ( ) is fixed 2.2.1 Graph Schema with the elements being atomic, static, and unique. In order A Schema is a graph structure that defines a space of pos- to process complex structures such as the argument diagram sible graphs topologically. Schema are defined by a set of shown in Figure 1, a more complex structure is required. ground nodes (e.g. t, a, & b in Figure 3) which must match Thus augmented graph grammar ontologies are defined by a a single node in a target graph, a set of ground arcs that set of element types O = {N0 , . . . Nm , E0 , . . . , Ep } such that must likewise match a single arc in the target graph (e.g. each element has a unique list of fields and field types as well c), and an optional set of variable arcs which must match a as applicable functions over those fields. The ontology must nonempty subgraph defined by a graph production. By con- also specify appropriate relationships between the fields and vention, ground elements are denoted via lower-case names operations that can be used on them. while variable elements are denoted by capitalized names. While showing a complete ontology is beyond the scope of In addition to the ground and variable distinctions arcs within this paper an illustrative example can be found in Figure a schema may be one of four types: directed (e.g. O, & t c (R07a) O S (SC ) S a ¬c b a t.T ype ∈ {“Hypothesis00 , “Claim00 }   00   a.T ype = “Citation       b.T ype = “Citation 00 b     c.T ype = “Comparison00     (a.summary.words ∩ b.summary.words) 6= ∅ (SP 1 ) q S Figure 3: Related Uncompared Opposition A simple a c augmented graph grammar rule that detects related but uncompared counterarguments. The rule shows q.T ype = “Supporting 00  a two citation nodes (a, & b) that have opposing re- lationships with a shared hypothesis or claim node c (t) and do not have a comparison arc (c) drawn be- tween them. The arcs S and O represent recursive supporting and opposing paths. (SP 2 ) q a S ), of unknown direction, undirected (e.g. c), and unde- q.T ype = “Supporting 00  fined. Directed arcs will only match directed arcs in the base graph oriented in the same direction. Thus, given a − −−− → base graph containing an arc e(1, 2) and a schema with a −−−−→ −−−−→ S(a, c) = [ Sc ⇒ SP 1 | SP 2 ] directed arc e(n, m) the schema will only match cases where [2,∗] {< n, 1 >, < m, 2 >} ⊆ M . Unknown direction schema arcs may match a directed arc oriented in any order but will not match an undirected arc (e.g. e(2, 3)). Undirected arcs (e.g. Figure 4: A simple recursive rule production for S ¬c) will not match a directed arc. And, undefined arcs may that defines a supporting path. match a directed or undirected arc in any order. As the example shows arcs may be also be negated (e.g. ¬c) in which case the schema matches a graph if and only if no 2.3 Graph Productions match can be found for the negated arc. Thus the schema A graph production Cl ⇒ Cr1 |Cr2 ... is a context-sensitive shown will only match ground graphs with no arc between production rule that maps from a graph class containing a the elements assigned to a and b. More complicated cases of single production variable to one or more alternate expan- negation may be formed using graph expressions which are sions. Graph productions are used to match layered sub- defined below. graphs to the variable arcs. A simple recursive production − −−− → rule for the variable element S(b, t) is shown in Figure 4. The elements of a Schema must also be non-repeating that is, no two elements in a schema may be matched to the same The rule is defined by the context class SC , and the two pro- element in the target graph. Thus each element in a schema duction classes SP 1 and SP 2 . The context class is used as a must match at least one unique node or arc with variable key for the production application. It must contain exactly elements possibly accounting for more than one element. one variable arc, the production variable, and no constraints. The ground nodes a and c are context nodes and are used to ground the production for mapping. They must be present 2.2.2 Constraints in all of the production rules. All production rules must Constraints represent individual bounds or limits on the be expansive with each of the production classes contain- ground elements of a schema. Constraints are specified using ing at least one ground element not present in the context a set-theory syntax (e.g. t.T ype ∈ {“Hypothesis00 , “Claim00 }) class. Recursive productions are thus handled by iteratively and may draw on any of the node or arc features, subfields, grounding the mapping with additional context and, as per or functions specified in the ontology. Unary Constraints ap- the non-repeating requirement, these rules must consume ply to a single element (e.g. a.T ype = “Citation00 ). Binary additional elements of the graph. Production rules are thus Constraints (e.g. (a.summary.words ∩ b.summary.words) mapped in a layered fashion like the grammars defined by 6= ∅) specify a relationship between two distinct ground ele- Rekers and Schürr. ments. (C0 ) h At present the system uses a straightforward depth-first stack matching algorithm. Given a graph and a set of named rules, defined by a single graph class or expression, the sys- h.T ype = “Hypothesis00  tem will first match all ground nodes and arcs in the leftmost target class. Once each ground element has been matched h then the system will recursively match all variable elements in the target. If at any point the system cannot continue to (C1 ) O match elements it will pop up the stack and repeat. Rule matching is governed by the aforementioned restrictions of expansiveness and non-repetition. If a rule is defined by a c graph expression then each class match will set the context for subsequent rightmost matches. Rules defined by a single  h.T ype = “Hypothesis00  class are complete once a single match is found. The sys- c.T ype = “Supporting 00 tem is designed to find matches serially and can be called iteratively to extract all matching items. ∀C0 |¬∃C1 In addition to basic graph grammars the AGG toolkit has the capacity to define named rules. These are named graph Figure 5: A simple Graph expression that tests for expressions or individual classes that will be recorded if they unopposed hypotheses. match. In my thesis work, I applied the AGG engine to de- velop a set of 42 such rules the scientific argument diagrams. These ranged in complexity from graph classes defined by a single node to more complex recursive expressions that 2.4 Graph Expressions sought to identify disjoint subgraphs and unsupported hy- Graph expressions are logical rules of the form: potheses. The example rules and expressions shown in fig- ures 3 - 5 were adapted from this set. The rules were used for S0 C0 | S 1 C1 | ... | Sm Cm offline processing of the graphs and for prediction of student where each Ci is a graph class and each Si is a logical quan- grades [10, 9]. tifier from the set: {∀, ¬∀, ∃, ¬∃}. The expressions allow for existential and universal scoping and arbitrary negation As part of the analysis process the rules were evaluated on of graph classes. The expressions represent chained logical a set of 526 diagrams containing between 0 and 41 nodes structures with each ’|’ being read as “. . . such that . . . ”. A each. While exact efficiency data was not retained the per- sample graph expression is shown in Figure 5. This sample formance of the rules varied widely depending upon their expression asserts that for all hypothesis nodes in the target construction. General recursive rules such as a test for dis- graph there exist no citation nodes that oppose the target joint subgraphs performed quite inefficiently while smaller hypothesis. Thus it is a universal claim about a negated chained expressions were able to evaluate in a matter of sec- existential item. As this example illustrates graph expres- onds on a quad-core system. sions allow for more complex negation structures than are supported by the graph schema. 4. APPLICATIONS & FUTURE WORK The focus of this paper was on introducing Augmented Graph Graph expressions must be expansive or right-grounded such Grammars and the AGG engine. The formalism provides for that the following constraints hold: a natural and robust representation of complex graph rules for heterogeneous datasets. In prior work at the University ∀Cm≤i>0 ∈ E : Ci−1 ⊆g Ci of Pittsburgh I applied Augmented Graph Grammars to the detection of pedagogically relevant structures like Related Sm ∈ {∃, ¬∃} Uncompared Opposition (see Figure 3) in argument diagrams That is, the schema component of class Ci must be a sub- of the type shown in Figure 1. The focus of that study was graph of all classes class Ci+n . This also holds true for on testing whether student-produced argument diagrams are the constraints with all constraints present in class Ci be- diagnostic of their ability to produce written argumentative ing present in classes Ci+n . And the rightmost class in the essays. The study was conducted in a course on Psycholog- expression must also be an existential (∃) test with optional ical Research Methods at the University of Pittsburgh. negation. The graph features examined in that study included chained counterarguments which feature chains of oppositional infor- 3. AGG mation, and ungrounded hypotheses which are unrelated to AGG is a general-purpose augmented graph grammar en- cited works, and so on. The study is described in detail in [8], gine that implements recursive graph matching. The system and a discussion of the empirical validity of the individual was developed in Python to support analysis of the student- rules can be found in [9]. The rules were also used as the ba- produced argument diagrams described above. As such it is sis of predictive models for student grades described in [10]. flexible, functions across platforms, and supports complex The Augmented Graph Grammars were ideally-suited for graph ontologies and user-defined functions. The system this task as they allowed me to define clear and robust rules was designed in a modular fashion and can be linked with that incorporated the structural information in the graph, third-party libraries such as the NLTK [6]. textual information within the nodes and arcs, and the static element types. It was also possible to clearly present these pages 260–265. Springer, 2014. rules to domain experts for evaluation. [11] Sherry E. Marcus, Melanie Moy, and Thayne Coffman. Social network analysis. In Cook and Holder [3], While the AGG system is robust more work remains to be chapter 17, pages 443–468. done to make it widely available, and several open problems [12] Takashi Okada. Mining from chemical graphs. In Cook remain for future development. As noted above, arbitrary and Holder [3], chapter 14, pages 347–379. graph parsing is NP-Hard. Consequently, many rule classes [13] Niels Pinkwart, Kevin D. Ashley, Vincent Aleven, and are extremely inefficient. Despite this limitation, however, Collin F. Lynch. Graph grammars: An its technology real efficiency gains may be made via parallelization and for diagram representations. In David Wilson and memoization. I am presently researching possible improve- H. Chad Lane, editors, FLAIRS Conference, pages ments to the system and plan to test them with additional 433–438. AAAI Press, 2008. datasets. [14] Niels Pinkwart, Kevin D. Ashley, Collin F. Lynch, and Vincent Aleven. Evaluating an intelligent tutoring Acknowledgments system for making legal arguments with hypotheticals. This work was supported by National Science Foundation International Journal of Artificial Intelligence in Award No. 1122504, “DIP: Teaching Writing and Argumen- Education, 19(4):401–424, 2009. tation with AI-Supported Diagramming and Peer Review,” [15] Eve Powell and Tiffany Adviser-Barnes. A framework Kevin D. Ashley PI with Chris Schunn and Diane Litman, for the design and analysis of socially pervasive games. co-PIs. 2012. [16] Eve M Powell, Samantha Finkelstein, Andrew Hicks, 5. REFERENCES Thomas Phifer, Sandhya Charugulla, Christie [1] Euijin Choo, Ting Yu, Min Chi, and Yan Lindsay Sun. Thornton, Tiffany Barnes, and Teresa Dahlberg. Snag: Revealing implicit communities to incorporate into social networking games to facilitate interaction. In recommender systems. In Proceedings of the 15th CHI’10 Extended Abstracts on Human Factors in ACM Conference on Economics and Computation, Computing Systems, pages 4249–4254. ACM, 2010. Palo Alto, CA, 2014. Association For Computing [17] J. Rekers and Andy Schürr. Defining and parsing Machinery. (in press). visual languages with layered graph grammars. J. Vis. [2] Evi Chryssafidou and Mike Sharples. Lang. Comput., 8(1):27–55, 1997. Computer-supported planning of essay argument [18] Michael Sipser. Introduction to the Theory of structure. In Proceedings of the 5th International Computation. PWS Publishing Company, San Conference of Argumentation, June 2002. Francisco, 1997. [3] Diane J. Cook and Lawrence B. Holder, editors. [19] Daniel D. Suthers. Empirical studies of the value of Mining Graph Data. John Wiley & Sons, 2006. conceptually explicit notations in collaborative [4] Diane J. Cook, Lawrence B. Holder, and G. Michael learning. In Alexandra Okada, Simon Buckingham Youngblood. Graph-based analysis of human transfer Shum, and Tony Sherborne, editors, Knowledge learning using a game testbed. IEEE Trans. on Cartography, pages 1–23. Springer Verlag, 2008. Knowl. and Data Eng., 19:1465–1478, November 2007. [5] Rosta Farzan and Peter Brusilovsky. Annotated: A social navigation and annotation service for web-based educational resources. Journal of the New Review of Hypermedia and Multimedia (NRHM), 2008. [6] Dan Garrette, Peter Ljunglöf, Joel Nothman, Mikhail Korobov, Morten Minde Neergaard, and Steven Bird. The natural language toolkit for python (NLTK), 2014. [Online; accessed 04-29-2014]. [7] Frank Loll and Niels Pinkwart. Lasad: Flexible representations for computer-based collaborative argumentation. Int. J. Hum.-Comput. Stud., 71(1):91–109, 2013. [8] Collin F. Lynch. The Diagnosticity of Argument Diagrams, 2014. (defended January 30th 2014). [9] Collin F. Lynch and Kevin D. Ashley. Empirically valid rules for ill-defined domains. In John Stamper and Zachary Pardos, editors, Proceedings of The 7th International Conference on Educational Data Mining (EDM 2014). International Educational Datamining Society IEDMS, 2014. (In Press). [10] Collin F. Lynch, Kevin D. Ashley, and Min Chi. Can diagrams predict essays? In Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha E. Crosby, and Kitty Panourgia, editors, Intelligent Tutoring Systems, volume 8474 of Lecture Notes in Computer Science,