=Paper= {{Paper |id=Vol-1183/gedm_paper06 |storemode=property |title=Graph Mining and Outlier Detection Meet Logic Proof Tutoring |pdfUrl=https://ceur-ws.org/Vol-1183/gedm_paper06.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/VaculikNP14 }} ==Graph Mining and Outlier Detection Meet Logic Proof Tutoring== https://ceur-ws.org/Vol-1183/gedm_paper06.pdf
       Graph Mining and Outlier Detection Meet Logic Proof
                           Tutoring

                    Karel Vaculík                           Leona Nezvalová                       Luboš Popelínský
             Knowledge Discovery Lab                    Knowledge Discovery Lab                Knowledge Discovery Lab
               Faculty of Informatics                     Faculty of Informatics                 Faculty of Informatics
                Masaryk University                         Masaryk University                     Masaryk University
               Brno, Czech Republic                       Brno, Czech Republic                   Brno, Czech Republic
               xvaculi4@fi.muni.cz                      324852@mail.muni.cz                       popel@fi.muni.cz

ABSTRACT
We introduce a new method for analysis and evaluation of logic proofs constructed by undergraduate students, e.g. resolution or tableaux proofs. This method employs graph mining and outlier detection. The data has been obtained from a web-based system for input of logic proofs built at FI MU. The data contains the tree structure of each proof and also temporal information about all actions that a student performed, e.g. insertion of a node into a proof or its deletion, drawing or deletion of an edge, or text manipulations. We introduce a new method for multi-level generalization of subgraphs that is useful for characterization of logic proofs. We use this method for feature construction and perform class-based outlier detection on logic proofs represented by these new features. We show that this method helps to find unusual students' solutions and to improve semi-automatic evaluation of the solutions.

Keywords
logic proofs, resolution, educational data mining, graph mining, outlier detection

1.    INTRODUCTION
The resolution method is, together with the tableaux proof method, one of the advanced methods taught in undergraduate courses of logic. To evaluate a student solution properly, a teacher needs not only to check the result of the solution (the set of clauses is or is not contradictory) but also to analyse the sequence of steps that the student performed—with respect to the correctness of each step and with respect to the correctness of the whole sequence. We need to take all of this into account when we aim at building a tool for analysis of students' solutions. It has to be said that for detecting an error (e.g. resolution on two propositional letters, which is the most serious one) we can use a search method. However, detection of an error does not necessarily mean that the solution was completely incorrect. Moreover, by search we can hardly discover patterns, or sequences of patterns, that are typical of wrong solutions.

To find typical patterns in wrong solutions, we developed a new method for analysis of students' solutions of resolution proofs [13, 14] and showed its good performance. Solutions were manually rewritten into GraphML and then analysed. First, the frequent patterns were found by Sleuth [16], which was suitable for this type of data—unordered rooted trees. This algorithm finds all frequent subtrees in a set of trees for a given minimum support value. Such frequent subgraphs were generalized and these generalizations were used as new attributes.

The main drawback of a frequent subgraph mining algorithm itself is its strong dependence on a particular task, i.e. on the input set of clauses that has to be proved, or disproved, as contradictory. Moreover, the usage of such an algorithm is quite limited: by setting the minimum support to a very small value, the algorithm may end up generating excessively many frequent subtrees, which consumes both time and space. The problem is that we wish to include the infrequent substructures as well, because they often represent mistakes in students' solutions.

In this paper we propose a novel way of subgraph generalization that solves the problems mentioned above and is independent of the input set of clauses. We show that by means of graph mining and class outlier detection we are able to find outlying students' solutions and use them to improve the evaluation.

The structure of this paper is as follows. Section 2 brings related work. In Section 3 we introduce the source data. In Section 4 we introduce the improved method for construction of generalized resolution graphs. In Section 5 we bring the main result—detection of anomalous student solutions. Discussion and conclusion are in Sections 6 and 7, respectively.

2.    RELATED WORK
An overview of graph mining methods can be found in [5]. To the best of our knowledge, there is no work on analysis of student solutions of logic proofs by means of graph mining, although solving logic proofs, especially by means of the resolution principle, is one of the basic graph-based models of problem solving in logic. In problem-solving processes, graph mining has been used in [15] for mining concept maps, i.e. structures that model knowledge and behaviour patterns of a student, for finding commonly observed subconcept structures. A combination of multivariate pattern analysis and hidden Markov models for the discovery of major phases that students go through when solving complex problems in algebra is introduced in [1]. Markov decision processes for generating hints to students in logic proof tutoring from historical data have been studied in [2, 3, 12].
Figure 1: A correct and an incorrect resolution proof.

3.    DATA
By means of a web-based tool, each of 351 students solved at least three tasks randomly chosen from 19 exercises. All solutions were stored in a PostgreSQL database. The data set contained 873 different students' solutions of resolution proofs in propositional calculus, 101 of them being incorrect and 772 correct. Two examples of solutions are shown in Fig. 1.

Common errors in resolution proofs are the following: repetition of the same literal in a clause, resolving on two literals at the same time, incorrect resolution (the resolved literal is missing in the resolved clause), resolving on the same literals (not on one positive and one negative), resolving within one clause, the resolved literal is not removed, the clause is incorrectly copied, switching the order of literals in the clause, the proof is not finished, and resolving a clause with the negation of the second one (instead of the positive clause). For each kind of error we defined a query that detects it. For automatic evaluation we used only four of them; see the table ERRORS described in Appendix A. As the error of resolving on two literals at the same time is very common and is referred to later in the text, we denote this error as E3.

All actions that a student performed, like adding/deleting a node, drawing/removing an edge, or writing/deleting text in a node, were saved into a database together with time stamps. More details on this database and its tables can be found in Appendix A.

In the data there were 303 different clauses occurring in 7869 vertices; see the frequency distribution in Fig. 2. Approximately half of the clauses had an absolute frequency less than or equal to three.

Figure 2: Distribution of clause labels ordered by frequency.

4.    GENERALIZED SUBGRAPHS
In this section we describe feature construction from graph data. Representing a graph by the values of its vertices and edges is insufficient, as the structure of the graph also plays a significant role. A common practice is to use substructures of graphs as new features [5]. More specifically, boolean features are used and the value of a feature depends on whether the corresponding substructure occurs in the given instance or not.

As we showed earlier, a frequent subgraph mining algorithm is inappropriate here. To overcome the discussed problems, we created a new method for feature construction from our data. The idea of feature construction is to unify subgraphs which carry similar information but differ in form. An example of two subgraphs which differ only in variable letters and in the ordering of nodes and literals is shown on the left side of Fig. 3. The goal is to process such similar graphs to get one unique graph, as shown in the same figure on the right. In this way, we can better deal with different sets of clauses over different sets of variable letters. To deal with the minimum-support problem, the algorithm for frequent subgraphs was left out completely and all 3-node subgraphs, which are described later, were looked up.

4.1    Unification on Subgraphs
To unify different tasks that may appear in student tests, we defined a unification operator on subgraphs that allows finding so-called generalized subgraphs. Briefly, a generalized subgraph describes a set of particular subgraphs. E.g., a subgraph with parents {A, ¬B} and {A, B} and with the child {A} (the result of a correct use of the resolution rule), where A and B are propositional letters, is an instance of the generalized graph {Z, ¬Y}, {Z, Y} → {Z}, where Y and Z are variables (of type proposition). An example of an incorrect use of the resolution rule, {A, ¬B}, {A, B} → {A, A}, matches the generalized graph {Z, ¬Y}, {Z, Y} → {Z, Z}. In other words, each subgraph is an instance of one generalized subgraph. We can see that the common set unification rules [6] cannot be used here.

In this work we focus on generalized subgraphs that consist of three nodes: two parents and their child. Each such generalized subgraph corresponds to one way—correct or incorrect—of applying the resolution rule.

4.2    Ordering on Nodes
As a resolution proof is, in principle, an unordered tree, there is no order on the parents in these three-node graphs.
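As a rough illustration of the unification idea, the following sketch canonicalizes one resolution step. It is a simplified illustration, not the authors' exact algorithm: literals are plain strings such as "¬B", all function names are ours, ties between letters with equal counts are broken by insertion order instead of the full tie-breaking rules, and the lexicographic sort of literals differs slightly from the paper's convention that ¬ precedes letters.

```python
# Simplified sketch of 3-node subgraph generalization (Section 4).
# Literals are strings such as "A" or "¬B". Tie-breaking between letters
# with equal counts is simplified (insertion order), and at most six
# distinct letters are supported, so this is only an illustration.

NEG = "¬"

def counts(literals):
    """Map each letter to its (negative, positive) occurrence counts."""
    c = {}
    for lit in literals:
        neg = lit.startswith(NEG)
        letter = lit.lstrip(NEG)
        n, p = c.get(letter, (0, 0))
        c[letter] = (n + 1, p) if neg else (n, p + 1)
    return c

def signature(node):
    """The description of a node: its count pairs in sorted order."""
    return sorted(counts(node).values())

def generalize(parent1, parent2, child):
    # Step 1: order the parents by comparing their descriptions.
    if signature(parent2) < signature(parent1):
        parent1, parent2 = parent2, parent1
    # Step 2: merge all literals, order letters by their counts and
    # rename them to the variables Z, Y, X, ...
    letter_order = sorted(counts(parent1 + parent2 + child).items(),
                          key=lambda item: item[1])
    variables = ["Z", "Y", "X", "W", "V", "U"]
    rename = {letter: variables[i]
              for i, (letter, _) in enumerate(letter_order)}

    def apply_renaming(node):
        renamed = [(NEG if lit.startswith(NEG) else "")
                   + rename[lit.lstrip(NEG)] for lit in node]
        return sorted(renamed)  # Step 3: sort literals inside each node.

    return (apply_renaming(parent1), apply_renaming(parent2),
            apply_renaming(child))
```

Under these assumptions, a resolution step and the same step with its parents swapped both map to one canonical pattern.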
                                                                                                                                         
To unify two resolution steps that differ only in the order of their parents, we need to define an ordering on parent nodes¹. We take a node and, for each propositional letter, first count the number of negative and the number of positive occurrences of the letter; e.g., for {C, ¬B, A, ¬C} we have these counts: (0,1) for A, (1,0) for B, and (1,1) for C. Following the ordering Ω defined by (X,Y) ≤ (U,V) iff X < U ∨ (X = U ∧ Y ≤ V), we obtain for the node {C, ¬B, A, ¬C} the result {A, ¬B, C, ¬C} with description ∆ = ((0,1), (1,0), (1,1)). We compute this transformation for both parent nodes. Then we say that a node is smaller if its description ∆ is smaller with respect to the Ω ordering applied lexicographically per component. Continuing with our example, let the second node be {B, C, A, ¬A} with ∆ = ((0,1), (0,1), (1,1)). This second node is smaller than the first node {A, ¬B, C, ¬C}, since the first components are equal and (1,0) is greater than (0,1) in the case of the second components.

¹Ordering on nodes, not on clauses, as a student may write a text that does not correspond to any clause, e.g., {A, A}.

4.3    Generalization of Subgraphs
Now we can describe how the generalized graphs are built. After the reordering introduced in the previous paragraph, we assign variables Z, Y, X, W, V, U, ... to the propositional letters. To accomplish this, we first merge the literals from all nodes into one list and order it using the Ω ordering. After that, we assign variable Z to the letter with the smallest value, variable Y to the letter with the second smallest value, etc. If two values are equal, we compare the corresponding letters only within the first parent, alternatively within the second parent or the child. For example, let a student's (incorrect) resolution step be {C, ¬B, A, ¬C}, {B, C, A, ¬A} → {A, C}. We order the parents, getting the result {B, C, A, ¬A}, {C, ¬B, A, ¬C} → {A, C}. Next we merge all literals into one list, keeping multiple occurrences: {B, C, A, ¬A, C, ¬B, A, ¬C, A, C}. After reordering, we get {B, ¬B, C, C, C, ¬C, A, A, A, ¬A} with ∆ = ((1,1), (1,3), (1,3)). This leads to the following renaming of letters: B → Z, C → Y, and A → X. The final generalized subgraph is {Z, Y, X, ¬X}, {Y, ¬Z, X, ¬Y} → {X, Y}. In the case that one node contains more propositional letters and the nodes are equal (with respect to the ordering) on the intersection of their propositional letters, the longer node is defined as the greater one. At the end, the literals in each node are ordered lexicographically to prevent duplicities such as {Z, ¬Y} and {¬Y, Z}.

4.4    Complexity of Graph Pattern Construction
The complexity of pattern generalization depends on the number of patterns and the number of literals within each pattern. Let r be the maximum number of literals within a 3-node pattern. In the first step, the ordering of the parents must be computed, which takes O(r) for counting the numbers of negative and positive literals, O(r log r) for sorting, and O(r) for the comparison of two sorted lists. Letter substitution in the second step consists of counting the literals in the merged list in O(r), sorting the counts in O(r log r), and renaming the letters in O(r). Lexicographical reordering is performed in the last step and takes O(r log r). As the construction of the advanced generalized patterns is less complex than the construction of the patterns mentioned above, we can conclude that the time complexity of the whole generalization process on m patterns with duplicity removal is O(m² + m(4r + 3r log r)).

Figure 3: An example of pattern unification.

4.5    Higher-level Generalization
To improve the performance of the algorithms applied later, e.g. outlier detection algorithms, we created a further generalization method. It can be viewed as a higher-level generalization, as it generalizes the method described in the previous paragraphs. This method uses domain knowledge about the general resolution principle. It goes through all literals in a resolvent and deletes those which also appear in at least one parent. Each such literal is also deleted from the corresponding parent, or parents in case it appears in both of them. In the next step, the remaining literals in the parents are merged into a new list, dropped, and the remaining literals in the resolvent form another list, added. These two lists form a pattern of the higher-level generalization and we will write such patterns in the following format:

               {Li1, Li2, ..., Lin}; {Lj1, Lj2, ..., Ljm}
                    (added)               (dropped)

For example, if we take the generalized subgraph from the right side of Fig. 3, there is only one literal in the resolvent, ¬Y. We remove it from the resolvent and from both parents and we get dropped = [Z, ¬Z], added = [].

As a result, there may be patterns which differ only in the letters used and in the order of literals in the lists dropped and added. For this reason we then apply a method similar to the lower-level generalization. Specifically, we merge the lists dropped and added and compute the number of negative and positive literals for each letter in this merged list. The letters are then ordered primarily according to the number of occurrences of negative literals and secondarily according to the number of occurrences of positive literals. In the case of a tie we check the ordering of the affected letters only in the added list and, if needed, then also in the dropped list. If the tie occurs also in these lists, then the order does not matter. At the end, the old letters are replaced one by one by the new ones according to the ordering, and the new lists are sorted lexicographically. For example, let dropped = [X, ¬X] and added = [Y, Z, Z, ¬Z]. Then merged = [X, ¬X, Y, Z, Z, ¬Z] and the numbers of occurrences can be listed as count(X, merged) = (1,1), count(Y, merged) = (0,1), count(Z, merged) = (1,2). The ordering on the letters can be expressed as Y ≤ X ≤ Z. Using letters from the end of the alphabet, we perform the following substitution according to the created ordering: Y → Z, X → Y, Z → X. The final pattern will have the lists dropped = [¬Y, Y] and added = [¬X, X, X, Z], provided that the ¬ sign is lexicographically before alphabetic characters. Examples of patterns with absolute support ≥ 10 are shown in Tab. 1.
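The pruning step of this higher-level generalization can be sketched as follows. This is a simplified illustration under our own naming (literals as strings such as "¬X"); the final canonical letter renaming and lexicographic sorting are omitted.

```python
# Sketch of the higher-level generalization pruning: each resolvent
# literal that occurs in a parent is deleted from the resolvent and from
# every parent containing it (one occurrence per parent); the leftovers
# form the `added` (resolvent) and `dropped` (parents) lists. The final
# letter renaming and sorting are omitted here.

def higher_level(parent1, parent2, resolvent):
    p1, p2 = list(parent1), list(parent2)  # work on copies
    added = []
    for lit in resolvent:
        in_parent = False
        if lit in p1:
            p1.remove(lit)  # delete one occurrence from the first parent
            in_parent = True
        if lit in p2:
            p2.remove(lit)  # ... and from the second parent, if present
            in_parent = True
        if not in_parent:
            added.append(lit)  # literal survives in the resolvent
    dropped = p1 + p2  # remaining parent literals
    return added, dropped
```

For the generalized pattern {¬X, ¬Z, Y, Z}, {¬X, ¬Y, ¬Y, ¬Z, Z} → {¬X, ¬Y, X}, this yields added = [X] and the six leftover parent literals as dropped, matching the example worked out in the next section.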
4.6    Generalization Example
In this section we illustrate the whole generalization process with an example. Assume that the following 3-node subgraph has to be generalized:

P1 = {¬C, ¬A, ¬C, D, ¬D}, P2 = {¬D, ¬A, D, C} → {¬A, A, ¬C}

First, the parents are checked and possibly reordered. For each letter we compute the number of negative and positive literals in either parent. Specifically, count(A, P1) = (1,0), count(C, P1) = (2,0), count(D, P1) = (1,1), count(A, P2) = (1,0), count(C, P2) = (0,1), count(D, P2) = (1,1). The obtained counts are sorted lexicographically for each parent and the two chains are compared lexicographically:

            ((1, 0), (1, 1), (2, 0)) > ((0, 1), (1, 0), (1, 1))

In this case, the result was already decided by comparing the first two pairs, (1,0) and (0,1). Thus, the second parent is smaller and the parents should be switched:

P1′ = {¬D, ¬A, D, C}, P2′ = {¬C, ¬A, ¬C, D, ¬D} → {¬A, A, ¬C}

Now, all three nodes are merged into one list:

        S = {¬D, ¬A, D, C, ¬C, ¬A, ¬C, D, ¬D, ¬A, A, ¬C}

Once again, the numbers of negative and positive literals are computed: count(A, S) = (3,1), count(C, S) = (3,1), count(D, S) = (2,2). Since count(A, S) = count(C, S), we also check the counts in the first parent, P1′. As count(C, P1′) = count(C, P2) < count(A, P2) = count(A, P1′), letter C is inserted before A. Finally, the letters are renamed according to the created order: D → Z, C → Y, A → X. After the renaming and the lexicographical reordering of literals, we get the following generalized pattern:

       {¬X, ¬Z, Y, Z}, {¬X, ¬Y, ¬Y, ¬Z, Z} → {¬X, ¬Y, X}

Next, we also want to obtain the higher-level generalization of this pattern. The procedure goes through all literals in the resolvent and deletes those that occur in at least one parent. This step results in a pruned version of the pattern:

                  {¬Z, Y, Z}, {¬Y, ¬Z, Z} → {X}

The parents from the pruned pattern are merged into a new list, dropped, and the resolvent is used in a list, added. Thus, added = {X} and dropped = {¬Z, Y, Z, ¬Y, ¬Z, Z}. Now it is necessary to rename the letters once again. The lists added and dropped are merged together and the same subroutine is used as before—now the lists can be seen as two nodes instead of three. In this case, the renaming goes as follows: X → Z, Y → Y, Z → X. At the end, the literals in both lists are sorted lexicographically and the final higher-level pattern is:

                  {Z}; {¬X, ¬X, ¬Y, X, X, Y}
                  (added)         (dropped)

            Table 1: Higher-level patterns with support ≥ 10

               Pattern (added; dropped)          Support
               {}; {¬Z, Z}                       3345
               {}; {¬Y, ¬Z, Y, Z}                59
               {¬Z}; {¬Y, Y}                     18
               {}; {¬Z}                          13
               {}; {}                            10

4.7    Use of Generalized Subgraphs
This section puts all the information from the previous sections together and describes how the generalized patterns are used as new features. Input data in the form of nodes and edges are transformed into attributes of two types: the generalized patterns of the lower level can be considered the first type and the patterns of the higher-level generalization the second type. One boolean attribute is created for each generalized pattern. The value of such an attribute is TRUE if the corresponding pattern occurs in the given resolution proof, and FALSE otherwise. Following this procedure, the resolution proofs can be transformed into an attribute-value representation as shown in Table 2. Such a representation allows us to use many existing machine learning algorithms.

      Table 2: Attribute-value representation of resolution proofs

         Instance     Pattern1     Pattern2      ...   Patternm
             1          TRUE        FALSE        ...    FALSE
            ...          ...          ...        ...      ...
             n         FALSE        FALSE        ...     TRUE

5.    OUTLIER DETECTION
5.1    Mining Class Outliers
In this section we present the main result, obtained from outlier detection. We observed that student creativity is more advanced than ours, and that the results of the queries for error detection must be used carefully. Detection of anomalous solutions—either abnormal, with a picturesque error, or incorrectly classified—helps to improve the tool for automatic evaluation, as will be shown later.

Here we focus only on outliers for the classes created from error E3, the resolution on two literals at the same time, as it was the most common error. This means that the data can be divided into two groups, depending on whether an instance contains error E3 or not. For other types of errors, the analysis would be similar. We also present only results computed on the higher-level generalized patterns. The reason is that they generally achieved much higher outlier scores than the lower-level patterns.

The data we processed had been labeled. Unlike in common outlier detection, where we look for outliers that differ from the rest of the "normal" data, we needed to exploit the information about a class. That is why we used weka-peka [9], which looks for class outliers [8, 10] using Random Forests (RF) [4]. The main idea of weka-peka lies in a different computation of the proximity matrix in RF—it also exploits the information about the class label [9]. We used the following settings:

NumberOfTrees=1000
NumberOfRandomFetaures=7
FeatureRanking=gini
                                       Table 3: Top outliers for data grouped by error E3

       instance   error E3   outlier score      significant patterns                        significant missing patterns
                                                [(AScore) added; dropped]                   [(AScore) added; dropped]
           270       no              131.96     (0.96) looping                              (−0.99) {}; {¬Z, Z}
           396       no              131.96     (0.96) looping                              (−0.99) {}; {¬Z, Z}
           236       no               73.17     (0.99) {}; {¬Y, ¬Z,Y }
           187       no               61.03     (0.99) {¬Z}; {¬Y,Y }
                                                (0.99) {}; {¬Y, ¬Z,Y }
           438      yes               54.43     (1.00) {Z}; {¬X, ¬Y, X,Y }                  (−0.94)   {}; {¬Y, ¬Z,Y, Z}
           389      yes               52.50     (1.00) {}; {¬Y, ¬Z,Y }                      (−0.94)   {}; {¬Y, ¬Z,Y, Z}
                                                                                            (−0.81)   {}; {¬Z, Z}
            74      yes               15.91     (0.98)   {¬Z}; {¬X, ¬Y, X,Y }               (−0.94)   {}; {¬Y, ¬Z,Y, Z}
                                                (0.98)   {}; {¬X, ¬Y, ¬Z, X,Y, Z}
           718      yes               15.91     (0.98)   {¬Z}; {¬X, ¬Y, X,Y }               (−0.94) {}; {¬Y, ¬Z,Y, Z}
                                                (0.98)   {}; {¬X, ¬Y, ¬Z, X,Y, Z}




                                   Figure 4: Drawings of the outlying instances from Table 3.

                                     Table 4: Classification results for frequent subgraphs


Used attributes                                        Algorithm    Accuracy [%]   Precision (class "incorrect")   Recall (class "incorrect")
low-level generalization                               SVM (SMO)       *95.2                   0.94                          0.61
both levels of generalization                          SVM (SMO)       *96.9                   0.95                          0.74
both levels of generalization                          J48              96.1                  *0.98                          0.68
both levels of generalization, E3 patterns discarded   J48             *95.4                   0.87                          0.72
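The per-class precision and recall reported in Table 4 follow the usual confusion-matrix definitions. A minimal, self-contained sketch (the labels here are hypothetical toy data, not the paper's proofs):

```python
def evaluate(y_true, y_pred, positive='incorrect'):
    """Accuracy over all instances plus precision/recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical gold labels and classifier output:
gold = ['incorrect', 'incorrect', 'correct', 'correct', 'incorrect']
pred = ['incorrect', 'correct',   'correct', 'correct', 'incorrect']
print(evaluate(gold, pred))  # accuracy 4/5, precision 1.0, recall 2/3
```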
MaxDepthTree=unlimited
Bootstrapping=yes
NumberOfOutliersForEachClass=50

The main results of the outlier detection process are summarized in Table 3.
When analyzing the strongest outliers that weka-peka found, we can see that
there are three groups according to the outlier score. The two most outlying
examples, instances 270 and 396, differ significantly from the others. The
second cluster consists of four examples with an outlier score between 50 and
100, and the last group is comprised of instances with the lowest score of
15.91.

As weka-peka is based on Random Forest, we can interpret an outlier by
analyzing the trees that classify the given instance into a different class
than it was labeled. Such trees show which attribute or combination of
attributes leads to the resulting class. If we search for repeating patterns
in those trees, we can find the most important attributes making the given
instance an outlier. Using this method to interpret instance 270, we found
that its high outlier score is caused by not applying one specific pattern
(see Table 3). When this attribute is set to TRUE, the outlier score
decreases to −0.40. The attribute values of instances 396 and 270 are equal,
so the interpretation is the same as in the previous case. Similarly, we
found that the outlierness of instance 236 is given by the occurrence of one
specific pattern in the solution and the non-occurrence of another pattern.
The value of the corresponding attribute is the only difference between
instances 236 and 187; the occurrence/non-occurrence of this pattern is
therefore the reason why instance 236 achieves a higher outlier score than
instance 187. See again Table 3 for information about the particular
patterns. We further elaborate this approach to outlier explanation in the
following section.

5.2    Finding Significant Patterns
As the outlier score is the only output information about the outliers, we
created a simple method for finding the attributes with the most unusual
values. Let x_ij denote the value of the jth attribute of the ith instance,
which is either TRUE or FALSE for the pattern attributes, and let cl(i)
denote the class of the ith instance. Then for instance i we compute the
score of attribute j as:

AScore(i, j) =
    |{k | k ≠ i ∧ cl(i) = cl(k) ∧ x_kj = FALSE}| / |{k | k ≠ i ∧ cl(i) = cl(k)}|    if x_ij = TRUE,
  − |{k | k ≠ i ∧ cl(i) = cl(k) ∧ x_kj = TRUE}| / |{k | k ≠ i ∧ cl(i) = cl(k)}|     if x_ij = FALSE.

AScore expresses the proportion of other instances from the same class which
have a different value of the given attribute. If the outlier's attribute
equals FALSE, the only difference is in the sign of the score. For example,
consider our data set of 873 resolution proofs, out of which 53 proofs
contain error E3. Assume that one of the 53 proofs is an outlier with an
attribute equal to TRUE and that, of the remaining 52 proofs, only two have
the same value of this attribute as the outlier. Then the outlier's AScore
on this attribute is approximately 50/52 = 0.96, which indicates that the
value of this attribute is quite unusual.

In general, AScore ranges from −1 to 1. If the outlier resolution graph
contains a pattern which is unique for the class of the graph, then the
AScore of the corresponding attribute is equal to 1. On the other hand, if
the outlier misses a pattern that all other graphs contain, then the AScore
is equal to −1. An AScore equal to 0 means that all other instances are
equal to the outlier on the specified attribute.

5.3    Interpretation of the Outliers
Using the AScore metric we found the patterns which are interesting for the
outliers in Table 3. Patterns with AScore > 0.8 are listed in the
significant patterns column and patterns with AScore < −0.8 in the
significant missing patterns column.

All outliers from Table 3 are also displayed in Fig. 4, except for the last
one, which is almost identical to the penultimate one. Analysis of the
individual outliers lets us draw several conclusions. Recall that the
higher-level patterns listed in Table 3 are derived from lower-level
patterns consisting of three nodes, two parents and one resolvent; the
component added denotes literals which were erroneously added to the
resolvent, and the component dropped denotes literals from the parents which
participated in the resolution process. The two most outlying instances,
numbered 270 and 396, also contain one specific pattern, looping. This
pattern represents the ellipsis in a resolution tree, which is used for tree
termination if the tree cannot lead to a refutation. Both instances contain
this pattern, but neither of them contains the pattern of correct usage of
the resolution rule, which is listed in the significant missing patterns
column. The important thing is that these two instances contain neither
error E3 nor any other error. In fact, they were created from an assignment
which always leads to the looping pattern. This shows that it is not
sufficient to find all errors and check the termination of proofs; we should
also check whether the student performed at least a few steps using the
resolution rule. Otherwise we are not able to evaluate the student's skills.
Moreover, there may be situations in which a student only copies the
solution.

Instances with an outlier score below 100 differ less from the other
instances. In particular, instances 236 and 187 are more similar to correct
resolution proofs than the instances discussed above. Yet, they both contain
anomalous patterns such as {}; {¬Y, ¬Z, Y}. This particular error pattern
does not indicate error E3, as can be seen in Table 3. It is actually not
marked as any type of error, which tells us that it is necessary to extend
the list of potential errors in the automatic evaluator.

Continuing with the outlier instances, we get to those which contain error
E3. Two of them exceed the outlier-score boundary of 50, which suggests that
they are still relatively anomalous. The first outlier, instance 438,
differs from the other instances in an extra literal which was added into a
resolvent. Specifically, the number 1, which is not even a variable, can be
seen at the bottom of the resolution proof in Fig. 4. More interesting is
the second instance, number 389. Error E3 was detected already in the first
step of resolution, specifically when resolving on parents {s, t} and
{¬t, ¬s}. This would not be a strange thing if the resolvent were not s.
Such a resolvent raises the question whether it is an error of type E3 or
just a typing error; the latter is a less serious error.

The last two outliers in the table are almost the same, so only instance 74
is depicted in Fig. 4. These two instances have quite low outlier scores and
do not expose any shortcomings of our evaluation tool. Yet, they exhibit
some outlying features, such as resolving on three literals at the same
time.

6.    DISCUSSION
As we observed, it is not sufficient to detect only the errors; we need to
analyze the context in which an error appeared. Moreover, there are
solutions that are erroneous because they do not contain a particular
pattern or patterns. Outlier detection helped to find wrong students'
solutions that could not be detected by the system
of queries even though the set of queries has been carefully built and
tested on the test data. We also found a situation where a query did not
detect an error although it appeared in the solution. We are convinced that
with an increasing number of solutions we will be able to further improve
the detection of wrong solutions.

As we stressed in the introduction, this method has not been developed for
recognition of correct or incorrect solutions. However, to verify that the
feature construction is appropriate, we also learned various classifiers of
that kind. In previous work we used only generalized patterns as attributes
for classification with the allerrors class attribute. However, these
patterns were not sufficient for our current data. Repeating the same
experiments, we got the best result for SMO Support Vector Machines from
Weka [7], which had 95.2% accuracy; see Table 4. Precision and recall for
the class "incorrect" were 0.94 and 0.61, respectively. The minimum support
for pattern selection was 0% in this case. To improve classification
performance we used the new level of generalization. Using the same
settings, but now with both levels of generalized patterns, we achieved
96.9% accuracy, 0.95 precision and 0.74 recall for the class "incorrect".
Similar results were obtained when only the new level of generalization was
used, again with SMO. When ordered according to precision, the value 0.98
was achieved by J48, but the accuracy and recall were only 96.1% and 0.68,
respectively.

As one of the most common errors in resolution proofs is usage of the
resolution rule on two pairs of literals at the same time, we repeated the
experiment, now discarding all patterns capturing this specific kind of
error. In this scenario the performance dropped slightly but remained
high—J48 achieved 95.4% accuracy, 0.87 precision and 0.72 recall. For the
sake of completeness, the F1 score for the class "correct" varied between
0.97 and 0.99 in all the results above.

We also checked whether inductive logic programming (ILP) can help to
improve the performance under the same conditions. To ensure this, we did
not use any domain knowledge predicates that would bring extra knowledge.
For that reason, the domain knowledge contained only predicates common for
the domain of graphs, like node/3, edge/3, resolutionStep/3 and path/2. We
used the Aleph system [11]. The results were comparable with the method
described above.

7.    CONCLUSION AND FUTURE WORK
In this paper we introduced a new level of generalization for subgraphs of
resolution proof trees built by students. Generalized subgraphs created by
this special graph mining method are useful for representing logic proofs in
an attribute-value fashion. We showed how a class-based outlier detection
method can be used on these logic proofs by utilizing the generalized
subgraphs. We also discussed how the outlying proofs may be used to improve
the performance of our automatic proof evaluator. This method may also be
used for other types of data, such as tableaux proofs.

As future work we are going to analyse the temporal information, which was
saved together with the structural information of the logic proofs.

ACKNOWLEDGEMENTS
This work has been supported by the Faculty of Informatics, Masaryk
University, and by the grant CZ.1.07/2.2.00/28.0209 Computer-aided teaching
for computational and constructional exercises.

8.    REFERENCES
 [1] J. R. Anderson. Discovering the structure of mathematical problem
     solving. In Proceedings of EDM, 2013.
 [2] T. Barnes and J. Stamper. Toward automatic hint generation for logic
     proof tutoring using historical student data. In Proceedings of the 9th
     International Conference on Intelligent Tutoring Systems, pages
     373–382, 2008.
 [3] T. Barnes and J. Stamper. Automatic hint generation for logic proof
     tutoring using historical data. Educational Technology and Society,
     13(1):3–12, 2010.
 [4] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.
 [5] D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons,
     2006.
 [6] A. Dovier, E. Pontelli, and G. Rossi. Set unification. CoRR,
     cs.LO/0110023, 2001.
 [7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
     Witten. The WEKA data mining software: An update. SIGKDD Explor.
     Newsl., 11(1):10–18, Nov. 2009.
 [8] N. Hewahi and M. Saad. Class outliers mining: Distance-based approach.
     International Journal of Intelligent Technology, 2.
 [9] Z. Pekarcikova. Supervised outlier detection, 2013.
     http://is.muni.cz/th/207719/fi_m/diplomova_praca_pekarcikova.pdf.
[10] P. Spiros and F. Christos. Cross-outlier detection. In Proceedings of
     SSTD, pages 199–213, 2003.
[11] A. Srinivasan. The Aleph Manual, 2001.
     http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
     [Accessed: 2014-01-09].
[12] J. C. Stamper, M. Eagle, T. Barnes, and M. J. Croy. Experimental
     evaluation of automatic hint generation for a logic tutor. I. J.
     Artificial Intelligence in Education, 22(1-2):3–17, 2013.
[13] K. Vaculik and L. Popelinsky. Graph mining for automatic classification
     of logical proofs. In Proceedings of the 6th International Conference
     on Computer Supported Education CSEDU 2014, 2014.
[14] K. Vaculik, L. Popelinsky, E. Mrakova, and J. Jurco. Tutoring and
     automatic evaluation of logic proofs. In Proceedings of the 12th
     European Conference on e-Learning ECEL 2013, pages 495–502, 2013.
[15] J. S. Yoo and M. H. Cho. Mining concept maps to understand university
     students' learning. In Proceedings of EDM, 2012.
[16] M. J. Zaki. Efficiently mining frequent embedded unordered trees.
     Fundam. Inf., 66(1-2):33–52, Jan. 2005.
APPENDIX
A. DESCRIPTION OF DATA
CLAUSE - list of nodes from all graphs
. idclause - ID of the node
. coordinatex - x position in drawing
. coordinatey - y position in drawing
. timeofcreation - when the node was created
. timeofdeletion - when the node was deleted (if not deleted, value is "NA")
. idgraph - in which graph the node appears
. text - text label

EDGE - list of (directed) edges from all graphs
. idedge - ID of the edge
. starting - ID of the node from which this edge goes
. ending - ID of the node to which this edge goes
. timeofcreation
. timeofdeletion
. idgraph

ERRORS - errors found in resolution graphs (found by means of SQL queries)
. idgraph - ID of the graph
. error3 - resolving on two literals at the same time (1 = error occurred, 0 = not occurred)
. error4 - repetition of the same literal in a set
. error5 - resolving on identical literals
. error8 - no resolution performed, only union of two sets
. allerrors - any of the previously listed errors occurred / not occurred

GRAPH - list of graphs
. idgraph - ID of the graph
. logintime - start of graph creation
. clausetype - either set or ordered list
. resolutiontype - type of resolution, encoded by numbers (see table RESOLUTIONTYPES)
. assignment - textual assignment of task
. endtime - end of graph creation

MOVEMENT - list of coordinate changes of nodes
. idmovement - ID of the change
. idclause - ID of the node whose coordinates were changed
. coordinatex - new x coordinate
. coordinatey - new y coordinate
. time - time of the change

RESOLUTIONTYPES - encoding of resolution types
. typeid - ID (numeric encoding)
. typetext - textual value

TEXT - list of text (label) changes of nodes.
. idtext - ID of the change
. idclause - ID of the node whose text label was changed
. time - time of the change
. text - new text (label) value

TYPES - list of resolution type and clause type changes
. idtypes - ID of the change
. resolutiontype - new value of resolution type for specific graph
. clausetype - new value of clause type for the specific graph
. timeofchange - time of the change
. idgraph - ID of the graph whose values were changed
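The ERRORS table above records per-graph results of SQL error queries. A minimal sketch of how two of the listed checks (error3 and error8) could be tested on a single resolution step, assuming a hypothetical representation of clauses as Python sets of literal strings with '~' standing in for ¬:

```python
def neg(lit):
    """Complement of a literal; '~' stands in for the paper's ¬."""
    return lit[1:] if lit.startswith('~') else '~' + lit

def check_step(parent1, parent2, resolvent):
    """Return the set of detected error codes for one resolution step.
    Only error3 (resolving on two pairs of literals at once) and
    error8 (mere union of the parents, no resolution) are sketched here."""
    dropped = (parent1 | parent2) - resolvent
    # complementary pairs removed by this step
    pairs = {frozenset((l, neg(l))) for l in dropped if neg(l) in dropped}
    errors = set()
    if len(pairs) >= 2:
        errors.add('error3')
    if not pairs and resolvent == parent1 | parent2:
        errors.add('error8')
    return errors

print(check_step({'X', 'Y'}, {'~Y', 'Z'}, {'X', 'Z'}))  # correct step: set()
print(check_step({'X', 'Y'}, {'~X', '~Y'}, set()))      # error3
print(check_step({'X'}, {'Y'}, {'X', 'Y'}))             # error8
```

Note that representing clauses as sets makes error4 (repetition of the same literal) unrepresentable; a list-based representation would be needed for that check.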