=Paper= {{Paper |id=Vol-1183/gedm_paper06 |storemode=property |title=Graph Mining and Outlier Detection Meet Logic Proof Tutoring |pdfUrl=https://ceur-ws.org/Vol-1183/gedm_paper06.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/VaculikNP14 }} ==Graph Mining and Outlier Detection Meet Logic Proof Tutoring== https://ceur-ws.org/Vol-1183/gedm_paper06.pdf
       Graph Mining and Outlier Detection Meet Logic Proof
                           Tutoring

                    Karel Vaculík                           Leona Nezvalová                       Luboš Popelínský
             Knowledge Discovery Lab                    Knowledge Discovery Lab                Knowledge Discovery Lab
               Faculty of Informatics                     Faculty of Informatics                 Faculty of Informatics
                Masaryk University                         Masaryk University                     Masaryk University
               Brno, Czech Republic                       Brno, Czech Republic                   Brno, Czech Republic
               xvaculi4@fi.muni.cz                      324852@mail.muni.cz                       popel@fi.muni.cz

ABSTRACT
We introduce a new method for analysis and evaluation of logic proofs constructed by undergraduate students, e.g. resolution or tableaux proofs. This method employs graph mining and outlier detection. The data has been obtained from a web-based system for input of logic proofs built at FI MU. The data contains the tree structure of each proof and also temporal information about all actions that a student performed, e.g. insertion of a node into a proof or its deletion, drawing or deletion of an edge, or text manipulations. We introduce a new method for multi-level generalization of subgraphs that is useful for characterization of logic proofs. We use this method for feature construction and perform class-based outlier detection on logic proofs represented by these new features. We show that this method helps to find unusual students' solutions and to improve semi-automatic evaluation of the solutions.

Keywords
logic proofs, resolution, educational data mining, graph mining, outlier detection

1.    INTRODUCTION
The resolution method is, together with the tableaux proof method, one of the advanced methods taught in undergraduate courses of logic. To evaluate a student solution properly, a teacher needs not only to check the result of the solution (the set of clauses is or is not contradictory) but also to analyse the sequence of steps that the student performed—with respect to the correctness of each step and with respect to the correctness of the whole sequence. We need to take all of this into account when we aim at building a tool for analysis of students' solutions. It has to be said that for detecting an error (e.g. resolution on two propositional letters, which is the most serious one) we can use a search method. However, detection of an error does not necessarily mean that the solution was completely incorrect. Moreover, by search we can hardly discover patterns, or sequences of patterns, that are typical of wrong solutions.

To find typical patterns in wrong solutions, we developed a new method for analysis of students' solutions of resolution proofs [13, 14] and showed its good performance. Solutions were manually rewritten into GraphML and then analysed. First, the frequent patterns were found by Sleuth [16], which was suitable for this type of data—unordered rooted trees. This algorithm finds all frequent subtrees in a set of trees for a given minimum support value. Such frequent subgraphs were generalized and these generalizations were used as new attributes.

The main drawback of a frequent subgraph mining algorithm itself is its strong dependence on a particular task, i.e. on the input set of clauses that has to be proved, or disproved, as contradictory. Moreover, the usage of such an algorithm is quite limited: by setting the minimum support to a very small value, the algorithm may end up generating excessively many frequent subtrees, which consumes both time and space. The problem is that we wish to include the infrequent substructures as well, because they often represent mistakes in students' solutions.

In this paper we propose a novel way of subgraph generalization that solves the problems mentioned above and is independent of the input set of clauses. We show that by means of graph mining and class outlier detection we are able to find outlying students' solutions and use them to improve the evaluation.

The structure of this paper is as follows. Section 2 brings related work. In Section 3 we introduce the source data. In Section 4 we introduce the improved method for construction of generalized resolution graphs. In Section 5 we bring the main result—detection of anomalous student solutions. Discussion and conclusion are in Sections 6 and 7, respectively.

2.    RELATED WORK
An overview of graph mining methods can be found in [5]. To the best of our knowledge, there is no work on analysis of student solutions of logic proofs by means of graph mining, although solving logic proofs, especially by means of the resolution principle, is one of the basic graph-based models of problem solving in logic. In problem-solving processes, graph mining has been used in [15] for mining concept maps, i.e. structures that model knowledge and behaviour patterns of a student, for finding commonly observed subconcept structures. A combination of multivariate pattern analysis and hidden Markov models for the discovery of major phases that students go through when solving complex problems in algebra is introduced in [1]. Markov decision processes for generating hints to students in logic proof tutoring from historical data have been studied in [2, 3, 12].
Figure 1: A correct and an incorrect resolution proof.

3.    DATA
By means of a web-based tool, each of 351 students solved at least three tasks randomly chosen from 19 exercises. All solutions were stored in a PostgreSQL database. The data set contained 873 different students' solutions of resolution proofs in propositional calculus, 101 of them being incorrect and 772 correct. Two examples of solutions are shown in Fig. 1.

Common errors in resolution proofs are the following: repetition of the same literal in a clause, resolving on two literals at the same time, incorrect resolution (the resolved literal is missing in the resolved clause), resolving on the same literals (not on one positive and one negative), resolving within one clause, the resolved literal is not removed, the clause is incorrectly copied, switching the order of literals in the clause, the proof is not finished, and resolving a clause with the negation of the second one (instead of the positive clause). For each kind of error we defined a query that detects it. For automatic evaluation we used only four of them; see the table ERRORS described in Appendix A. As the error of resolving on two literals at the same time is very common and is referred to later in the text, we denote this error as E3.

All actions that a student performed, like adding/deleting a node, drawing/removing an edge, or writing/deleting text in a node, were saved into a database together with time stamps. More details on this database and its tables can be found in Appendix A.

In the data there were 303 different clauses occurring in 7869 vertices; see the frequency distribution in Fig. 2. Approximately half of the clauses had an absolute frequency less than or equal to three.

Figure 2: Distribution of clause labels ordered by frequency.

4.    GENERALIZED SUBGRAPHS
In this section we describe feature construction from graph data. Representing a graph by the values of its vertices and edges is insufficient, as the structure of the graph also plays a significant role. A common practice is to use substructures of graphs as new features [5]. More specifically, boolean features are used and the value of a feature depends on whether the corresponding substructure occurs in the given instance or not.

As we showed earlier, a frequent subgraph mining algorithm is inappropriate here. To overcome the discussed problems, we created a new method for feature construction from our data. The idea of feature construction is to unify subgraphs which carry similar information but differ in form. An example of two subgraphs which differ only in variable letters and in the ordering of nodes and literals is shown on the left side of Fig. 3. The goal is to process such similar graphs to get one unique graph, as shown in the same figure on the right. In this way, we can better deal with different sets of clauses over different sets of variable letters. To deal with the minimum-support problem, the algorithm for frequent subgraphs was left out completely and all 3-node subgraphs, which are described later, were looked up.

4.1    Unification on Subgraphs
To unify different tasks that may appear in student tests, we defined a unification operator on subgraphs that allows finding so-called generalized subgraphs. Briefly, a generalized subgraph describes a set of particular subgraphs. E.g., a subgraph with parents {A, ¬B} and {A, B} and with the child {A} (the result of a correct use of the resolution rule), where A and B are propositional letters, is an instance of the generalized graph {Z, ¬Y}, {Z, Y} → {Z}, where Y and Z are variables (of type proposition). An example of an incorrect use of the resolution rule, {A, ¬B}, {A, B} → {A, A}, matches the generalized graph {Z, ¬Y}, {Z, Y} → {Z, Z}. In other words, each subgraph is an instance of one generalized subgraph. We can see that the common set unification rules [6] cannot be used here.

In this work we focus on generalized subgraphs that consist of three nodes: two parents and their child. Each such generalized subgraph corresponds to one way—correct or incorrect—of applying the resolution rule.

4.2    Ordering on Nodes
As a resolution proof is, in principle, an unordered tree, there is no order on the parents in these three-node graphs.
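As a rough illustration of the unification idea, the following sketch canonicalizes one resolution step. It is a simplified illustration, not the authors' exact algorithm: literals are plain strings such as "¬B", all function names are ours, ties between letters with equal counts are broken by insertion order instead of the full tie-breaking rules, and the lexicographic sort of literals differs slightly from the paper's convention that ¬ precedes letters.

```python
# Simplified sketch of 3-node subgraph generalization (Section 4).
# Literals are strings such as "A" or "¬B". Tie-breaking between letters
# with equal counts is simplified (insertion order), and at most six
# distinct letters are supported, so this is only an illustration.

NEG = "¬"

def counts(literals):
    """Map each letter to its (negative, positive) occurrence counts."""
    c = {}
    for lit in literals:
        neg = lit.startswith(NEG)
        letter = lit.lstrip(NEG)
        n, p = c.get(letter, (0, 0))
        c[letter] = (n + 1, p) if neg else (n, p + 1)
    return c

def signature(node):
    """The description of a node: its count pairs in sorted order."""
    return sorted(counts(node).values())

def generalize(parent1, parent2, child):
    # Step 1: order the parents by comparing their descriptions.
    if signature(parent2) < signature(parent1):
        parent1, parent2 = parent2, parent1
    # Step 2: merge all literals, order letters by their counts and
    # rename them to the variables Z, Y, X, ...
    letter_order = sorted(counts(parent1 + parent2 + child).items(),
                          key=lambda item: item[1])
    variables = ["Z", "Y", "X", "W", "V", "U"]
    rename = {letter: variables[i]
              for i, (letter, _) in enumerate(letter_order)}

    def apply_renaming(node):
        renamed = [(NEG if lit.startswith(NEG) else "")
                   + rename[lit.lstrip(NEG)] for lit in node]
        return sorted(renamed)  # Step 3: sort literals inside each node.

    return (apply_renaming(parent1), apply_renaming(parent2),
            apply_renaming(child))
```

Under these assumptions, a resolution step and the same step with its parents swapped both map to one canonical pattern.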
                                                                                                                                         
To unify two resolution steps that differ only in the order of their parents, we need to define an ordering on parent nodes¹. We take a node and, for each propositional letter, first count the number of negative and the number of positive occurrences of the letter; e.g., for {C, ¬B, A, ¬C} we have these counts: (0,1) for A, (1,0) for B, and (1,1) for C. Following the ordering Ω defined by (X,Y) ≤ (U,V) iff X < U ∨ (X = U ∧ Y ≤ V), we obtain for the node {C, ¬B, A, ¬C} the result {A, ¬B, C, ¬C} with description ∆ = ((0,1), (1,0), (1,1)). We compute this transformation for both parent nodes. Then we say that a node is smaller if its description ∆ is smaller with respect to the Ω ordering applied lexicographically per component. Continuing with our example, let the second node be {B, C, A, ¬A} with ∆ = ((0,1), (0,1), (1,1)). This second node is smaller than the first node {A, ¬B, C, ¬C}, since the first components are equal and (1,0) is greater than (0,1) in the case of the second components.

¹Ordering on nodes, not on clauses, as a student may write a text that does not correspond to any clause, e.g., {A, A}.

4.3    Generalization of Subgraphs
Now we can describe how the generalized graphs are built. After the reordering introduced in the previous paragraph, we assign variables Z, Y, X, W, V, U, ... to the propositional letters. To accomplish this, we first merge the literals from all nodes into one list and order it using the Ω ordering. After that, we assign variable Z to the letter with the smallest value, variable Y to the letter with the second smallest value, etc. If two values are equal, we compare the corresponding letters only within the first parent, alternatively within the second parent or the child. For example, let a student's (incorrect) resolution step be {C, ¬B, A, ¬C}, {B, C, A, ¬A} → {A, C}. We order the parents, getting the result {B, C, A, ¬A}, {C, ¬B, A, ¬C} → {A, C}. Next we merge all literals into one list, keeping multiple occurrences: {B, C, A, ¬A, C, ¬B, A, ¬C, A, C}. After reordering, we get {B, ¬B, C, C, C, ¬C, A, A, A, ¬A} with ∆ = ((1,1), (1,3), (1,3)). This leads to the following renaming of letters: B → Z, C → Y, and A → X. The final generalized subgraph is {Z, Y, X, ¬X}, {Y, ¬Z, X, ¬Y} → {X, Y}. In the case that one node contains more propositional letters and the nodes are equal (with respect to the ordering) on the intersection of their propositional letters, the longer node is defined as the greater one. At the end, the literals in each node are ordered lexicographically to prevent duplicities such as {Z, ¬Y} and {¬Y, Z}.

4.4    Complexity of Graph Pattern Construction
The complexity of pattern generalization depends on the number of patterns and the number of literals within each pattern. Let r be the maximum number of literals within a 3-node pattern. In the first step, the ordering of the parents must be computed, which takes O(r) for counting the numbers of negative and positive literals, O(r log r) for sorting, and O(r) for the comparison of two sorted lists. Letter substitution in the second step consists of counting the literals in the merged list in O(r), sorting the counts in O(r log r), and renaming the letters in O(r). Lexicographical reordering is performed in the last step and takes O(r log r). As the construction of the advanced generalized patterns is less complex than the construction of the patterns mentioned above, we can conclude that the time complexity of the whole generalization process on m patterns with duplicity removal is O(m² + m(4r + 3r log r)).

Figure 3: An example of pattern unification.

4.5    Higher-level Generalization
To improve the performance of the algorithms applied later, e.g. outlier detection algorithms, we created a further generalization method. It can be viewed as a higher-level generalization, as it generalizes the method described in the previous paragraphs. This method uses domain knowledge about the general resolution principle. It goes through all literals in a resolvent and deletes those which also appear in at least one parent. Each such literal is also deleted from the corresponding parent, or parents in case it appears in both of them. In the next step, the remaining literals in the parents are merged into a new list, dropped, and the remaining literals in the resolvent form another list, added. These two lists form a pattern of the higher-level generalization and we will write such patterns in the following format:

               {Li1, Li2, ..., Lin}; {Lj1, Lj2, ..., Ljm}
                    (added)               (dropped)

For example, if we take the generalized subgraph from the right side of Fig. 3, there is only one literal in the resolvent, ¬Y. We remove it from the resolvent and from both parents and we get dropped = [Z, ¬Z], added = [].

As a result, there may be patterns which differ only in the letters used and in the order of literals in the lists dropped and added. For this reason we then apply a method similar to the lower-level generalization. Specifically, we merge the lists dropped and added and compute the number of negative and positive literals for each letter in this merged list. The letters are then ordered primarily according to the number of occurrences of negative literals and secondarily according to the number of occurrences of positive literals. In the case of a tie we check the ordering of the affected letters only in the added list and, if needed, then also in the dropped list. If the tie occurs also in these lists, then the order does not matter. At the end, the old letters are replaced one by one by the new ones according to the ordering, and the new lists are sorted lexicographically. For example, let dropped = [X, ¬X] and added = [Y, Z, Z, ¬Z]. Then merged = [X, ¬X, Y, Z, Z, ¬Z] and the numbers of occurrences can be listed as count(X, merged) = (1,1), count(Y, merged) = (0,1), count(Z, merged) = (1,2). The ordering on the letters can be expressed as Y ≤ X ≤ Z. Using letters from the end of the alphabet, we perform the following substitution according to the created ordering: Y → Z, X → Y, Z → X. The final pattern will have the lists dropped = [¬Y, Y] and added = [¬X, X, X, Z], provided that the ¬ sign is lexicographically before alphabetic characters. Examples of patterns with absolute support ≥ 10 are shown in Tab. 1.
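The pruning step of this higher-level generalization can be sketched as follows. This is a simplified illustration under our own naming (literals as strings such as "¬X"); the final canonical letter renaming and lexicographic sorting are omitted.

```python
# Sketch of the higher-level generalization pruning: each resolvent
# literal that occurs in a parent is deleted from the resolvent and from
# every parent containing it (one occurrence per parent); the leftovers
# form the `added` (resolvent) and `dropped` (parents) lists. The final
# letter renaming and sorting are omitted here.

def higher_level(parent1, parent2, resolvent):
    p1, p2 = list(parent1), list(parent2)  # work on copies
    added = []
    for lit in resolvent:
        in_parent = False
        if lit in p1:
            p1.remove(lit)  # delete one occurrence from the first parent
            in_parent = True
        if lit in p2:
            p2.remove(lit)  # ... and from the second parent, if present
            in_parent = True
        if not in_parent:
            added.append(lit)  # literal survives in the resolvent
    dropped = p1 + p2  # remaining parent literals
    return added, dropped
```

For the generalized pattern {¬X, ¬Z, Y, Z}, {¬X, ¬Y, ¬Y, ¬Z, Z} → {¬X, ¬Y, X}, this yields added = [X] and the six leftover parent literals as dropped, matching the example worked out in the next section.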
4.6    Generalization Example
In this section we illustrate the whole generalization process with an example. Assume that the following 3-node subgraph has to be generalized:

P1 = {¬C, ¬A, ¬C, D, ¬D}, P2 = {¬D, ¬A, D, C} → {¬A, A, ¬C}

First, the parents are checked and possibly reordered. For each letter we compute the number of negative and positive literals in either parent. Specifically, count(A, P1) = (1,0), count(C, P1) = (2,0), count(D, P1) = (1,1), count(A, P2) = (1,0), count(C, P2) = (0,1), count(D, P2) = (1,1). The obtained counts are sorted lexicographically for each parent and the two chains are compared lexicographically:

            ((1, 0), (1, 1), (2, 0)) > ((0, 1), (1, 0), (1, 1))

In this case, the result was already decided by comparing the first two pairs, (1,0) and (0,1). Thus, the second parent is smaller and the parents should be switched:

P1′ = {¬D, ¬A, D, C}, P2′ = {¬C, ¬A, ¬C, D, ¬D} → {¬A, A, ¬C}

Now, all three nodes are merged into one list:

        S = {¬D, ¬A, D, C, ¬C, ¬A, ¬C, D, ¬D, ¬A, A, ¬C}

Once again, the numbers of negative and positive literals are computed: count(A, S) = (3,1), count(C, S) = (3,1), count(D, S) = (2,2). Since count(A, S) = count(C, S), we also check the counts in the first parent, P1′. As count(C, P1′) = count(C, P2) < count(A, P2) = count(A, P1′), letter C is inserted before A. Finally, the letters are renamed according to the created order: D → Z, C → Y, A → X. After the renaming and the lexicographical reordering of literals, we get the following generalized pattern:

       {¬X, ¬Z, Y, Z}, {¬X, ¬Y, ¬Y, ¬Z, Z} → {¬X, ¬Y, X}

Next, we also want to obtain the higher-level generalization of this pattern. The procedure goes through all literals in the resolvent and deletes those that occur in at least one parent. This step results in a pruned version of the pattern:

                  {¬Z, Y, Z}, {¬Y, ¬Z, Z} → {X}

The parents from the pruned pattern are merged into a new list, dropped, and the resolvent is used in a list, added. Thus, added = {X} and dropped = {¬Z, Y, Z, ¬Y, ¬Z, Z}. Now it is necessary to rename the letters once again. The lists added and dropped are merged together and the same subroutine is used as before—now the lists can be seen as two nodes instead of three. In this case, the renaming goes as follows: X → Z, Y → Y, Z → X. At the end, the literals in both lists are sorted lexicographically and the final higher-level pattern is:

                  {Z}; {¬X, ¬X, ¬Y, X, X, Y}
                  (added)         (dropped)

            Table 1: Higher-level patterns with support ≥ 10

               Pattern (added; dropped)          Support
               {}; {¬Z, Z}                       3345
               {}; {¬Y, ¬Z, Y, Z}                59
               {¬Z}; {¬Y, Y}                     18
               {}; {¬Z}                          13
               {}; {}                            10

4.7    Use of Generalized Subgraphs
This section puts all the information from the previous sections together and describes how the generalized patterns are used as new features. Input data in the form of nodes and edges are transformed into attributes of two types: the generalized patterns of the lower level can be considered the first type and the patterns of the higher-level generalization the second type. One boolean attribute is created for each generalized pattern. The value of such an attribute is TRUE if the corresponding pattern occurs in the given resolution proof, and FALSE otherwise. Following this procedure, the resolution proofs can be transformed into an attribute-value representation as shown in Table 2. Such a representation allows us to use many existing machine learning algorithms.

      Table 2: Attribute-value representation of resolution proofs

         Instance     Pattern1     Pattern2      ...   Patternm
             1          TRUE        FALSE        ...    FALSE
            ...          ...          ...        ...      ...
             n         FALSE        FALSE        ...     TRUE

5.    OUTLIER DETECTION
5.1    Mining Class Outliers
In this section we present the main result, obtained from outlier detection. We observed that student creativity is more advanced than ours, and that the results of the queries for error detection must be used carefully. Detection of anomalous solutions—either abnormal, with a picturesque error, or incorrectly classified—helps to improve the tool for automatic evaluation, as will be shown later.

Here we focus only on outliers for the classes created from error E3, the resolution on two literals at the same time, as it was the most common error. This means that the data can be divided into two groups, depending on whether an instance contains error E3 or not. For other types of errors, the analysis would be similar. We also present only results computed on the higher-level generalized patterns. The reason is that they generally achieved much higher outlier scores than the lower-level patterns.

The data we processed had been labeled. Unlike in common outlier detection, where we look for outliers that differ from the rest of the "normal" data, we needed to exploit the information about a class. That is why we used weka-peka [9], which looks for class outliers [8, 10] using Random Forests (RF) [4]. The main idea of weka-peka lies in a different computation of the proximity matrix in RF—it also exploits the information about the class label [9]. We used the following settings:

NumberOfTrees=1000
NumberOfRandomFetaures=7
FeatureRanking=gini
                                       Table 3: Top outliers for data grouped by error E3

       instance   error E3   outlier score      significant patterns                        significant missing patterns
                                                [(AScore) added; dropped]                   [(AScore) added; dropped]
           270       no              131.96     (0.96) looping                              (−0.99) {}; {¬Z, Z}
           396       no              131.96     (0.96) looping                              (−0.99) {}; {¬Z, Z}
           236       no               73.17     (0.99) {}; {¬Y, ¬Z,Y }
           187       no               61.03     (0.99) {¬Z}; {¬Y,Y }
                                                (0.99) {}; {¬Y, ¬Z,Y }
           438      yes               54.43     (1.00) {Z}; {¬X, ¬Y, X,Y }                  (−0.94)   {}; {¬Y, ¬Z,Y, Z}
           389      yes               52.50     (1.00) {}; {¬Y, ¬Z,Y }                      (−0.94)   {}; {¬Y, ¬Z,Y, Z}
                                                                                            (−0.81)   {}; {¬Z, Z}
            74      yes               15.91     (0.98)   {¬Z}; {¬X, ¬Y, X,Y }               (−0.94)   {}; {¬Y, ¬Z,Y, Z}
                                                (0.98)   {}; {¬X, ¬Y, ¬Z, X,Y, Z}
           718      yes               15.91     (0.98)   {¬Z}; {¬X, ¬Y, X,Y }               (−0.94) {}; {¬Y, ¬Z,Y, Z}
                                                (0.98)   {}; {¬X, ¬Y, ¬Z, X,Y, Z}




                                   Figure 4: Drawings of the outlying instances from Table 3.

                                     Table 4: Classification results for frequent subgraphs


Used attributes                                        Algorithm    Accuracy [%]   Precision (class "incorrect")   Recall (class "incorrect")
low-level generalization                               SVM (SMO)       *95.2                   0.94                          0.61
both levels of generalization                          SVM (SMO)       *96.9                   0.95                          0.74
both levels of generalization                          J48              96.1                  *0.98                          0.68
both levels of generalization, E3 patterns discarded   J48             *95.4                   0.87                          0.72
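The per-class precision and recall reported in Table 4 follow the usual confusion-matrix definitions. A minimal, self-contained sketch (the labels here are hypothetical toy data, not the paper's proofs):

```python
def evaluate(y_true, y_pred, positive='incorrect'):
    """Accuracy over all instances plus precision/recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical gold labels and classifier output:
gold = ['incorrect', 'incorrect', 'correct', 'correct', 'incorrect']
pred = ['incorrect', 'correct',   'correct', 'correct', 'incorrect']
print(evaluate(gold, pred))  # accuracy 4/5, precision 1.0, recall 2/3
```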
MaxDepthTree=unlimited
Bootstrapping=yes
NumberOfOutliersForEachClass=50

The main results of the outlier detection process are summarized in Table 3.
When analyzing the strongest outliers that weka-peka found, we can see that
there are three groups according to the outlier score. The two most outlying
examples, instances 270 and 396, differ significantly from the others. The
second cluster consists of four examples with an outlier score between 50 and
100, and the last group is comprised of instances with the lowest score of
15.91.

As weka-peka is based on Random Forest, we can interpret an outlier by
analyzing the trees that classify the given instance into a different class
than it was labeled. Such trees show which attribute or combination of
attributes leads to the resulting class. If we search for repeating patterns
in those trees, we can find the most important attributes making the given
instance an outlier. Using this method to interpret instance 270, we found
that its high outlier score is caused by not applying one specific pattern
(see Table 3). When this attribute is set to TRUE, the outlier score
decreases to −0.40. The attribute values of instances 396 and 270 are equal,
so the interpretation is the same as in the previous case. Similarly, we
found that the outlierness of instance 236 is given by the occurrence of one
specific pattern in the solution and the non-occurrence of another pattern.
The value of the corresponding attribute is the only difference between
instances 236 and 187; the occurrence/non-occurrence of this pattern is
therefore the reason why instance 236 achieves a higher outlier score than
instance 187. See again Table 3 for information about the particular
patterns. We further elaborate this approach to outlier explanation in the
following section.

5.2    Finding Significant Patterns
As the outlier score is the only output information about the outliers, we
created a simple method for finding the attributes with the most unusual
values. Let x_ij denote the value of the jth attribute of the ith instance,
which is either TRUE or FALSE for the pattern attributes, and let cl(i)
denote the class of the ith instance. Then for instance i we compute the
score of attribute j as:

AScore(i, j) =
    |{k | k ≠ i ∧ cl(i) = cl(k) ∧ x_kj = FALSE}| / |{k | k ≠ i ∧ cl(i) = cl(k)}|    if x_ij = TRUE,
  − |{k | k ≠ i ∧ cl(i) = cl(k) ∧ x_kj = TRUE}| / |{k | k ≠ i ∧ cl(i) = cl(k)}|     if x_ij = FALSE.

AScore expresses the proportion of other instances from the same class which
have a different value of the given attribute. If the outlier's attribute
equals FALSE, the only difference is in the sign of the score. For example,
consider our data set of 873 resolution proofs, out of which 53 proofs
contain error E3. Assume that one of the 53 proofs is an outlier with an
attribute equal to TRUE and that, of the remaining 52 proofs, only two have
the same value of this attribute as the outlier. Then the outlier's AScore
on this attribute is approximately 50/52 = 0.96, which indicates that the
value of this attribute is quite unusual.

In general, AScore ranges from −1 to 1. If the outlier resolution graph
contains a pattern which is unique for the class of the graph, then the
AScore of the corresponding attribute is equal to 1. On the other hand, if
the outlier misses a pattern that all other graphs contain, then the AScore
is equal to −1. An AScore equal to 0 means that all other instances are
equal to the outlier on the specified attribute.

5.3    Interpretation of the Outliers
Using the AScore metric we found the patterns which are interesting for the
outliers in Table 3. Patterns with AScore > 0.8 are listed in the
significant patterns column and patterns with AScore < −0.8 in the
significant missing patterns column.

All outliers from Table 3 are also displayed in Fig. 4, except for the last
one, which is almost identical to the penultimate one. Analysis of the
individual outliers lets us draw several conclusions. Recall that the
higher-level patterns listed in Table 3 are derived from lower-level
patterns consisting of three nodes, two parents and one resolvent; the
component added denotes literals which were erroneously added to the
resolvent, and the component dropped denotes literals from the parents which
participated in the resolution process. The two most outlying instances,
numbered 270 and 396, also contain one specific pattern, looping. This
pattern represents the ellipsis in a resolution tree, which is used for tree
termination if the tree cannot lead to a refutation. Both instances contain
this pattern, but neither of them contains the pattern of correct usage of
the resolution rule, which is listed in the significant missing patterns
column. The important thing is that these two instances contain neither
error E3 nor any other error. In fact, they were created from an assignment
which always leads to the looping pattern. This shows that it is not
sufficient to find all errors and check the termination of proofs; we should
also check whether the student performed at least a few steps using the
resolution rule. Otherwise we are not able to evaluate the student's skills.
Moreover, there may be situations in which a student only copies the
solution.

Instances with an outlier score below 100 differ less from the other
instances. In particular, instances 236 and 187 are more similar to correct
resolution proofs than the instances discussed above. Yet, they both contain
anomalous patterns such as {}; {¬Y, ¬Z, Y}. This particular error pattern
does not indicate error E3, as can be seen in Table 3. It is actually not
marked as any type of error, which tells us that it is necessary to extend
the list of potential errors in the automatic evaluator.

Continuing with the outlier instances, we get to those which contain error
E3. Two of them exceed the outlier-score boundary of 50, which suggests that
they are still relatively anomalous. The first outlier, instance 438,
differs from the other instances in an extra literal which was added into a
resolvent. Specifically, the number 1, which is not even a variable, can be
seen at the bottom of the resolution proof in Fig. 4. More interesting is
the second instance, number 389. Error E3 was detected already in the first
step of resolution, specifically when resolving on parents {s, t} and
{¬t, ¬s}. This would not be a strange thing if the resolvent were not s.
Such a resolvent raises the question whether it is an error of type E3 or
just a typing error; the latter is a less serious error.

The last two outliers in the table are almost the same, so only instance 74
is depicted in Fig. 4. These two instances have quite low outlier scores and
do not expose any shortcomings of our evaluation tool. Yet, they exhibit
some outlying features, such as resolving on three literals at the same
time.

6.    DISCUSSION
As we observed, it is not sufficient to detect only the errors; we need to
analyze the context in which an error appeared. Moreover, there are
solutions that are erroneous because they do not contain a particular
pattern or patterns. Outlier detection helped to find wrong students'
solutions that could not be detected by the system
of queries even though the set of queries has been carefully built and
tested on the test data. We also found a situation where a query did not
detect an error although it appeared in the solution. We are convinced that
with an increasing number of solutions we will be able to further improve
the detection of wrong solutions.

As we stressed in the introduction, this method has not been developed for
recognition of correct or incorrect solutions. However, to verify that the
feature construction is appropriate, we also learned various classifiers of
that kind. In previous work we used only generalized patterns as attributes
for classification with the allerrors class attribute. However, these
patterns were not sufficient for our current data. Repeating the same
experiments, we got the best result for SMO Support Vector Machines from
Weka [7], which had 95.2% accuracy; see Table 4. Precision and recall for
the class "incorrect" were 0.94 and 0.61, respectively. The minimum support
for pattern selection was 0% in this case. To improve classification
performance we used the new level of generalization. Using the same
settings, but now with both levels of generalized patterns, we achieved
96.9% accuracy, 0.95 precision and 0.74 recall for the class "incorrect".
Similar results were obtained when only the new level of generalization was
used, again with SMO. When ordered according to precision, the value 0.98
was achieved by J48, but the accuracy and recall were only 96.1% and 0.68,
respectively.

As one of the most common errors in resolution proofs is usage of the
resolution rule on two pairs of literals at the same time, we repeated the
experiment, now discarding all patterns capturing this specific kind of
error. In this scenario the performance dropped slightly but remained
high—J48 achieved 95.4% accuracy, 0.87 precision and 0.72 recall. For the
sake of completeness, the F1 score for the class "correct" varied between
0.97 and 0.99 in all the results above.

We also checked whether inductive logic programming (ILP) can help to
improve the performance under the same conditions. To ensure this, we did
not use any domain knowledge predicates that would bring extra knowledge.
For that reason, the domain knowledge contained only predicates common for
the domain of graphs, like node/3, edge/3, resolutionStep/3 and path/2. We
used the Aleph system [11]. The results were comparable with the method
described above.

7.    CONCLUSION AND FUTURE WORK
In this paper we introduced a new level of generalization for subgraphs of
resolution proof trees built by students. Generalized subgraphs created by
this special graph mining method are useful for representing logic proofs in
an attribute-value fashion. We showed how a class-based outlier detection
method can be used on these logic proofs by utilizing the generalized
subgraphs. We also discussed how the outlying proofs may be used to improve
the performance of our automatic proof evaluator. This method may also be
used for other types of data, such as tableaux proofs.

As future work we are going to analyse the temporal information, which was
saved together with the structural information of the logic proofs.

ACKNOWLEDGEMENTS
This work has been supported by the Faculty of Informatics, Masaryk
University, and by the grant CZ.1.07/2.2.00/28.0209 Computer-aided teaching
for computational and constructional exercises.

8.    REFERENCES
 [1] J. R. Anderson. Discovering the structure of mathematical problem
     solving. In Proceedings of EDM, 2013.
 [2] T. Barnes and J. Stamper. Toward automatic hint generation for logic
     proof tutoring using historical student data. In Proceedings of the 9th
     International Conference on Intelligent Tutoring Systems, pages
     373–382, 2008.
 [3] T. Barnes and J. Stamper. Automatic hint generation for logic proof
     tutoring using historical data. Educational Technology and Society,
     13(1):3–12, 2010.
 [4] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.
 [5] D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons,
     2006.
 [6] A. Dovier, E. Pontelli, and G. Rossi. Set unification. CoRR,
     cs.LO/0110023, 2001.
 [7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
     Witten. The WEKA data mining software: An update. SIGKDD Explor.
     Newsl., 11(1):10–18, Nov. 2009.
 [8] N. Hewahi and M. Saad. Class outliers mining: Distance-based approach.
     International Journal of Intelligent Technology, 2.
 [9] Z. Pekarcikova. Supervised outlier detection, 2013.
     http://is.muni.cz/th/207719/fi_m/diplomova_praca_pekarcikova.pdf.
[10] P. Spiros and F. Christos. Cross-outlier detection. In Proceedings of
     SSTD, pages 199–213, 2003.
[11] A. Srinivasan. The Aleph Manual, 2001.
     http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
     [Accessed: 2014-01-09].
[12] J. C. Stamper, M. Eagle, T. Barnes, and M. J. Croy. Experimental
     evaluation of automatic hint generation for a logic tutor. I. J.
     Artificial Intelligence in Education, 22(1-2):3–17, 2013.
[13] K. Vaculik and L. Popelinsky. Graph mining for automatic classification
     of logical proofs. In Proceedings of the 6th International Conference
     on Computer Supported Education CSEDU 2014, 2014.
[14] K. Vaculik, L. Popelinsky, E. Mrakova, and J. Jurco. Tutoring and
     automatic evaluation of logic proofs. In Proceedings of the 12th
     European Conference on e-Learning ECEL 2013, pages 495–502, 2013.
[15] J. S. Yoo and M. H. Cho. Mining concept maps to understand university
     students' learning. In Proceedings of EDM, 2012.
[16] M. J. Zaki. Efficiently mining frequent embedded unordered trees.
     Fundam. Inf., 66(1-2):33–52, Jan. 2005.
APPENDIX
A. DESCRIPTION OF DATA
CLAUSE - list of nodes from all graphs
. idclause - ID of the node
. coordinatex - x position in drawing
. coordinatey - y position in drawing
. timeofcreation - when the node was created
. timeofdeletion - when the node was deleted (if not deleted, value is "NA")
. idgraph - in which graph the node appears
. text - text label

EDGE - list of (directed) edges from all graphs
. idedge - ID of the edge
. starting - ID of the node from which this edge goes
. ending - ID of the node to which this edge goes
. timeofcreation
. timeofdeletion
. idgraph

ERRORS - errors found in resolution graphs (found by means of SQL queries)
. idgraph - ID of the graph
. error3 - resolving on two literals at the same time (1 = error occurred, 0 = not occurred)
. error4 - repetition of the same literal in a set
. error5 - resolving on identical literals
. error8 - no resolution performed, only union of two sets
. allerrors - any of the previously listed errors occurred / not occurred

GRAPH - list of graphs
. idgraph - ID of the graph
. logintime - start of graph creation
. clausetype - either set or ordered list
. resolutiontype - type of resolution, encoded by numbers (see table RESOLUTIONTYPES)
. assignment - textual assignment of task
. endtime - end of graph creation

MOVEMENT - list of coordinate changes of nodes
. idmovement - ID of the change
. idclause - ID of the node whose coordinates were changed
. coordinatex - new x coordinate
. coordinatey - new y coordinate
. time - time of the change

RESOLUTIONTYPES - encoding of resolution types
. typeid - ID (numeric encoding)
. typetext - textual value

TEXT - list of text (label) changes of nodes.
. idtext - ID of the change
. idclause - ID of the node whose text label was changed
. time - time of the change
. text - new text (label) value

TYPES - list of resolution type and clause type changes
. idtypes - ID of the change
. resolutiontype - new value of resolution type for specific graph
. clausetype - new value of clause type for the specific graph
. timeofchange - time of the change
. idgraph - ID of the graph whose values were changed
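The ERRORS table above records per-graph results of SQL error queries. A minimal sketch of how two of the listed checks (error3 and error8) could be tested on a single resolution step, assuming a hypothetical representation of clauses as Python sets of literal strings with '~' standing in for ¬:

```python
def neg(lit):
    """Complement of a literal; '~' stands in for the paper's ¬."""
    return lit[1:] if lit.startswith('~') else '~' + lit

def check_step(parent1, parent2, resolvent):
    """Return the set of detected error codes for one resolution step.
    Only error3 (resolving on two pairs of literals at once) and
    error8 (mere union of the parents, no resolution) are sketched here."""
    dropped = (parent1 | parent2) - resolvent
    # complementary pairs removed by this step
    pairs = {frozenset((l, neg(l))) for l in dropped if neg(l) in dropped}
    errors = set()
    if len(pairs) >= 2:
        errors.add('error3')
    if not pairs and resolvent == parent1 | parent2:
        errors.add('error8')
    return errors

print(check_step({'X', 'Y'}, {'~Y', 'Z'}, {'X', 'Z'}))  # correct step: set()
print(check_step({'X', 'Y'}, {'~X', '~Y'}, set()))      # error3
print(check_step({'X'}, {'Y'}, {'X', 'Y'}))             # error8
```

Note that representing clauses as sets makes error4 (repetition of the same literal) unrepresentable; a list-based representation would be needed for that check.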