=Paper= {{Paper |id=Vol-1183/gedm_paper04 |storemode=property |title=Cross-Domain Performance of Automatic Tutor Modeling Algorithms |pdfUrl=https://ceur-ws.org/Vol-1183/gedm_paper04.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/Kumar14 }} ==Cross-Domain Performance of Automatic Tutor Modeling Algorithms== https://ceur-ws.org/Vol-1183/gedm_paper04.pdf

Cross-Domain Performance of
Automatic Tutor Modeling Algorithms
Rohit Kumar
Raytheon BBN Technologies
Cambridge, MA, USA
rkumar @ bbn.com

annotation of the behavior graph. As ITS are being deployed to a
ABSTRACT large active user pool, it is now possible to pilot the UI with a
small sample of learners to collect multiple behavior
In our recent work, we have proposed the use of multiple solution demonstrations. We can significantly reduce the Stage 3 effort of
demonstrations of a learning task to automatically generate a tutor ITS developers by using algorithms that can automatically create
model. We have developed a number of algorithms for this a generalized behavior graph from multiple demonstrations.
automation. This paper describes the application of these domain- Several algorithms to address this challenge have been proposed
independent algorithms to three datasets from different learning and evaluated [4].
domains (Mathematics, Physics, French). Besides verifying the
applicability of our approach across domains, we report several In this paper, we will study the applicability and performance of
domain specific performance characteristics of these algorithms these algorithms on publicly available datasets from three
which can be used to choose appropriate algorithms in a different learning domains. Section 3 summarizes the key
principled manner. While the Heuristic Alignment based characteristics of the four algorithms used in our study. Section 4
algorithm (Algorithm 2) may be the default choice for automatic describes learning domains and the corresponding datasets used in
tutor modeling, our empirical finding suggest that the Path this work. Results and Analysis from our experiments are
Pruning based algorithm (Algorithm 4) may be favored for presented in Section 5. Before diving into the algorithms, the next
language learning domains. section reviews related work on automation of tutor model
development.
Keywords
Tutor Modeling, Automation, Domain Independence, STEM 2. RELATED WORK
domains, Language Learning
Automation of tutor model development process has been
explored in different contexts using completely automated
1. INTRODUCTION methods as well as augmentation of authoring tools [6][7]. For
example, motivated by application in language learning, a series
Wide-scale transition of Intelligent Tutoring Systems (ITS) to the of workshops on the problem of automatic question generation [8]
real world demands a scalable ability to develop such systems. explored a number of information extraction and NLP techniques
The past decade has seen the first instantiations of that employ existing linguistic resources. Barnes and Stamper [9]
industrialization of ITS development in the form of commercial proposed a method that uses existing student solutions to generate
products for different learning domains as well as diverse user hint messages for the Logic Proof tutor. Recently, Eagle et al. [10]
populations. In addition to addressing non-technical challenges have used clustering of interaction network states as an approach
such as designing robust production processes around to the same problem.
multidisciplinary teams of domain and pedagogical experts [1],
the industrialization of this technology is enabled by technical In the context of knowledge-tracing and example-tracing tutors,
advancements such as the development of general purpose McLaren et al. [11] proposed the use of activity logs from novice
authoring tools [2] which has allowed a scalable workforce to users to bootstrap tutor model development. They developed
contribute to ITS development. software tools that integrate access to novice activity logs with
authoring tools. The baseline algorithm (Interaction Networks)
In this paper, we extend our recent work [3][4] on automatically used in our work is similar to the integrated data view used in this
developing Example-Tracing Tutors (ETTs) [5] using multiple prior work. Furthermore, the algorithms used in our work address
behavior demonstrations. Conventionally, ETTs are developed in some of the shortcomings of their work (e.g. inability to identify
three stages by trained domain experts: (1) User Interface (UI) “buggy” paths).
development, (2) Behavior demonstration, (3) Generalization and
In addition to tutor modeling, recent work has investigated
automated methods for improving domain and student models
[12] [13]. Sudol et al. [14] aggregated solution paths taken by
different learners to develop a probabilistic solution assessment
metric. Johnson et al. [15] are creating visualization tools for
interaction networks that combine learner traces from open-ended
problem solving environments. They have developed an algorithm
for reducing the complexity of combined networks to make them
more readable/navigable. In a similar spirit, work by Ritter et al.
[16] used clustering techniques to reduce the large feature space
of student models to assist in qualitative model interpretation.
3. GENERATING BEHAVIOR GRAPHS 3.1.4 Accurate
Automatic Behavior Graph Generation (ABGG) algorithms Behavior graphs should be error free. This includes being able to
analyze the similarities and difference between multiple solution accurately capture the correct and incorrect events by learners
demonstrations of a problem to induce a behavior graph that can depending on the current solution state. Edge accuracy measures
serve as a tutor model for the problem. the percentage of Correct & Incorrect edges that were accurately
generated by the algorithm. Error Rate is a frequency weighted
3.1 Behavior Graphs combination of edge accuracy that measures the fraction of learner
Behavior graphs [5] are directed graphs. The nodes in this graph events that will be inaccurately classified by the automatically
correspond to valid solution states. Non-terminal nodes represent generated behavior graph. We use the error rate of an automatically
partial solutions. Edges in the graph represent solution paths some generate behavior graph on held out demonstrations as the primary
of which are correct and lead to the next state while other are accuracy metric.
incorrect and usually lead back to the same state. Edges are 3.1.5 Robust
annotated with the conditions that a behavior event must meet to
traverse the path. One of the reasons for the success of expertly crafted ETTs is the
ability to use them with a wide range of learners under different
Behavior graphs may contain multiple paths between two nodes. deployment conditions. Automatically generated behavior graphs
Multiple paths are useful to facilitate learner’s exploration of should retain this characteristic; e.g., by identifying alternate paths
alternate solutions to a problem especially in ill-defined learning and unordered groups. It is not unforeseeable that the use of a
domains. Behavior graphs may also include unordered groups. As data-driven approach could contribute to creating behavior graphs
the name suggests, states within an unordered group may be that are more robust than those authored by a human expert.
traversed in any order.
Branching factor is the average number of data values available at
Well-constructed behavior graphs have several desirable each UI element. A large branching factor indicates the capability
characteristics which motivate the design of metrics we use to to process a large variety of learner inputs at each state. Also, the
evaluate ABGG algorithms. number and size of unordered groups is indicative of flexibility a
3.1.1 Effective graph affords to learners to explore the solution paths of a
problem.
Since the purpose of the behavior graphs is to serve as a tutor
model, the primary metric for evaluating these models is their Note that readability and robustness are complementary
learning efficacy measured via use of the models by a relevant characteristics of a behavior graph. For example, a highly
sample of learners. However, in this paper we focus only on the complex behavior graph may be very robust but may not be very
use of automated metrics that do not require access to a learner readable.
pool. Further, as we in section 5, the automatically generated 3.2 ABGG Algorithms
behavior graphs are not perfect. They require checking and
refinement by ITS developers before they can be used with We use four algorithms, introduced in our previous work [4], to
learners. generate behavior graphs using multiple solution traces of a
problem. The first algorithm (Algorithm 1) generates interaction
3.1.2 Readable networks by sequentially collapsing identical events in solution
One of the key characteristics of behavior graphs that makes them traces into a shared node and creating a branch whenever two
a popular model is that they are readable by ITS developers different events are found. Interaction networks have been used in
without requiring a deep understanding of computational or prior work [10][15].
cognitive sciences. Automatically created behavior graphs should Algorithm 2 uses a heuristic alignment technique [3] to align
be editable with existing authoring tools to facilitate necessary similar events across multiple solution traces. The alignment is
manual annotation and modifications. Ideally, ABGG algorithms used to obtain a sequence of traversal through the problem’s steps.
should create concise graphs without losing other desirable Furthermore, this algorithm is able to use the positional entropy of
characteristics. This may involve collapsing redundant paths and a sequence of elements while obtaining the optimal sequence to
even pruning spurious or infrequent edges. identify unordered groups.
The conciseness of a graph can be measured using the number of Similar to the above algorithm, Algorithm 3 finds the optimal
nodes and edges in the graph. Our primary readability metric, sequence between aligned events. However, this algorithm uses
Compression Ratio measures the rate at which an algorithm is the Center Star Algorithm [17] to align the multiple solution
able to reduce behavior events into behavior states (i.e. nodes) by traces instead of the heuristic used by Algorithm 2. The Center
finding similarities between events. Star Algorithm is a foundational algorithm used for aligning more
than two sequences of symbols. It is particularly suited for our
3.1.3 Complete application because it is polynomial time in computational
In order to minimize author effort, generated behaviors graphs complexity and it does not make any assumptions about the space
should be as complete for creating an ETT as possible. As a and relationship of symbols comprising the sequence.
minimal criterion, at least one valid path to the final solution
First order transition matrix computed from solution traces can be
should be included♦. Additionally, complete behaviors graphs are
used to represent a directed graph. Algorithm 4 considers ABGG
annotated with all the expected inputs by the learner. We use the
as the process of finding multiple paths in a directed graph.
Rate of Unseen Events in held out demonstrations as the primary
Specifically, the longest (non-repeating) path in this directed
metric to measure the completeness of our automatically
graph represents the most likely path through the solution steps.
generated behavior graphs.
Since, the problem of finding longest paths in general graphs is
known to be NP-hard, we employ a combination of bounded
longest path finding and an algorithm for finding multiple shortest trace/demonstration comprises of a sequence of user interface (UI)
paths [18] in a transformed transition matrix to obtain a number of events. Each event is represented as a 2-tuple e = (u, d) that
different paths through the directed graph. These paths are merged includes an identifier u of the UI element and data d associated
to construct a behavior graph similar to the process of with the event. A UI element may be visited any number of times
constructing an interaction network. within a trace. In general, data can include one or more attributes
Algorithm 2, 3 and 4 assume that if two or more events within a of the event such as the event type, user input, event duration, etc.
trace were generated by the same UI element, the latter event In this paper, we assume single data attribute events where the
corresponds to a correction of the data value input at the former data captures the learner input at the UI element.
events. In this case, we refer to the former events as retracted Table 2. Problems & Traces for the three learning domains
events and data values entered at these events are assumed to be
incorrect values. Using this assumption, these three algorithms are Math. Physics French
able to automatically generate incorrect paths in behavior graphs
#Problems 1013 497 71
unlike Algorithm 1. This assumption is not applied to Algorithm 1
to compare our work against prior work [11] on extracting tutor Max. #Unique Elements 33 62 10
models from multiple demonstrations.
Avg. #Unique Elements 4.6 9.7 2.5
3.3 Discussion Avg. #Training Traces 76.0 26.6 12.1
Table 1 characterizes the four algorithms described above based
on their capabilities. Incremental addition of demonstrations to Avg. #Heldout Traces 38.0 13.3 6.1
generate interaction networks does not identify incorrect input Avg. #Events Per Trace 5.3 8.9 4.7
data values. However, using the assumption about retracted
events, the other three algorithms are able to identify incorrect
inputs. Johnson et al. [15] used a similar assumption in their work
on reducing the visual complexity of interaction networks. We
notice that the Algorithms 2 and 3 are complementary in terms of
their ability to find alternate paths and unordered groups.
Algorithm 4 on the other hand offers both of these abilities.
Table 1. Comparison of Algorithm Capabilities

Capability▼ Algorithm► 1 2 3 4
Identifies incorrect answers N Y Y Y
Generates alternate paths N N Y Y
Finds unordered groups N Y N Y
Generalizes beyond training demonstrations N Y Y Y
Guarantees all training demnstrs. will pass Y N N N
Finds atleast one path to final solution♦ Y Y Y N
Discovers new/unseen data values N N N N

None of the algorithms discussed in this paper are capable of
discovering unseen inputs beyond those seen in the solution
traces. This type of generative ability is particularly useful for
learning tasks, such as language learning, where a large number of
different inputs may be expected from the learners. In our ongoing
work, we use a number of heuristics [7] as well as grammar
induction techniques [6] to generate unseen inputs for certain
nodes in the behavior graphs. Figure 1. Example Math Problem from Assistments
Source: www.assistments.org, April 2014
4. DATASETS Table 2 provides some statistics about the problem and traces for
1
We use three datasets, accessed via DataShop [19], to study the each of learning domains used in this work. The Mathematics
cross-domain applicability of ABGG algorithms. These datasets traces were derived from three Assistments [20] datasets.
were filtered to use only problems that had six or more traces and Assistments is a web-based learning platform, developed by
had at least two UI elements. Also, we eliminated all events, such Worcester Polytechnic Institute (WPI), that includes a
as help requests, that did not correspond to user input at a solution Mathematics intelligent tutoring system for middle & high school
step. In this way, the datasets were transformed into solution grades. Figure 1 shows an example math problem from the
traces. As discussed in Kumar et al. [4], a solution Assistments system. Together, these datasets are the largest of the
three domains we use. Prior to filtering, these dataset comprised a
total of 683,197 traces and 1,905,672 events from 3,140 problems.
1
PSLC DataShop is available at http://pslcdatashop.org For our experiments, we treat the three datasets to be independent
of each other to account for change in UI designs of the problems eliminated due to the filtering criterion compared to Mathematics
common to the three datasets. or Physics.
We used 10 (out of 20) of the largest datasets released under the
Andes2 project [22] to build the collection of Physics problems
and traces. Andes2 is an intelligent tutoring system that includes
pedagogical content for a two-semester long college and advanced
high-school level Physics course. These ten datasets are based on
logs from several semesters of use of the Andes2 system at the
United States Naval Academy. Prior to filtering, these dataset
comprised a total of 81,173 traces and 1,162,581 events from
2,187 different problems. Note that, as is case with the Math
dataset, we treat the ten Andes2 datasets independently. Note that,
unlike typical domain independent example-tracing based tutor,
the Andes2 systems uses a model-tracing approach for tracking
learner’s solution of a problem and to provide feedback. The
domain knowledge dependent model tracer is able to match highly
inflected learner inputs (e.g. variable names) to its solution graph.
Despite this difference in tutoring approach used by the Andes2
system, we decided to include this domain in our experiments to
study the performance of our algorithms on such solution traces.
Figure 2. Example Steps from Problem from the French
Finally, the French traces are based on two dataset from the Online Course Source: oli.cmu.edu, April 2014
“French Course” project on DataShop. These datasets were
collected from logs of student’s use of the “French Online” course The datasets used in our experiments contain solution traces.
hosted by the Open Learning Initiative (OLI) [22] at Carnegie Traces are paths through an existing behavior graph, unlike
Mellon University. Figure 2 shows steps from couple of example behavior demonstrations which are unconstrained by existing
problems from this course. These datasets comprised a total of tutor models. In addition to the fact that these are the only
37,439 traces and 253,744 events from 1,246 different problems. available large scale collection of solution paths, we use these
Note that a significantly larger fraction of French problems were datasets in our experiments because these traces have been

Table 3. Averaged Metrics for the Graphs Generated by ABGG Algorithms
*indicates significant (p < 0.05) difference with the other algorithms (within the same dataset)

Mathematics (Assistments) Physics (Andes2) French (OLI)

Algorithm ► 1 2 3 4 1 2 3 4 1 2 3 4

#Nodes 79.2 5.4* 6.0* 6.6* 147.8 7.9* 11.5* 11.7* 25.6 3.8* 4.5* 4.5*

#Correct Edges 148.0 12.9* 18.3* 17.5* 182.2 43.5* 76.4 34.5* 37.2 6.9 9.8 9.5

#Incorrect Edges 23.9 33.5 19.5* 35.1 53.0 13.4* 4.2 11.0 8.0

Compression Ratio 6.7 76.8* 66.8 60.2 2.3 31.6* 21.9 21.7 2.2 14.6 12.8 12.8

% Accurate Correct Edges 39.1 41.9 42.5* 44.1* 61.4 80.2* 58.9 80.8* 22.5 27.7* 26.9* 29.8*

% Accurate Incorrect Edges 99.9* 97.2 99.5* 92.5* 67.3 85.5 97.8* 86.1 87.2

Training Error Rate 51.4 25.4 17.7* 17.5* 33.6 17.2* 25.8 24.3 75.2 56.1 22.3* 25.3*

Heldout Error Rate 42.8 23.5 16.1* 15.7* 29.1 25.5* 33.3 30.8 45.3 35.9 19.9* 18.5*

% Training Unseen Events 0.0* 10.7 2.2 6.8 0.0* 14.1 12.2 24.6 0.0* 13.4 5.2 4.5

% Heldout Unseen Events 10.2* 19.1 11.5* 13.9 35.9* 41.7 38.4* 42.6 31.7* 40.7 34.4* 34.3*

Branching Factor 2.2 10.9 12.6* 8.5 1.5 13.4* 12.9* 6.0 1.6 6.7* 9.4* 7.8*

#Groups 0.5* 0.0 0.8 1.4* 0.3* 0.1

Avg. Group Size 1.9* 0.0 2.0 2.0 0.6* 0.3

% Group Coverage 31.8* 0.5 27.2 30.6* 15.4* 6.1
collected from a large set of real users. They contain realistic accuracy metrics for this dataset and is not significantly worse on
variations in learner inputs similar to demonstrations. the fourth metric. Because of their lossless nature, Algorithm 1
(Interaction Network) performs the best on Completeness metrics
5. EXPERIMENTS (% unseen events). However, it is not significantly better than
We use a three-fold cross validation design that splits the Algorithm 3 (Center-Star Alignment). We find evidence of over-
available traces into three different training and held out sets. The fitting of the algorithms to training traces on this metric as
readability metrics (i.e. number of nodes, number of edges and indicated by the approximately 9% higher rate of unseen events
compression ratio) as well as the robustness metrics (branching for held out traces for all the algorithms. Algorithm 3 significantly
factor, number of unordered groups, average group size and outperforms the other algorithms on the primary robustness metric
coverage of graph within groups) are reported on the behavior (Branching Factor) for this domain. Algorithm 2 is better than
graphs generated by the algorithms. On the other hand, some Algorithm 4 for the metrics based on unordered groups.
accuracy metrics such as the accuracy of correct and incorrect
edges are measured on generated graphs whereas others such as
error rate are measured on event sequences which could be the
training traces; i.e., sequences used to generate the graphs, or held
out traces. Similarly, our completeness metrics, i.e. the rate of
unseen events in a sequence, can be measured on both training as
well as held out traces. Note that the metrics computed on training
traces used to generate the graphs may not accurately indicate the
performance of an algorithm due to over-fitting. This is the
motivation for choosing the cross validation based experimental
design.
5.1 Results
Table 3 shows our results along 14 metrics for each of the four
algorithms applied to the three learning domains under
consideration. Reported metrics are averaged over three cross
validation splits as well as over all the problems for each domain.
The metrics are organized by the four desirable characteristics
discussed earlier. Primary metric for each characteristic is
highlighted.

Figure 3. Compression Ratio of Algorithm 2
5.1.1 Mathematics Figure 4. Heldout Error Rate of Algorithms 2 and 4
As expected, the interaction networks comprise of a large number
of nodes and edges that lead them to have significantly smaller
5.1.2 Physics
compression ratio. Algorithm 2 (Heuristic Alignment) On the primary readability metric (Compression Ratio), Algorithm
outperforms all other algorithms on three of the readability 2 outperforms the others on the Physics dataset as was the case
metrics. On the other hand, Algorithm 4 (Path Pruning) with Mathematics. This is consistent with prior conclusion [4] on
significantly outperforms the other algorithms on three of the the use of Algorithm 2 for readability. We note that the Physics
dataset has significantly lower compression ratio than the previous Second, we expect the branching factor to be higher for a
dataset. Figure 3 shows a scatter plot and domain-specific language learning domain, due to the high degree of linguistic
regression fits for the compression ratio of Algorithm 2 for variation in learner inputs. The results in Table 3 do not indicate
different problems with different number of training traces and UI this. However, Figure 6 verifies this intuition. Branching factor
elements. We see that for equivalent number of training traces, the for the French behavior graphs is higher than those for the STEM
compression ratio for Physics is actually slightly better than domain for problems that have 10 or more traces.
Mathematics. However, as we know from Table 2, fewer training
traces are available for the Physics problems on average.
On the primary accuracy metric, we find that Algorithm 2 works
best for Physics unlike the case with the Mathematics domain. We
can note that the Algorithm 2 is significantly better on the
accuracy of incorrect edges. Figure 4 shows the relationship
between the error rate in heldout traces and the accuracy of
incorrect edges. We also see that the percentage of unseen events
in heldout traces is significantly higher for Physics. The lower
incorrect edge accuracy and higher percentage of unseen events
can be attributed to the differences in the tutoring approach
underlying the Andes2 system which uses domain-specific
knowledge to match a large variety of inputs from the learner at
each step of the solution. Because of this, Andes2 elicits
significantly diverse (& hence novel) inputs across traces.
Algorithms 2 and 3 are not significantly different in terms of the
primary robustness metric.
5.1.3 French

Figure 6. Branching Factor of Algorithm 3
5.1.4 Automatically Generated Behavior Graphs
Figures 7, 8 and 9 showcase several qualitative characteristics of
automatically generated behavior graphs (truncated to fit) for the
problems in the three datasets used in this work. We use the
following visual convention: Circular nodes represent states and
are labeled with identifiers u of the corresponding UI element.
Edges are labeled with the data values d. Correct edges are labeled
with green rectangles and incorrect edges are labeled with red
rectangles. Unordered groups are shown using blue containers.
Figure 7 shows graphs generated by two different algorithms for
the same Mathematics problem in the Assistments dataset using
241 solution traces by learners. The graph generated by Algorithm
1 is dense and hardly readable due to the large number of nodes
and edges in this graph. Also, as discussed in Section 3, this
algorithm is unable to identify incorrect paths. Contrary to that,
the graph in Figure 7b is composed of only 6 nodes. The various
paths taken by learners are compressed into 46 correct and 39
Figure 5. Accuracy of Correct Edges for Algorithm 4
incorrect edges. We can notice that not all paths are accurate.
The results for our non-STEM domain are largely consistent with However, the accurate paths are more frequent, as indicated by the
the Mathematics domain. This may be attributed to the similarities thicker arcs associated with the edge. In our ongoing work, we are
of the underlying tutoring approach for the Assistments system extending these algorithms to use this frequency attribute to
and the French Online course which has been developed using the eliminate inaccurate paths (either automatically, or by providing
Cognitive Tutor Authoring Tools (CTAT) [2]. However, we can additional controls to model developers in authoring tools).
notice two key differences. First, the accuracy of correct edges for
A behavior graph from the Physics dataset is shown in Figure 8.
this domain is significantly lower. Because the French Online
As discussed earlier, the large variation in learner input at each
Course is deployed on an publicly accessible platform, its likely
state is depicted in the edge labels of this graph. We notice that for
that a large number of the solution traces were generated by
the last state (s6) which corresponds to the learners filling in the
beginners as well as non-serious users leading to the dataset
answer to a problem, many minor variations of the correct answer
containing many incomplete solution traces containing no correct
are accurately captured. Due to the domain independent nature of
answers. This is evidenced in Figure 5 as we see that correct edge
our algorithms, these answers are treated as different string.
accuracy dramatically degrades for long traces which is contrary
Integration of domain knowledge can lead to further compression
to the case with the other two domains.
of these answers into a single path.
The linguistic variation in the inputs to a problem in the French
dataset is also noticeable in the two graphs for the same problem
in Figure 9. We can see the several wrong answers are marked as
correct answers (and vice versa), although the frequency-based
edge notation identifies the correct answer as was the case in
Figure 7b. In this problem, learners are asked to listen to an audio
file and type in the French word they hear. Learners are allowed
to go back and forth between these two steps. The first step has no
wrong answer. We notice that our assumption to consider
retracted events as incorrect fails in this case.

Figure 9a. Behavior Graph: French, Algorithm 2

Figure 9b. Behavior Graph: French, Algorithm 4
Figure 7a. Behavior Graph: Mathematics, Algorithm 1
6. CONCLUSIONS
In this paper, we have shared results from an empirical analysis of
application of ABGG algorithms to three different learning
domains. Several similarities and differences between the
performances of four algorithms on problems from these three
domains were discussed in the previous section.
We find that the accuracy of these algorithms suffers when they
are applied to solution traces collected from a tutoring system that
uses domain knowledge to process a large variety of inputs from
learners. While in our previous work [4], we have recommended
the use of Algorithm 2 as the default ABGG algorithm for use
within authoring tools, we find that for language learning
domains, Algorithm 4 may be preferable since it is the most
accurate on the French dataset and not significantly worse than the
Figure 7b. Behavior Graph: Mathematics, Algorithm 2 other algorithms on the other primary metrics.

It is particularly interesting to note the differences in the way We identified multiple potential improvements to the ABGG
Algorithm 2 and Algorithm 4 encode robustness into the learnt algorithms based on these analyses. There are several domain
tutor model. While Algorithm 2 identifies an unordered group specific nuances to the UI elements that comprise the problems in
containing the listen and answer nodes which allows learners to each domain. For example, in the French domain, we found steps
traverse these nodes in any order, Algorithm 4 identifies that the that do not have any wrong answer. For broad use, ABGG
listen step is optional and create two different way to reach the algorithms should identify these UI elements and selectively apply
answer step based on the solution behaviors exhibited by learners the powerful assumption about retracted events. Furthermore, the
in the traces. algorithms can exploit additional features computed from across
the multiple traces, such as the frequency of a data value at a
node, to improve the accuracy of the automatically generated
behavior graphs.
Finally, this paper extends our recent work on use of multiple
behavior demonstrations to automatically generate tutor models
using ABGG algorithms. While these algorithms can be improved
in specific ways discussed above, we find evidence for their
applicability to multiple domains.

Figure 8. Behavior Graph: Physics, Algorithm 2
ACKNOWLEDGEMENTS Interaction Logs to Improve Educational Outcomes, 7th
International Conference on Intelligent Tutoring Systems
This research was funded by the US Office of Naval Research (ITS 2004). August 2004
(ONR) contract N00014-12-C-0535.
[12] Pavlik, P.I., Cen, H., and Koedinger, K.R. 2009. Learning
Factors Transfer Analysis: Using Learning Curve Analysis to
7. REFERENCES Automatically Generate Domain Models, In Proceedings of
the 2nd International Conference on Educational Data
[1] Johnson, W. L., and Valente, A. 2008. Collaborative Mining (EDM 2009). Barnes, T., Desmarais, M., Romero, C.,
authoring of serious games for language and culture. In Ventura, S. (Eds.). 121-130
Proceedings of SimTecT (March 2008).
[13] Koedinger, K.R., Mclaughlin E.A., and Stamper, J.C. 2012.
[2] Aleven, V., McLaren, B. M., Sewall, J., and Koedinger. K. Automated student model improvement, In Proceedings of
R. 2006. The cognitive tutor authoring tools (CTAT): the 5th International Conference on Educational Data Mining
preliminary evaluation of efficiency gains. In Proceedings of (EDM 2012). Yacef, K., Zaïane, O., Hershkovitz, H.,
the 8th International Conference on Intelligent Tutoring Yudelson, M., and Stamper, J. (Eds.). 17-24
Systems (ITS'06), Ikeda, M., Ashley, K. D., and Chan, T.W.
(Eds.). Springer-Verlag, Berlin, Heidelberg, 61-70. [14] Sudol, L.A, Rivers, K., and Harris, T.K. 2012. Calculating
Probabilistic Distance to Solution in a Complex Problem
[3] Kumar, R., Roy, M.E, Roberts, R.B., and Makhoul, J.I. 2014. Solving Domain, In Proceedings of the 5th International
Towards Automatically Building Tutor Models Using Conference on Educational Data Mining (EDM 2012).
Multiple Behavior Demonstrations. In Proceedings of 12th Yacef, K., Zaïane, O., Hershkovitz, H., Yudelson, M., and
Intl. Conf. on Intelligent Tutoring Systems (ITS 2014), Stamper, J. (Eds.). 144-147
Honolulu, HI.
[15] Johnson, M., Eagle, M., Stamper, J., and Barnes, T. 2013. An
[4] Kumar, R., Roy, M.E, Roberts, R.B., and Makhoul, J.I. 2014. Algorithm for Reducing the Complexity of Interaction
Comparison of Algorithms for Automatically Building Networks, In Proceedings of the 6thInternational Conference
Example-Tracing Tutor Models. In Proceedings of 7th Intl. on Educational Data Mining, (EDM 2013). D’Mello, S. K.,
Conf. on Educational Data Mining (EDM 2014), Honolulu, Calvo, R. A., Olney, A. (Eds.). 248-251
HI.
[16] Ritter, R., Harris, T.K, Nixon, T., Dickison, D., Murray,
[5] Aleven, V., Mclaren, B. M., Sewall, J., and Koedinger. K. R. R.C., and Towle, B. 2009. Reducing the Knowledge Tracing
2009. A New Paradigm for Intelligent Tutoring Systems: Space, In Proceedings of the 2ndInternational Conference on
Example-Tracing Tutors. Int. J. Artif. Intell. Ed. 19, 2 (April Educational Data Mining (EDM 2009). Barnes, T.,
2009), 105-154. Desmarais, M., Romero, C., Ventura, S. (Eds.). 151-160
[6] Kumar, R., Sagae, A., and Johnson, W. L. 2009. Evaluating [17] Gusfield, D. 1997. Algorithms on Strings, Trees and
an Authoring Tool for Mini-Dialogs. In Proceedings of the Sequences. Cambridge University Press, New York.
2009 Conference on Artificial Intelligence in Education,
Dimitrova, V., Mizoguchi, R., du Boulay, B., and Graesser, [18] Yen, J. Y. 1971. Finding the K Shortest Loopless Paths in a
A. (Eds.). IOS Press, Amsterdam, The Netherlands, The Network. Management Science 17(11). 712-716
Netherlands, 647-649. [19] Koedinger, K.R., Baker, R.S.J.d., Cunningham, K.,
[7] Kumar, R., Roy, M.E, Pattison-Gordon, E. and Roberts, R.B. Skogsholm, A., Leber, B., and Stamper, J. 2010. A Data
2014. General Purpose ITS Development Tools. In Repository for the EDM community: The PSLC DataShop.
Proceedings of Workshop on Intelligent Tutoring System In Handbook of Educational Data Mining. Romero, C.,
Authoring Tools, 12th Intl. Conf. on Intelligent Tutoring Ventura, S., Pechenizkiy, M., Baker, R.S.J.d. (Eds.). Boca
Systems (ITS 2014), Honolulu, HI. Raton, FL: CRC Press
[8] Question Generation Workshops. 2008-2011. [20] Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N. T.,
http://www.questiongeneration.org/ Koedinger, K. R., Junker, B., Ritter, S., Knight, A.,
Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner,
[9] Barnes, T. and Stamper, J. 2008. Toward Automatic Hint T. E., Upalekar. R, Walonoski, J.A., Macasek. M.A. and
Generation for Logic Proof Tutoring Using Historical Rasmussen, K. P. 2005. The Assistment project: Blending
Student Data. In Proceedings of the 9th International assessment and assisting. In Proceedings of the 12th
Conference on Intelligent Tutoring Systems (ITS '08). Woolf, International Conference on Artificial Intelligence in
B. P., Aimeur, E., Nkambou, R., and Lajoie , S. (Eds.). Education, C.K. Looi, G. McCalla, B. Bredeweg, & J.
Springer-Verlag, Berlin, Heidelberg, 373-382. Breuker (Eds.) IOS Press. 555-562.
[10] Eagle, M., Johnson, J., and Barnes, T., 2012. Interaction [21] VanLehn, K., Lynch, C., Schulze, K. Shapiro, J. A., Shelby,
Networks: Generating High Level Hints Based on Network R., Taylor, L., Treacy, D., Weinstein, A., and Wintersgill, M.
Community Clusterings, In Proceedings of the 5th 2005. The Andes physics tutoring system: Lessons Learned.
International Conference on Educational Data Mining In International Journal of Artificial Intelligence and
(EDM 2012). Yacef, K., Zaïane, O., Hershkovitz, H., Education, 15 (3), 1-47
Yudelson, M., and Stamper, J. (Eds.). 164-167
[22] Strader, R. and Thille, C. 2012. The Open Learning
[11] McLaren, B.M., Koedinger, K.R., Schneider, M., Harrer, A., Initiative: Enacting Instruction Online. In Game Changers:
and Bollen, L. 2004. Bootstrapping Novice Data: Semi- Education and Information Technologies. Oblinger, D.G.
Automated Tutor Authoring Using Student Log Files. In (Ed.) Educause. 201-213.
Proceedings of the Workshop on Analyzing Student-Tutor