=Paper=
{{Paper
|id=Vol-1183/gedm_paper04
|storemode=property
|title=Cross-Domain Performance of Automatic Tutor Modeling Algorithms
|pdfUrl=https://ceur-ws.org/Vol-1183/gedm_paper04.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/Kumar14
}}
==Cross-Domain Performance of Automatic Tutor Modeling Algorithms==
Cross-Domain Performance of Automatic Tutor Modeling Algorithms Rohit Kumar Raytheon BBN Technologies Cambridge, MA, USA rkumar @ bbn.com annotation of the behavior graph. As ITS are being deployed to a ABSTRACT large active user pool, it is now possible to pilot the UI with a small sample of learners to collect multiple behavior In our recent work, we have proposed the use of multiple solution demonstrations. We can significantly reduce the Stage 3 effort of demonstrations of a learning task to automatically generate a tutor ITS developers by using algorithms that can automatically create model. We have developed a number of algorithms for this a generalized behavior graph from multiple demonstrations. automation. This paper describes the application of these domain- Several algorithms to address this challenge have been proposed independent algorithms to three datasets from different learning and evaluated [4]. domains (Mathematics, Physics, French). Besides verifying the applicability of our approach across domains, we report several In this paper, we will study the applicability and performance of domain specific performance characteristics of these algorithms these algorithms on publicly available datasets from three which can be used to choose appropriate algorithms in a different learning domains. Section 3 summarizes the key principled manner. While the Heuristic Alignment based characteristics of the four algorithms used in our study. Section 4 algorithm (Algorithm 2) may be the default choice for automatic describes learning domains and the corresponding datasets used in tutor modeling, our empirical finding suggest that the Path this work. Results and Analysis from our experiments are Pruning based algorithm (Algorithm 4) may be favored for presented in Section 5. Before diving into the algorithms, the next language learning domains. section reviews related work on automation of tutor model development. Keywords Tutor Modeling, Automation, Domain Independence, STEM 2. RELATED WORK domains, Language Learning Automation of tutor model development process has been explored in different contexts using completely automated 1. INTRODUCTION methods as well as augmentation of authoring tools [6][7]. For example, motivated by application in language learning, a series Wide-scale transition of Intelligent Tutoring Systems (ITS) to the of workshops on the problem of automatic question generation [8] real world demands a scalable ability to develop such systems. explored a number of information extraction and NLP techniques The past decade has seen the first instantiations of that employ existing linguistic resources. Barnes and Stamper [9] industrialization of ITS development in the form of commercial proposed a method that uses existing student solutions to generate products for different learning domains as well as diverse user hint messages for the Logic Proof tutor. Recently, Eagle et al. [10] populations. In addition to addressing non-technical challenges have used clustering of interaction network states as an approach such as designing robust production processes around to the same problem. multidisciplinary teams of domain and pedagogical experts [1], the industrialization of this technology is enabled by technical In the context of knowledge-tracing and example-tracing tutors, advancements such as the development of general purpose McLaren et al. [11] proposed the use of activity logs from novice authoring tools [2] which has allowed a scalable workforce to users to bootstrap tutor model development. They developed contribute to ITS development. software tools that integrate access to novice activity logs with authoring tools. The baseline algorithm (Interaction Networks) In this paper, we extend our recent work [3][4] on automatically used in our work is similar to the integrated data view used in this developing Example-Tracing Tutors (ETTs) [5] using multiple prior work. Furthermore, the algorithms used in our work address behavior demonstrations. Conventionally, ETTs are developed in some of the shortcomings of their work (e.g. inability to identify three stages by trained domain experts: (1) User Interface (UI) “buggy” paths). development, (2) Behavior demonstration, (3) Generalization and In addition to tutor modeling, recent work has investigated automated methods for improving domain and student models [12] [13]. Sudol et al. [14] aggregated solution paths taken by different learners to develop a probabilistic solution assessment metric. Johnson et al. [15] are creating visualization tools for interaction networks that combine learner traces from open-ended problem solving environments. They have developed an algorithm for reducing the complexity of combined networks to make them more readable/navigable. In a similar spirit, work by Ritter et al. [16] used clustering techniques to reduce the large feature space of student models to assist in qualitative model interpretation. 3. GENERATING BEHAVIOR GRAPHS 3.1.4 Accurate Automatic Behavior Graph Generation (ABGG) algorithms Behavior graphs should be error free. This includes being able to analyze the similarities and difference between multiple solution accurately capture the correct and incorrect events by learners demonstrations of a problem to induce a behavior graph that can depending on the current solution state. Edge accuracy measures serve as a tutor model for the problem. the percentage of Correct & Incorrect edges that were accurately generated by the algorithm. Error Rate is a frequency weighted 3.1 Behavior Graphs combination of edge accuracy that measures the fraction of learner Behavior graphs [5] are directed graphs. The nodes in this graph events that will be inaccurately classified by the automatically correspond to valid solution states. Non-terminal nodes represent generated behavior graph. We use the error rate of an automatically partial solutions. Edges in the graph represent solution paths some generate behavior graph on held out demonstrations as the primary of which are correct and lead to the next state while other are accuracy metric. incorrect and usually lead back to the same state. Edges are 3.1.5 Robust annotated with the conditions that a behavior event must meet to traverse the path. One of the reasons for the success of expertly crafted ETTs is the ability to use them with a wide range of learners under different Behavior graphs may contain multiple paths between two nodes. deployment conditions. Automatically generated behavior graphs Multiple paths are useful to facilitate learner’s exploration of should retain this characteristic; e.g., by identifying alternate paths alternate solutions to a problem especially in ill-defined learning and unordered groups. It is not unforeseeable that the use of a domains. Behavior graphs may also include unordered groups. As data-driven approach could contribute to creating behavior graphs the name suggests, states within an unordered group may be that are more robust than those authored by a human expert. traversed in any order. Branching factor is the average number of data values available at Well-constructed behavior graphs have several desirable each UI element. A large branching factor indicates the capability characteristics which motivate the design of metrics we use to to process a large variety of learner inputs at each state. Also, the evaluate ABGG algorithms. number and size of unordered groups is indicative of flexibility a 3.1.1 Effective graph affords to learners to explore the solution paths of a problem. Since the purpose of the behavior graphs is to serve as a tutor model, the primary metric for evaluating these models is their Note that readability and robustness are complementary learning efficacy measured via use of the models by a relevant characteristics of a behavior graph. For example, a highly sample of learners. However, in this paper we focus only on the complex behavior graph may be very robust but may not be very use of automated metrics that do not require access to a learner readable. pool. Further, as we in section 5, the automatically generated 3.2 ABGG Algorithms behavior graphs are not perfect. They require checking and refinement by ITS developers before they can be used with We use four algorithms, introduced in our previous work [4], to learners. generate behavior graphs using multiple solution traces of a problem. The first algorithm (Algorithm 1) generates interaction 3.1.2 Readable networks by sequentially collapsing identical events in solution One of the key characteristics of behavior graphs that makes them traces into a shared node and creating a branch whenever two a popular model is that they are readable by ITS developers different events are found. Interaction networks have been used in without requiring a deep understanding of computational or prior work [10][15]. cognitive sciences. Automatically created behavior graphs should Algorithm 2 uses a heuristic alignment technique [3] to align be editable with existing authoring tools to facilitate necessary similar events across multiple solution traces. The alignment is manual annotation and modifications. Ideally, ABGG algorithms used to obtain a sequence of traversal through the problem’s steps. should create concise graphs without losing other desirable Furthermore, this algorithm is able to use the positional entropy of characteristics. This may involve collapsing redundant paths and a sequence of elements while obtaining the optimal sequence to even pruning spurious or infrequent edges. identify unordered groups. The conciseness of a graph can be measured using the number of Similar to the above algorithm, Algorithm 3 finds the optimal nodes and edges in the graph. Our primary readability metric, sequence between aligned events. However, this algorithm uses Compression Ratio measures the rate at which an algorithm is the Center Star Algorithm [17] to align the multiple solution able to reduce behavior events into behavior states (i.e. nodes) by traces instead of the heuristic used by Algorithm 2. The Center finding similarities between events. Star Algorithm is a foundational algorithm used for aligning more than two sequences of symbols. It is particularly suited for our 3.1.3 Complete application because it is polynomial time in computational In order to minimize author effort, generated behaviors graphs complexity and it does not make any assumptions about the space should be as complete for creating an ETT as possible. As a and relationship of symbols comprising the sequence. minimal criterion, at least one valid path to the final solution First order transition matrix computed from solution traces can be should be included♦. Additionally, complete behaviors graphs are used to represent a directed graph. Algorithm 4 considers ABGG annotated with all the expected inputs by the learner. We use the as the process of finding multiple paths in a directed graph. Rate of Unseen Events in held out demonstrations as the primary Specifically, the longest (non-repeating) path in this directed metric to measure the completeness of our automatically graph represents the most likely path through the solution steps. generated behavior graphs. Since, the problem of finding longest paths in general graphs is known to be NP-hard, we employ a combination of bounded longest path finding and an algorithm for finding multiple shortest trace/demonstration comprises of a sequence of user interface (UI) paths [18] in a transformed transition matrix to obtain a number of events. Each event is represented as a 2-tuple e = (u, d) that different paths through the directed graph. These paths are merged includes an identifier u of the UI element and data d associated to construct a behavior graph similar to the process of with the event. A UI element may be visited any number of times constructing an interaction network. within a trace. In general, data can include one or more attributes Algorithm 2, 3 and 4 assume that if two or more events within a of the event such as the event type, user input, event duration, etc. trace were generated by the same UI element, the latter event In this paper, we assume single data attribute events where the corresponds to a correction of the data value input at the former data captures the learner input at the UI element. events. In this case, we refer to the former events as retracted Table 2. Problems & Traces for the three learning domains events and data values entered at these events are assumed to be incorrect values. Using this assumption, these three algorithms are Math. Physics French able to automatically generate incorrect paths in behavior graphs #Problems 1013 497 71 unlike Algorithm 1. This assumption is not applied to Algorithm 1 to compare our work against prior work [11] on extracting tutor Max. #Unique Elements 33 62 10 models from multiple demonstrations. Avg. #Unique Elements 4.6 9.7 2.5 3.3 Discussion Avg. #Training Traces 76.0 26.6 12.1 Table 1 characterizes the four algorithms described above based on their capabilities. Incremental addition of demonstrations to Avg. #Heldout Traces 38.0 13.3 6.1 generate interaction networks does not identify incorrect input Avg. #Events Per Trace 5.3 8.9 4.7 data values. However, using the assumption about retracted events, the other three algorithms are able to identify incorrect inputs. Johnson et al. [15] used a similar assumption in their work on reducing the visual complexity of interaction networks. We notice that the Algorithms 2 and 3 are complementary in terms of their ability to find alternate paths and unordered groups. Algorithm 4 on the other hand offers both of these abilities. Table 1. Comparison of Algorithm Capabilities Capability▼ Algorithm► 1 2 3 4 Identifies incorrect answers N Y Y Y Generates alternate paths N N Y Y Finds unordered groups N Y N Y Generalizes beyond training demonstrations N Y Y Y Guarantees all training demnstrs. will pass Y N N N Finds atleast one path to final solution♦ Y Y Y N Discovers new/unseen data values N N N N None of the algorithms discussed in this paper are capable of discovering unseen inputs beyond those seen in the solution traces. This type of generative ability is particularly useful for learning tasks, such as language learning, where a large number of different inputs may be expected from the learners. In our ongoing work, we use a number of heuristics [7] as well as grammar induction techniques [6] to generate unseen inputs for certain nodes in the behavior graphs. Figure 1. Example Math Problem from Assistments Source: www.assistments.org, April 2014 4. DATASETS Table 2 provides some statistics about the problem and traces for 1 We use three datasets, accessed via DataShop [19], to study the each of learning domains used in this work. The Mathematics cross-domain applicability of ABGG algorithms. These datasets traces were derived from three Assistments [20] datasets. were filtered to use only problems that had six or more traces and Assistments is a web-based learning platform, developed by had at least two UI elements. Also, we eliminated all events, such Worcester Polytechnic Institute (WPI), that includes a as help requests, that did not correspond to user input at a solution Mathematics intelligent tutoring system for middle & high school step. In this way, the datasets were transformed into solution grades. Figure 1 shows an example math problem from the traces. As discussed in Kumar et al. [4], a solution Assistments system. Together, these datasets are the largest of the three domains we use. Prior to filtering, these dataset comprised a total of 683,197 traces and 1,905,672 events from 3,140 problems. 1 PSLC DataShop is available at http://pslcdatashop.org For our experiments, we treat the three datasets to be independent of each other to account for change in UI designs of the problems eliminated due to the filtering criterion compared to Mathematics common to the three datasets. or Physics. We used 10 (out of 20) of the largest datasets released under the Andes2 project [22] to build the collection of Physics problems and traces. Andes2 is an intelligent tutoring system that includes pedagogical content for a two-semester long college and advanced high-school level Physics course. These ten datasets are based on logs from several semesters of use of the Andes2 system at the United States Naval Academy. Prior to filtering, these dataset comprised a total of 81,173 traces and 1,162,581 events from 2,187 different problems. Note that, as is case with the Math dataset, we treat the ten Andes2 datasets independently. Note that, unlike typical domain independent example-tracing based tutor, the Andes2 systems uses a model-tracing approach for tracking learner’s solution of a problem and to provide feedback. The domain knowledge dependent model tracer is able to match highly inflected learner inputs (e.g. variable names) to its solution graph. Despite this difference in tutoring approach used by the Andes2 system, we decided to include this domain in our experiments to study the performance of our algorithms on such solution traces. Figure 2. Example Steps from Problem from the French Finally, the French traces are based on two dataset from the Online Course Source: oli.cmu.edu, April 2014 “French Course” project on DataShop. These datasets were collected from logs of student’s use of the “French Online” course The datasets used in our experiments contain solution traces. hosted by the Open Learning Initiative (OLI) [22] at Carnegie Traces are paths through an existing behavior graph, unlike Mellon University. Figure 2 shows steps from couple of example behavior demonstrations which are unconstrained by existing problems from this course. These datasets comprised a total of tutor models. In addition to the fact that these are the only 37,439 traces and 253,744 events from 1,246 different problems. available large scale collection of solution paths, we use these Note that a significantly larger fraction of French problems were datasets in our experiments because these traces have been Table 3. Averaged Metrics for the Graphs Generated by ABGG Algorithms *indicates significant (p < 0.05) difference with the other algorithms (within the same dataset) Mathematics (Assistments) Physics (Andes2) French (OLI) Algorithm ► 1 2 3 4 1 2 3 4 1 2 3 4 #Nodes 79.2 5.4* 6.0* 6.6* 147.8 7.9* 11.5* 11.7* 25.6 3.8* 4.5* 4.5* #Correct Edges 148.0 12.9* 18.3* 17.5* 182.2 43.5* 76.4 34.5* 37.2 6.9 9.8 9.5 #Incorrect Edges 23.9 33.5 19.5* 35.1 53.0 13.4* 4.2 11.0 8.0 Compression Ratio 6.7 76.8* 66.8 60.2 2.3 31.6* 21.9 21.7 2.2 14.6 12.8 12.8 % Accurate Correct Edges 39.1 41.9 42.5* 44.1* 61.4 80.2* 58.9 80.8* 22.5 27.7* 26.9* 29.8* % Accurate Incorrect Edges 99.9* 97.2 99.5* 92.5* 67.3 85.5 97.8* 86.1 87.2 Training Error Rate 51.4 25.4 17.7* 17.5* 33.6 17.2* 25.8 24.3 75.2 56.1 22.3* 25.3* Heldout Error Rate 42.8 23.5 16.1* 15.7* 29.1 25.5* 33.3 30.8 45.3 35.9 19.9* 18.5* % Training Unseen Events 0.0* 10.7 2.2 6.8 0.0* 14.1 12.2 24.6 0.0* 13.4 5.2 4.5 % Heldout Unseen Events 10.2* 19.1 11.5* 13.9 35.9* 41.7 38.4* 42.6 31.7* 40.7 34.4* 34.3* Branching Factor 2.2 10.9 12.6* 8.5 1.5 13.4* 12.9* 6.0 1.6 6.7* 9.4* 7.8* #Groups 0.5* 0.0 0.8 1.4* 0.3* 0.1 Avg. Group Size 1.9* 0.0 2.0 2.0 0.6* 0.3 % Group Coverage 31.8* 0.5 27.2 30.6* 15.4* 6.1 collected from a large set of real users. They contain realistic accuracy metrics for this dataset and is not significantly worse on variations in learner inputs similar to demonstrations. the fourth metric. Because of their lossless nature, Algorithm 1 (Interaction Network) performs the best on Completeness metrics 5. EXPERIMENTS (% unseen events). However, it is not significantly better than We use a three-fold cross validation design that splits the Algorithm 3 (Center-Star Alignment). We find evidence of over- available traces into three different training and held out sets. The fitting of the algorithms to training traces on this metric as readability metrics (i.e. number of nodes, number of edges and indicated by the approximately 9% higher rate of unseen events compression ratio) as well as the robustness metrics (branching for held out traces for all the algorithms. Algorithm 3 significantly factor, number of unordered groups, average group size and outperforms the other algorithms on the primary robustness metric coverage of graph within groups) are reported on the behavior (Branching Factor) for this domain. Algorithm 2 is better than graphs generated by the algorithms. On the other hand, some Algorithm 4 for the metrics based on unordered groups. accuracy metrics such as the accuracy of correct and incorrect edges are measured on generated graphs whereas others such as error rate are measured on event sequences which could be the training traces; i.e., sequences used to generate the graphs, or held out traces. Similarly, our completeness metrics, i.e. the rate of unseen events in a sequence, can be measured on both training as well as held out traces. Note that the metrics computed on training traces used to generate the graphs may not accurately indicate the performance of an algorithm due to over-fitting. This is the motivation for choosing the cross validation based experimental design. 5.1 Results Table 3 shows our results along 14 metrics for each of the four algorithms applied to the three learning domains under consideration. Reported metrics are averaged over three cross validation splits as well as over all the problems for each domain. The metrics are organized by the four desirable characteristics discussed earlier. Primary metric for each characteristic is highlighted. Figure 3. Compression Ratio of Algorithm 2 5.1.1 Mathematics Figure 4. Heldout Error Rate of Algorithms 2 and 4 As expected, the interaction networks comprise of a large number of nodes and edges that lead them to have significantly smaller 5.1.2 Physics compression ratio. Algorithm 2 (Heuristic Alignment) On the primary readability metric (Compression Ratio), Algorithm outperforms all other algorithms on three of the readability 2 outperforms the others on the Physics dataset as was the case metrics. On the other hand, Algorithm 4 (Path Pruning) with Mathematics. This is consistent with prior conclusion [4] on significantly outperforms the other algorithms on three of the the use of Algorithm 2 for readability. We note that the Physics dataset has significantly lower compression ratio than the previous Second, we expect the branching factor to be higher for a dataset. Figure 3 shows a scatter plot and domain-specific language learning domain, due to the high degree of linguistic regression fits for the compression ratio of Algorithm 2 for variation in learner inputs. The results in Table 3 do not indicate different problems with different number of training traces and UI this. However, Figure 6 verifies this intuition. Branching factor elements. We see that for equivalent number of training traces, the for the French behavior graphs is higher than those for the STEM compression ratio for Physics is actually slightly better than domain for problems that have 10 or more traces. Mathematics. However, as we know from Table 2, fewer training traces are available for the Physics problems on average. On the primary accuracy metric, we find that Algorithm 2 works best for Physics unlike the case with the Mathematics domain. We can note that the Algorithm 2 is significantly better on the accuracy of incorrect edges. Figure 4 shows the relationship between the error rate in heldout traces and the accuracy of incorrect edges. We also see that the percentage of unseen events in heldout traces is significantly higher for Physics. The lower incorrect edge accuracy and higher percentage of unseen events can be attributed to the differences in the tutoring approach underlying the Andes2 system which uses domain-specific knowledge to match a large variety of inputs from the learner at each step of the solution. Because of this, Andes2 elicits significantly diverse (& hence novel) inputs across traces. Algorithms 2 and 3 are not significantly different in terms of the primary robustness metric. 5.1.3 French Figure 6. Branching Factor of Algorithm 3 5.1.4 Automatically Generated Behavior Graphs Figures 7, 8 and 9 showcase several qualitative characteristics of automatically generated behavior graphs (truncated to fit) for the problems in the three datasets used in this work. We use the following visual convention: Circular nodes represent states and are labeled with identifiers u of the corresponding UI element. Edges are labeled with the data values d. Correct edges are labeled with green rectangles and incorrect edges are labeled with red rectangles. Unordered groups are shown using blue containers. Figure 7 shows graphs generated by two different algorithms for the same Mathematics problem in the Assistments dataset using 241 solution traces by learners. The graph generated by Algorithm 1 is dense and hardly readable due to the large number of nodes and edges in this graph. Also, as discussed in Section 3, this algorithm is unable to identify incorrect paths. Contrary to that, the graph in Figure 7b is composed of only 6 nodes. The various paths taken by learners are compressed into 46 correct and 39 Figure 5. Accuracy of Correct Edges for Algorithm 4 incorrect edges. We can notice that not all paths are accurate. The results for our non-STEM domain are largely consistent with However, the accurate paths are more frequent, as indicated by the the Mathematics domain. This may be attributed to the similarities thicker arcs associated with the edge. In our ongoing work, we are of the underlying tutoring approach for the Assistments system extending these algorithms to use this frequency attribute to and the French Online course which has been developed using the eliminate inaccurate paths (either automatically, or by providing Cognitive Tutor Authoring Tools (CTAT) [2]. However, we can additional controls to model developers in authoring tools). notice two key differences. First, the accuracy of correct edges for A behavior graph from the Physics dataset is shown in Figure 8. this domain is significantly lower. Because the French Online As discussed earlier, the large variation in learner input at each Course is deployed on an publicly accessible platform, its likely state is depicted in the edge labels of this graph. We notice that for that a large number of the solution traces were generated by the last state (s6) which corresponds to the learners filling in the beginners as well as non-serious users leading to the dataset answer to a problem, many minor variations of the correct answer containing many incomplete solution traces containing no correct are accurately captured. Due to the domain independent nature of answers. This is evidenced in Figure 5 as we see that correct edge our algorithms, these answers are treated as different string. accuracy dramatically degrades for long traces which is contrary Integration of domain knowledge can lead to further compression to the case with the other two domains. of these answers into a single path. The linguistic variation in the inputs to a problem in the French dataset is also noticeable in the two graphs for the same problem in Figure 9. We can see the several wrong answers are marked as correct answers (and vice versa), although the frequency-based edge notation identifies the correct answer as was the case in Figure 7b. In this problem, learners are asked to listen to an audio file and type in the French word they hear. Learners are allowed to go back and forth between these two steps. The first step has no wrong answer. We notice that our assumption to consider retracted events as incorrect fails in this case. Figure 9a. Behavior Graph: French, Algorithm 2 Figure 9b. Behavior Graph: French, Algorithm 4 Figure 7a. Behavior Graph: Mathematics, Algorithm 1 6. CONCLUSIONS In this paper, we have shared results from an empirical analysis of application of ABGG algorithms to three different learning domains. Several similarities and differences between the performances of four algorithms on problems from these three domains were discussed in the previous section. We find that the accuracy of these algorithms suffers when they are applied to solution traces collected from a tutoring system that uses domain knowledge to process a large variety of inputs from learners. While in our previous work [4], we have recommended the use of Algorithm 2 as the default ABGG algorithm for use within authoring tools, we find that for language learning domains, Algorithm 4 may be preferable since it is the most accurate on the French dataset and not significantly worse than the Figure 7b. Behavior Graph: Mathematics, Algorithm 2 other algorithms on the other primary metrics. It is particularly interesting to note the differences in the way We identified multiple potential improvements to the ABGG Algorithm 2 and Algorithm 4 encode robustness into the learnt algorithms based on these analyses. There are several domain tutor model. While Algorithm 2 identifies an unordered group specific nuances to the UI elements that comprise the problems in containing the listen and answer nodes which allows learners to each domain. For example, in the French domain, we found steps traverse these nodes in any order, Algorithm 4 identifies that the that do not have any wrong answer. For broad use, ABGG listen step is optional and create two different way to reach the algorithms should identify these UI elements and selectively apply answer step based on the solution behaviors exhibited by learners the powerful assumption about retracted events. Furthermore, the in the traces. algorithms can exploit additional features computed from across the multiple traces, such as the frequency of a data value at a node, to improve the accuracy of the automatically generated behavior graphs. Finally, this paper extends our recent work on use of multiple behavior demonstrations to automatically generate tutor models using ABGG algorithms. While these algorithms can be improved in specific ways discussed above, we find evidence for their applicability to multiple domains. Figure 8. Behavior Graph: Physics, Algorithm 2 ACKNOWLEDGEMENTS Interaction Logs to Improve Educational Outcomes, 7th International Conference on Intelligent Tutoring Systems This research was funded by the US Office of Naval Research (ITS 2004). August 2004 (ONR) contract N00014-12-C-0535. [12] Pavlik, P.I., Cen, H., and Koedinger, K.R. 2009. Learning Factors Transfer Analysis: Using Learning Curve Analysis to 7. REFERENCES Automatically Generate Domain Models, In Proceedings of the 2nd International Conference on Educational Data [1] Johnson, W. L., and Valente, A. 2008. Collaborative Mining (EDM 2009). Barnes, T., Desmarais, M., Romero, C., authoring of serious games for language and culture. In Ventura, S. (Eds.). 121-130 Proceedings of SimTecT (March 2008). [13] Koedinger, K.R., Mclaughlin E.A., and Stamper, J.C. 2012. [2] Aleven, V., McLaren, B. M., Sewall, J., and Koedinger. K. Automated student model improvement, In Proceedings of R. 2006. The cognitive tutor authoring tools (CTAT): the 5th International Conference on Educational Data Mining preliminary evaluation of efficiency gains. In Proceedings of (EDM 2012). Yacef, K., Zaïane, O., Hershkovitz, H., the 8th International Conference on Intelligent Tutoring Yudelson, M., and Stamper, J. (Eds.). 17-24 Systems (ITS'06), Ikeda, M., Ashley, K. D., and Chan, T.W. (Eds.). Springer-Verlag, Berlin, Heidelberg, 61-70. [14] Sudol, L.A, Rivers, K., and Harris, T.K. 2012. Calculating Probabilistic Distance to Solution in a Complex Problem [3] Kumar, R., Roy, M.E, Roberts, R.B., and Makhoul, J.I. 2014. Solving Domain, In Proceedings of the 5th International Towards Automatically Building Tutor Models Using Conference on Educational Data Mining (EDM 2012). Multiple Behavior Demonstrations. In Proceedings of 12th Yacef, K., Zaïane, O., Hershkovitz, H., Yudelson, M., and Intl. Conf. on Intelligent Tutoring Systems (ITS 2014), Stamper, J. (Eds.). 144-147 Honolulu, HI. [15] Johnson, M., Eagle, M., Stamper, J., and Barnes, T. 2013. An [4] Kumar, R., Roy, M.E, Roberts, R.B., and Makhoul, J.I. 2014. Algorithm for Reducing the Complexity of Interaction Comparison of Algorithms for Automatically Building Networks, In Proceedings of the 6thInternational Conference Example-Tracing Tutor Models. In Proceedings of 7th Intl. on Educational Data Mining, (EDM 2013). D’Mello, S. K., Conf. on Educational Data Mining (EDM 2014), Honolulu, Calvo, R. A., Olney, A. (Eds.). 248-251 HI. [16] Ritter, R., Harris, T.K, Nixon, T., Dickison, D., Murray, [5] Aleven, V., Mclaren, B. M., Sewall, J., and Koedinger. K. R. R.C., and Towle, B. 2009. Reducing the Knowledge Tracing 2009. A New Paradigm for Intelligent Tutoring Systems: Space, In Proceedings of the 2ndInternational Conference on Example-Tracing Tutors. Int. J. Artif. Intell. Ed. 19, 2 (April Educational Data Mining (EDM 2009). Barnes, T., 2009), 105-154. Desmarais, M., Romero, C., Ventura, S. (Eds.). 151-160 [6] Kumar, R., Sagae, A., and Johnson, W. L. 2009. Evaluating [17] Gusfield, D. 1997. Algorithms on Strings, Trees and an Authoring Tool for Mini-Dialogs. In Proceedings of the Sequences. Cambridge University Press, New York. 2009 Conference on Artificial Intelligence in Education, Dimitrova, V., Mizoguchi, R., du Boulay, B., and Graesser, [18] Yen, J. Y. 1971. Finding the K Shortest Loopless Paths in a A. (Eds.). IOS Press, Amsterdam, The Netherlands, The Network. Management Science 17(11). 712-716 Netherlands, 647-649. [19] Koedinger, K.R., Baker, R.S.J.d., Cunningham, K., [7] Kumar, R., Roy, M.E, Pattison-Gordon, E. and Roberts, R.B. Skogsholm, A., Leber, B., and Stamper, J. 2010. A Data 2014. General Purpose ITS Development Tools. In Repository for the EDM community: The PSLC DataShop. Proceedings of Workshop on Intelligent Tutoring System In Handbook of Educational Data Mining. Romero, C., Authoring Tools, 12th Intl. Conf. on Intelligent Tutoring Ventura, S., Pechenizkiy, M., Baker, R.S.J.d. (Eds.). Boca Systems (ITS 2014), Honolulu, HI. Raton, FL: CRC Press [8] Question Generation Workshops. 2008-2011. [20] Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N. T., http://www.questiongeneration.org/ Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, [9] Barnes, T. and Stamper, J. 2008. Toward Automatic Hint T. E., Upalekar. R, Walonoski, J.A., Macasek. M.A. and Generation for Logic Proof Tutoring Using Historical Rasmussen, K. P. 2005. The Assistment project: Blending Student Data. In Proceedings of the 9th International assessment and assisting. In Proceedings of the 12th Conference on Intelligent Tutoring Systems (ITS '08). Woolf, International Conference on Artificial Intelligence in B. P., Aimeur, E., Nkambou, R., and Lajoie , S. (Eds.). Education, C.K. Looi, G. McCalla, B. Bredeweg, & J. Springer-Verlag, Berlin, Heidelberg, 373-382. Breuker (Eds.) IOS Press. 555-562. [10] Eagle, M., Johnson, J., and Barnes, T., 2012. Interaction [21] VanLehn, K., Lynch, C., Schulze, K. Shapiro, J. A., Shelby, Networks: Generating High Level Hints Based on Network R., Taylor, L., Treacy, D., Weinstein, A., and Wintersgill, M. Community Clusterings, In Proceedings of the 5th 2005. The Andes physics tutoring system: Lessons Learned. International Conference on Educational Data Mining In International Journal of Artificial Intelligence and (EDM 2012). Yacef, K., Zaïane, O., Hershkovitz, H., Education, 15 (3), 1-47 Yudelson, M., and Stamper, J. (Eds.). 164-167 [22] Strader, R. and Thille, C. 2012. The Open Learning [11] McLaren, B.M., Koedinger, K.R., Schneider, M., Harrer, A., Initiative: Enacting Instruction Online. In Game Changers: and Bollen, L. 2004. Bootstrapping Novice Data: Semi- Education and Information Technologies. Oblinger, D.G. Automated Tutor Authoring Using Student Log Files. In (Ed.) Educause. 201-213. Proceedings of the Workshop on Analyzing Student-Tutor