The Effect of Variations of Prior on Knowledge Tracing

Matti Nelimarkka
School of Information, UC Berkeley
102 South Hall, Berkeley, California 94720-4600
Helsinki Institute for Information Technology HIIT, Aalto University
PO Box 15600, 00076 Aalto, Finland
matti.nelimarkka@hiit.fi

Madeeha Ghori
Department of Electrical Engineering and Computer Sciences, UC Berkeley
387 Soda Hall, Berkeley, California 94720-1776
madeeha.ghori@berkeley.edu

ABSTRACT
Knowledge tracing is a method which approximates a student's knowledge state using a Bayesian network. As the applications of this method increase, it is vital to understand the limits of this approximation. We are interested in how well knowledge tracing performs when students' prior knowledge of the topic is extremely high or low. Our results indicate that the estimates become more erroneous when prior knowledge is extremely high (prior = 0.90).

Keywords
bayesian knowledge tracing, personalization, prior, parameter estimation

1. INTRODUCTION
The Bayesian Knowledge Tracing (BKT) algorithm was developed in 1994 in an effort to model students' changing knowledge state during skill acquisition [5]. The idea is to infer students' knowledge – a hidden variable – from observed answers to a set of questions. The algorithm tracks the change in this probability distribution over time using a simple Bayes' net. The model is often presented as four parameters: prior, learn, guess and slip (see Figure 1; a single update step is sketched at the end of this section). Prior is the probability that the student knows the material initially, before any practice; learn is the probability that a student who did not have the skill acquires it by doing an exercise; guess is the probability of accidentally answering a question correctly without the skill; and slip is the probability of accidentally answering incorrectly despite having the skill.

Figure 1: The model of knowledge tracing

Knowledge tracing is the most prominent method used to model student knowledge acquisition and is used in most intelligent learning systems. These systems have been said to outperform humans since 2001 [3] and have been used in the real world to tutor students [4]. For these reasons it is important to fully understand the strengths and limitations of knowledge tracing before applying it more widely in the classroom. As the parameters of the model are not known in advance, they must be estimated from the given data. Previous research has demonstrated that the accuracy of parameter estimation – and therefore of knowledge tracing – can be improved by applying different heuristics [17, 13] or methods [16, 18], including personalizing the model for each user [20, 8] or extending the data used for analysis [15, 6, 1].

Our work starts from a different premise: how robust is the BKT approach to variation in the parameter space? Our special interest is in the prior variable, which corresponds to a student's knowledge of the topic before answering a question. In any classroom, MOOC or otherwise, some students will come in with a better understanding of the material than others. It is therefore important to study how well knowledge tracing estimates parameters when the prior is extremely high or low.

If knowledge tracing models are inaccurate in modelling students with a certain prior parameter, then smart tutors and other systems designed to help those students learn will be less effective. This is especially concerning if the students being modelled inaccurately are those doing poorly in the class, since smart tutors exist above all to help them.
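To make the four parameters concrete, the following minimal sketch implements one step of the standard BKT update: Bayes' rule conditions the knowledge estimate on the observed answer, and the learn parameter then drives the knowledge transition. The function name and example values are our own illustration, not part of any particular tutoring system.

```python
# A minimal sketch of the standard single-step BKT update,
# assuming the four-parameter formulation described above.

def bkt_update(p_know: float, correct: bool,
               learn: float, guess: float, slip: float) -> float:
    """Return P(know) after observing one answer."""
    if correct:
        # Bayes rule: a correct answer is either known-and-not-slipped
        # or unknown-and-guessed.
        evidence = p_know * (1 - slip) + (1 - p_know) * guess
        posterior = p_know * (1 - slip) / evidence
    else:
        evidence = p_know * slip + (1 - p_know) * (1 - guess)
        posterior = p_know * slip / evidence
    # Learning transition: an unknown skill may be acquired this step.
    return posterior + (1 - posterior) * learn

# Example: a student with prior 0.15 answers the first question correctly.
p = bkt_update(0.15, True, learn=0.10, guess=0.10, slip=0.05)
```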
2. PREVIOUS WORK
For the purposes of this work, we briefly summarize three kinds of methods previously applied to improve the prediction capabilities of BKT models. These methods are, however, insufficient to address the practical problem described above, which motivates our own experiment.

2.1 Individualization
Yudelson et al. [20] experimented with individualization by bringing student-specific parameters into the BKT algorithm on a larger scale. They split the usual skill-specific BKT parameters into two components: one skill-specific and one student-specific. They then built several individualized BKT models, adding student-specific parameters in batches and examining the effect each addition had on the model's performance. They found that student-specific prior parameters did not provide a vast improvement, whereas a student-specific learn parameter significantly improved the model's prediction accuracy.

Pardos and Heffernan furthered this line of work by developing a method of formulating the individualization within the Bayes' net framework [11]. Especially interesting for our work are the different prior values and methods suggested for this individualization. Pardos observes that models using student-specific priors based on students' prior knowledge clearly outperform the traditional knowledge tracing approach. This contrasts with Yudelson et al.'s findings [20], but it still underscores the importance of individualization in the BKT algorithm.

Related to individualization per user, there has also been discussion of using different values per resource. It can be argued that different exercises teach different topics [7, 14]. This can be further used to individualize the model for different topics, an approach which has gained initial support in empirical studies [14].

2.2 Enhancing the data
The second approach is to enhance the data used for prediction. In its simplest form, this can be done by adding additional relevant data, such as data from past years, to the analysis [15]. Others have explored adding general domain-related knowledge to the models, and suggest that this indeed improves the estimates [6].

However, the current direction in enhanced data relates to information available on user interaction – especially in MOOC environments where it is possible to access this kind of data. To illustrate, Baker, Corbett, and Aleven [1] explore interactions with the learning system and other non-exercise data, such as time spent on answering and help requests, to distinguish slips from guesses.

We applaud these efforts and acknowledge that data other than just student responses may indeed help to detect both the cases where initial knowledge (prior) is high and where it is low, instead of tweaking the EM algorithm further.

2.3 Improving the methods
There are several heuristics currently used to enhance the BKT algorithm. One such heuristic involves requiring the sum of slip and guess to be less than or equal to 1 [17]. Other work determined that the starting parameter estimates can affect where the algorithm converges. To improve the accuracy of the convergence, it was suggested that starting parameters be selected from a Dirichlet distribution derived from the data set [2, 13]; a sketch of this idea closes this section.

There have also been efforts to explore other machine learning methods on educational data. Initial trials born in the KDD Cup competition used a medley of random forests and other machine learning algorithms, but these methods have proven largely unsuccessful [16, 18].

The knowledge tracing community, while accepting the validity of some of these heuristics [9, 12], has criticized their inability to provide any insight into the student learning model. Individualization, however, has the potential to improve the BKT algorithm while also providing a pedagogical explanation for said improvements.
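To illustrate that heuristic, the sketch below draws EM starting points from Beta distributions (the two-category case of the Dirichlet) whose means are derived from the data set. The concentration value `strength`, the statistic used to centre the prior, and all names are our assumptions, not the exact procedure of [2, 13].

```python
# A minimal sketch of Dirichlet-style initialization for EM restarts,
# in the spirit of [2, 13]. Concentration and centring statistics are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_start(mean: float, strength: float = 20.0) -> float:
    """Draw one BKT starting parameter from a Beta distribution
    whose mean is a dataset-derived estimate."""
    a = mean * strength
    b = (1 - mean) * strength
    return rng.beta(a, b)

# e.g. centre the prior on the observed first-response accuracy
first_correct_rate = 0.42  # hypothetical dataset statistic
starts = [{"prior": sample_start(first_correct_rate),
           "learn": sample_start(0.15),
           "guess": sample_start(0.10),
           "slip":  sample_start(0.08)} for _ in range(10)]
```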
3. METHODOLOGY
We began by generating datasets with specific, known ground-truth parameters in order to simulate groups of students at different knowledge levels. We then ran expectation maximization (EM) on these datasets and allowed knowledge tracing to calculate its own estimated parameters. Finally, we compared these estimated parameters to the original ones used for generation to determine whether the accuracy of the parameter estimation depends on the ground-truth parameters.

Figure 2: The approach used in this study

3.1 Generating the Data
As our goal was to determine how the ground-truth prior affects parameter estimation, we varied the prior used to synthesize the data sets. We used six different priors (0.15, 0.30, ..., 0.75, 0.90) and two variations each of learn, slip and guess (0.10 and 0.20 for learn and guess; 0.05 and 0.10 for slip; see Table 1), giving a total of 48 parameter combinations. Each data set consists of 10,000 students and 20 observations per student. To increase the variation, we generated 6 datasets per condition; the generative process is sketched at the end of this section. This kind of simulation approach has previously been used to evaluate the success of Bayesian machine learning methods [8].

Table 1: Ground Truth Parameter Sets

                    prior   learn   guess   slip
Set 1.1 ... 1.6     0.15    0.10    0.10    0.05
Set 2.1 ... 2.6     0.30    0.10    0.10    0.05
Set 3.1 ... 3.6     0.15    0.20    0.10    0.05
...
Set 48.1 ... 48.6   0.90    0.20    0.20    0.10

3.2 Analysis Procedure
For each data set, we estimated the parameters with the expectation maximization (EM) algorithm, using the fastHMM implementation [10]. The parameter estimation was conducted as a grid search over ten starting parameter values, and the best-fitting model was selected by log likelihood.

Using our 288 data sets, we can compare the estimates and ground truths for each parameter and analyze the accuracy of the estimates. We apply root-mean-square error (RMSE) and visualizations in this analysis. Using RMSE, we can see whether certain ground truths lend themselves to more accurate estimation.
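For concreteness, the following sketch shows the data synthesis of Section 3.1 under standard BKT generative semantics: each simulated student starts in the known state with probability prior, emits answers through guess and slip, and transitions to the known state via learn. The parameter values follow Table 1; names and structure are our own illustration, and the actual estimation step uses the fastHMM implementation [10], which is not sketched here.

```python
# A minimal sketch of the data synthesis described in Section 3.1,
# assuming standard BKT generative semantics.
import numpy as np

rng = np.random.default_rng(0)

def simulate(prior, learn, guess, slip, n_students=10_000, n_obs=20):
    """Return an (n_students, n_obs) 0/1 matrix of simulated answers."""
    data = np.zeros((n_students, n_obs), dtype=int)
    for s in range(n_students):
        known = rng.random() < prior          # initial knowledge state
        for t in range(n_obs):
            p_correct = (1 - slip) if known else guess
            data[s, t] = rng.random() < p_correct
            if not known:                     # learning transition
                known = rng.random() < learn
    return data

# One of the 48 ground-truth conditions, e.g. Set 1.x in Table 1:
answers = simulate(prior=0.15, learn=0.10, guess=0.10, slip=0.05)
```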
4. RESULTS
First, let us explore the parameter estimation in detail. The average RMSE in the data (Table 2) indicates that the prediction quality decreases as the prior increases; the variance of the RMSE also increases. This indicates that predictions with higher priors are, first, more erroneous and, second, converge in a larger area, producing variance. To confirm these observations, we conducted Wilcoxon-Mann-Whitney tests to explore whether the computed RMSEs differed in a statistically significant manner. As shown in Table 3, the RMSEs computed from the datasets with prior 0.15 differ significantly from all other groups, and those with prior 0.90 differ significantly from all groups except 0.45 (p < 0.05). We therefore conclude that the EM algorithm performs badly when the prior is high.

Table 2: The mean and variance of the root-mean-square errors per prior

Ground truth prior   mean RMSE   var RMSE
0.15                 0.056639    0.000594
0.30                 0.069073    0.001137
0.45                 0.070005    0.000584
0.60                 0.074044    0.001874
0.75                 0.075946    0.002229
0.90                 0.085257    0.004876

Table 3: Significant differences between the RMSEs (p-values)

        0.15   0.30      0.45      0.60      0.75      0.90
0.15    1      < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.30           1         0.347     0.614     0.967     0.014
0.45                     1         0.660     0.125     0.081
0.60                               1         0.744     0.035
0.75                                         1         0.007
0.90                                                   1

To further understand this phenomenon, we explored the estimates per parameter. The errors per parameter are shown in Figure 3. The mean estimates stay fairly close to zero, though a higher prior does affect variance: as the ground-truth prior increases, the variance of guess and learn increases while the variance of prior decreases. In theory, a lower variance in the prior prediction should imply a more accurate prior estimate. However, as we saw in Table 2, this is not actually the case: the prior estimate gets less accurate as the value of the ground-truth prior increases. Figure 3 echoes the results of Table 2: the prediction accuracy decreases when the prior reaches 0.60 and continues to decrease as the prior increases.

Figure 3: Predicting parameters with different values of prior

Figure 4 shows the log likelihood for each of the parameter combinations we analyzed. We see a slight but non-significant increase in the log likelihoods as the prior increases, suggesting that the model is performing better – even while our RMSE error indicator demonstrates otherwise. It is also noteworthy that when slip is 0.10, all log likelihoods range between -65500 and -65250, but when slip is 0.05, all log likelihoods range between -40000 and -35750, indicating that the slip value had a dramatic effect on the model estimation accuracy.

Figure 4: Log likelihoods with different parameters

5. IMPLICATIONS
Our findings indicate that there are higher errors in the parameter estimates when the prior is high (0.90). This is probably due to the lack of evidence available for the HMM to attribute to the learn and guess parameters. One way to examine the impact of these errors is to study students' subjective experience in different conditions [19]. As our data is synthetic, we cannot measure the time students lose due to such errors, as examined by Yudelson and Koedinger [19]. Instead, we explore the difference in the number of questions students need to answer to achieve mastery learning – for our purposes, an estimated knowledge above 95%, assuming that the student answers every question correctly.

Examining the case of high prior knowledge, when the true learn was 0.1, we observed that the majority of students needed more than 5 answers to achieve mastery: of the 168 predicted value sets available, only 24 achieved mastery within 5 responses. For high learn (0.2) the situation was not significantly better: 56 value sets achieved mastery within 5 responses. This indicates that the estimation errors had a significant impact on student learning, and it highlights the importance of this study.
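As an illustration of this mastery criterion, the sketch below counts how many consecutive correct answers are needed before the estimated knowledge exceeds 0.95 under a given parameter set. The function and the 20-question cap (matching our simulated sequences) are our own illustration, not the exact procedure used to produce the counts above.

```python
# A minimal sketch of the mastery criterion: consecutive correct
# answers needed before P(know) exceeds 0.95 under one parameter set.

def steps_to_mastery(prior, learn, guess, slip,
                     threshold=0.95, max_steps=20):
    p = prior
    for step in range(1, max_steps + 1):
        # Bayes update after a correct answer, then learning transition.
        post = p * (1 - slip) / (p * (1 - slip) + (1 - p) * guess)
        p = post + (1 - post) * learn
        if p >= threshold:
            return step
    return None  # mastery not reached within the cap

# e.g. an under-estimated prior forces extra questions before mastery:
steps_to_mastery(prior=0.30, learn=0.10, guess=0.10, slip=0.05)
```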
6. CONCLUSIONS
We started this study with the motivation to explore how well the knowledge tracing method performs when the prior is high or low; this performance has practical implications when applying the approach in a heterogeneous classroom where students arrive with very different knowledge of the domain. We studied this empirically by generating 288 different synthetic datasets and exploring the difference between the predicted parameters and the parameters used to generate each dataset.

Our results indicated a slight increase in the estimation error when the prior was 0.90, which we mostly attribute to higher error in the learn and guess parameters. This observation was statistically significant and is most likely due to the fact that students with high priors produce less information for the HMM to use in estimating the guess and learn parameters.

We explored the influence these errors had on the estimated probability of knowledge and observed that they significantly reduced the speed at which students achieved mastery learning. This result implies that more work needs to be done to detect students with high prior knowledge in order to cater to their learning needs.

Acknowledgments
This work was conducted during the UC Berkeley School of Information class "INFO290: Machine learning in education" instructed by Zach Pardos. We thank the course staff and peers for their support with the presentation.

References
[1] Ryan S. J. d. Baker, Albert T. Corbett, and Vincent Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In Beverley P. Woolf, Esma Aïmeur, Roger Nkambou, and Susanne Lajoie, editors, Intelligent Tutoring Systems, volume 5091 of Lecture Notes in Computer Science, pages 406–415. Springer Berlin Heidelberg, 2008.
[2] Joseph E. Beck and Kai-min Chang. Identifiability: A fundamental problem of student modeling. Pages 137–146, 2007. doi: 10.1007/978-3-540-73078-1_17.
[3] Albert Corbett. Cognitive computer tutors: Solving the two-sigma problem. In User Modeling 2001, volume 2109 of Lecture Notes in Computer Science, pages 137–147. Springer Berlin Heidelberg, 2001.
[4] Albert Corbett, Megan McLaughlin, and K. Christine Scarpinatto. Modeling student knowledge: Cognitive tutors in high school and college. User Modeling and User-Adapted Interaction, 10(2-3):81–108, 2000.
[5] Albert T. Corbett and John R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
[6] Albert T. Corbett and Akshat Bhatnagar. Student modeling in the ACT programming tutor: Adjusting a procedural learning model with declarative knowledge. Courses and Lectures – International Centre for Mechanical Sciences, pages 243–254, 1997.
[7] Tanja Käser, Severin Klingler, Alexander Gerhard Schwing, and Markus Gross. Beyond knowledge tracing: Modeling skill topologies with bayesian networks. In Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha Crosby, and Kitty Panourgia, editors, Intelligent Tutoring Systems, volume 8474 of Lecture Notes in Computer Science, pages 188–198. Springer International Publishing, 2014.
[8] Z. A. Pardos and N. T. Heffernan. Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm. In Proceedings of the 3rd International Conference on Educational Data Mining, 2010.
[9] Z. A. Pardos and N. T. Heffernan. Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset. Journal of Machine Learning Research W & CP, 2010. URL http://people.csail.mit.edu/zp/papers/pardos_JMLR_in_press.pdf.
[10] Z. A. Pardos, M. J. Johnson, et al. Scaling cognitive modeling to massive open environments. TOCHI Special Issue on Learning at Scale, in preparation.
[11] Zachary A. Pardos and Neil T. Heffernan. Modeling individualization in a bayesian networks implementation of knowledge tracing. In Paul De Bra, Alfred Kobsa, and David Chin, editors, User Modeling, Adaptation, and Personalization, volume 6075 of Lecture Notes in Computer Science, pages 255–266. Springer Berlin Heidelberg, 2010. ISBN 978-3-642-13469-2.
[12] Zachary A. Pardos, Sujith M. Gowda, Ryan S. J. d. Baker, and Neil T. Heffernan. The sum is greater than the parts. ACM SIGKDD Explorations Newsletter, 13(2):37, May 2012. ISSN 1931-0145. doi: 10.1145/2207243.2207249. URL http://dl.acm.org/citation.cfm?doid=2207243.2207249.
[13] Dovan Rai, Yue Gong, and Joseph E. Beck. Using dirichlet priors to improve model parameter plausibility. International Working Group on Educational Data Mining, 2009.
[14] Leena Razzaq, Neil T. Heffernan, Mingyu Feng, and Zachary A. Pardos. Developing fine-grained transfer models in the ASSISTment system. Technology, Instruction, Cognition & Learning, 5(3):1–16, 2007.
[15] Steven Ritter, Thomas K. Harris, Tristan Nixon, Daniel Dickison, R. Charles Murray, and Brendon Towle. Reducing the knowledge tracing space. International Working Group on Educational Data Mining, 2009.
[16] A. Töscher and Michael Jahrer. Collaborative filtering applied to educational data mining. Journal of Machine Learning Research, 2010.
[17] Brett van de Sande. Properties of the Bayesian Knowledge Tracing model. Journal of Educational Data Mining, 5(2):1–10, 2013.
[18] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G. McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for KDD Cup 2010. JMLR: Workshop and Conference Proceedings, 1, 2010.
[19] Michael V. Yudelson and Kenneth R. Koedinger. Estimating the benefits of student model improvements on a substantive scale. In Proceedings of the 6th International Conference on Educational Data Mining, 2013.
[20] Michael V. Yudelson, Kenneth R. Koedinger, and Geoffrey J. Gordon. Individualized bayesian knowledge tracing models. In Artificial Intelligence in Education, pages 171–180. Springer, 2013.