=Paper=
{{Paper
|id=Vol-1183/bkt20y_paper08
|storemode=property
|title= A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing
|pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper08.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/DhananiLPP14
}}
== A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing==
Asif Dhanani†, Seung Yeon Lee†, Phitchaya Mangpo Phothilimthana†, Zachary Pardos

University of California, Berkeley

{asifdhanani, sy.lee, mangpo, pardos}@berkeley.edu

∗ For more details of this work, please refer to the full technical report [3].

† Asif Dhanani, Seung Yeon Lee, and Phitchaya Mangpo Phothilimthana contributed equally to this work and are listed alphabetically.

=== ABSTRACT ===
In the knowledge-tracing model, error metrics are used to guide parameter estimation towards values that accurately represent students' dynamic cognitive state. We compare several metrics, including log-likelihood (LL), RMSE, and AUC, to evaluate which metric is most suited for this purpose. In order to examine the effectiveness of using each metric, we measure the correlations between the values calculated by each and the distances from the corresponding points to the ground truth. Additionally, we examine how each metric compares to the others. Our findings show that RMSE is significantly better than LL and AUC. With more knowledge of effective error metrics for learning parameters in the knowledge-tracing model, we hope that better parameter searching algorithms can be created.

=== 1. INTRODUCTION ===
In Bayesian Knowledge Tracing (BKT), one of the essential elements is the error metric that is used for learning the model parameters: prior, learn, guess, and slip. The choice of error metric is crucial because the error metric guides the search towards the best parameters. The BKT model can be fit to student performance data by using a method that finds the best value of an error metric such as log-likelihood (LL), root-mean-squared error (RMSE), or area under the ROC curve (AUC).

As a modeling method, grid search/brute force [1] is often used to find the set of parameters with optimal values of the error metric, and the Expectation Maximization (EM) algorithm [5] is also commonly used to choose parameters maximizing the LL fit to the data. Many studies have compared different modeling approaches [1, 4]. However, the findings vary across studies, and it is still unclear which method is the best at predicting student performance [2].

Pardos and Yudelson compare different error metrics to investigate which one most accurately estimates the moment of learning [6]. Our work extends this comparison by looking more closely at the relationship between three popular error metrics (LL, RMSE, and AUC), and in particular at how they relate to one another close to the ground truth point.

=== 2. METHODOLOGY ===
To assess whether LL, RMSE, or AUC is the best error metric to use in parameter searching for the BKT model, we needed datasets with known parameter values in order to compare these with the parameter values predicted by using different error metrics. Therefore, we synthesized 26 datasets by simulating student responses based on diverse known ground truth parameter values.

Correlations to the ground truth. For each dataset, we evaluated LL, RMSE, and AUC values on all points over the entire prior/learn/guess/slip parameter space with a 0.05 interval. On each point, we calculated students' predicted responses (the probability that students will answer questions correctly). We then used these predicted responses with the actual responses to calculate LL, RMSE, and AUC for all points. To determine which error metric is the best for this purpose, we looked at the correlations between the values calculated from the error metrics (i.e. LL, RMSE, and AUC) and the Euclidean distances from the points to the ground truth. We applied a logarithm to all error metrics other than LL in order to compare everything on the same scale. Finally, we tested whether the correlation between the values calculated by any particular error metric and the distances is significantly stronger than the others' by running one-tailed paired t-tests comparing all three metrics against one another.

Distributions of values. We visualized the values of LL and -RMSE of all points over the two-dimensional guess/slip space with a 0.02 interval while fixing the prior and learn parameter values to the actual ground truth values. Using the guess and slip parameters as the axes, we visualized LL and -RMSE values by color. The colors range from dark red to dark blue, corresponding to values ranging from low to high.

Direct comparison: LL and RMSE. We plotted LL values and RMSE values of all points against each other in order to observe the behavior of the two metrics in detail. We then labeled each data point with a color according to its distance to the ground truth. The range of colors is the same as used in the previous method.

=== 3. RESULTS ===
Correlations to the ground truth. The average LL, RMSE, and AUC correlations were 0.4419, 0.4827, and 0.3983 respectively. We define an error metric A to be better than B if the correlation between the values calculated by A and the distances to the ground truth is higher than that of B. By this definition, RMSE was better than LL on all 26 datasets and better than AUC on 18 of 26 datasets. This is validated by the one-tailed paired t-tests shown in Figure 1, which reveal RMSE as statistically significantly better than both LL and AUC.

{| class="wikitable"
|+ Figure 1: T-test statistics
! Comparison !! Δ of correlations !! t !! p-value
|-
| RMSE > LL || 0.0408 || 8.9900 || << 0.0001
|-
| RMSE > AUC || 0.0844 || 2.7583 || 0.0054
|-
| LL > AUC || 0.0436 || 1.4511 || 0.0796
|}

Distributions of values. Figure 2 shows the heat maps of LL and -RMSE on a representative dataset. If we follow the gradient from the lowest value to the highest value in the LL heat map, we see that it is very high at the beginning (far from the ground truth) and very low at the end (close to the ground truth). Conversely, in the -RMSE heat map, the change in the gradient is low. Additionally, notice that the darkest blue region in the -RMSE heat map is smaller than that in the LL heat map. This suggests that we may be able to refine the proximity of the ground truth better with RMSE.

Figure 2: (a) LL heatmap; (b) -RMSE heatmap. LL and -RMSE values when fixing prior and learn parameter values and varying guess and slip parameter values. Red represents low values, while blue represents high values. The white dots represent the ground truth.

Direct comparison: LL and RMSE. Figure 3 shows an LL vs. -RMSE graph from the most representative dataset. As expected, LL values and RMSE values correlate logarithmically. Additionally, a secondary curve, which we will refer to as the hook, is observed in varying sizes among datasets. The hook converges with the main curve when the -RMSE and LL values are both sufficiently high and the points are very close to the ground truth.

Before this point, when we look at a fixed LL value with varied RMSE values, most points in the hook have higher -RMSE values and are closer to the ground truth than the points in the main curve. However, this same pattern is not seen for a fixed RMSE value with varied LL values. After the curve and hook converge, we can infer that both RMSE and LL will give similar estimates of the ground truth. However, for a portion of the graph before this point, RMSE is a better predictor of ground truth values.

Figure 3: LL vs. -RMSE of dataset 25 when prior = 0.564, learn = 0.8, guess = 0.35, and slip = 0.4.

=== 4. CONCLUSION ===
In our comparison of LL, RMSE, and AUC as metrics for evaluating the closeness of estimated parameters to the true parameters in the knowledge tracing model, we discovered that RMSE serves as the strongest indicator. RMSE has a significantly higher correlation to the distance from the ground truth on average than both LL and AUC, and RMSE is notably better when the estimated parameter value is not very close to the ground truth. The effectiveness of teaching systems without human supervision relies on the ability of the systems to predict the implicit knowledge states of students. We hope that our work can help advance the parameter learning algorithms used in the knowledge tracing model, which in turn can make these teaching systems more effective.

=== 5. REFERENCES ===
[1] R. Baker, A. Corbett, S. Gowda, A. Wagner, B. MacLaren, L. Kauffman, A. Mitchell, and S. Giguere. Contextual slip and prediction of student performance after use of an intelligent tutor. In User Modeling, Adaptation, and Personalization, volume 6075 of Lecture Notes in Computer Science. 2010.

[2] R. S. Baker, Z. A. Pardos, S. M. Gowda, B. B. Nooraei, and N. T. Heffernan. Ensembling predictions of student knowledge within intelligent tutoring systems. In Proceedings of the 19th International Conference on User Modeling, Adaption, and Personalization, 2011.

[3] A. Dhanani, S. Y. Lee, P. Phothilimthana, and Z. Pardos. A comparison of error metrics for learning model parameters in Bayesian knowledge tracing. Technical Report UCB/EECS-2014-131, EECS Department, University of California, Berkeley, May 2014.

[4] Y. Gong, J. Beck, and N. Heffernan. Comparing knowledge tracing and performance factor analysis by using multiple model fitting procedures. In Intelligent Tutoring Systems, volume 6094 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010.

[5] Z. Pardos and N. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In User Modeling, Adaptation, and Personalization. 2010.

[6] Z. A. Pardos and M. V. Yudelson. Towards moment of learning accuracy. In Proceedings of the 1st AIED Workshop on Simulated Learners, 2013.
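
The grid evaluation described in the methodology (simulate responses from known parameters, predict responses at every grid point, compute LL, RMSE, and AUC, then correlate each metric with the Euclidean distance to the ground truth) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the standard BKT update without forgetting is assumed, the grid is much coarser than the 0.05 interval used in the paper, and the sign convention (each metric oriented so that larger means worse before correlating with distance) is a guess, since the text does not spell it out.

```python
import itertools
import math
import random


def simulate(n_students, n_items, prior, learn, guess, slip, rng):
    """Draw 0/1 response sequences from a BKT model with known parameters."""
    data = []
    for _ in range(n_students):
        known = rng.random() < prior
        seq = []
        for _ in range(n_items):
            p_correct = (1 - slip) if known else guess
            seq.append(1 if rng.random() < p_correct else 0)
            known = known or rng.random() < learn  # no forgetting (assumption)
        data.append(seq)
    return data


def predict(responses, prior, learn, guess, slip):
    """Return P(correct) computed before each observed response of one student."""
    p_known, preds = prior, []
    for r in responses:
        p_correct = p_known * (1 - slip) + (1 - p_known) * guess
        preds.append(p_correct)
        # Posterior P(known) given the observed response (Bayes' rule).
        if r == 1:
            post = p_known * (1 - slip) / p_correct
        else:
            post = p_known * slip / (1 - p_correct)
        p_known = post + (1 - post) * learn  # learning transition
    return preds


def log_likelihood(y, p):
    return sum(math.log(pi if yi else 1 - pi) for yi, pi in zip(y, p))


def rmse(y, p):
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))


def auc(y, p):
    """Pairwise (Mann-Whitney) AUC; ties count as half a win."""
    pos = [pi for yi, pi in zip(y, p) if yi]
    neg = [pi for yi, pi in zip(y, p) if not yi]
    if not pos or not neg:
        return 0.5  # undefined with a single class; neutral value (assumption)
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def grid_correlations(data, truth, grid):
    """Correlate each metric with the distance to the ground truth over a grid."""
    y = [r for seq in data for r in seq]
    rows = []
    for params in itertools.product(grid, repeat=4):  # prior, learn, guess, slip
        p = [q for seq in data for q in predict(seq, *params)]
        rows.append((
            math.dist(params, truth),
            -log_likelihood(y, p),   # assumed orientation: larger = worse
            math.log(rmse(y, p)),    # logarithm applied to non-LL metrics
            -math.log(auc(y, p)),
        ))
    dists = [row[0] for row in rows]
    names = ("LL", "RMSE", "AUC")
    return {n: pearson([row[i] for row in rows], dists)
            for i, n in enumerate(names, start=1)}


if __name__ == "__main__":
    truth = (0.4, 0.2, 0.2, 0.1)  # hypothetical ground truth, not from the paper
    data = simulate(20, 10, *truth, random.Random(0))
    print(grid_correlations(data, truth, [0.1, 0.3, 0.5, 0.7, 0.9]))
```

Repeating this over many simulated datasets and feeding the per-dataset correlations into a paired t-test would mirror the comparison summarized in Figure 1. The grid values stay strictly inside (0, 1) so that the logarithms and the Bayes update never divide by zero.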