=Paper=
{{Paper
|id=Vol-1183/bkt20y_paper08
|storemode=property
|title= A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing
|pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper08.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/DhananiLPP14
}}
== A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing==
<pdf width="1500px">https://ceur-ws.org/Vol-1183/bkt20y_paper08.pdf</pdf>
<pre>
         A Comparison of Error Metrics for Learning Model
           Parameters in Bayesian Knowledge Tracing ∗

                   †                            †                                               †
  Asif Dhanani           Seung Yeon Lee              Phitchaya Mangpo Phothilimthana                 Zachary Pardos
                                              University of California, Berkeley
                             {asifdhanani, sy.lee, mangpo, pardos}@berkeley.edu

ABSTRACT                                                         by looking more closely at the relationship between three
In the knowledge-tracing model, error metrics are used to        popular error metrics: LL, RMSE, and AUC, and particularly
guide parameter estimation towards values that accurately        elucidating their relationship to one another close to the
represent students’ dynamic cognitive state. We compare          ground truth point.
several metrics, including log-likelihood (LL), RMSE, and
AUC, to evaluate which metric is most suited for this pur-
pose. In order to examine the effectiveness of using each        2.   METHODOLOGY
metric, we measure the correlations between the values cal-      To assess whether LL, RMSE, or AUC is the best error met-
culated by each and the distances from the corresponding         ric to use in parameter searching for the BKT model, we
points to the ground truth. Additionally, we examine how         needed datasets with known parameter values in order to
each metric compares to the others. Our findings show that       compare these with the parameter values predicted by us-
RMSE is significantly better than LL and AUC. With more          ing different error metrics. Therefore, we synthesized 26
knowledge of effective error metrics for learning parameters     datasets by simulating student responses based on diverse
in the knowledge-tracing model, we hope that better param-       known ground truth parameter values.
eter searching algorithms can be created.
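
The dataset synthesis described in Section 2 can be illustrated with a
minimal sketch. It assumes the standard BKT generative process with no
forgetting. The function name, the number of students, the sequence length,
and the seed are hypothetical choices, since the paper does not state them.
The parameter values are the ones quoted in the Figure 3 caption.

```python
import random

def simulate_bkt(prior, learn, guess, slip, n_students=100, n_steps=10, seed=0):
    """Simulate binary responses from a standard BKT model (assumed: no forgetting)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_students):
        # Draw the initial knowledge state from the prior.
        known = rng.random() < prior
        seq = []
        for _ in range(n_steps):
            # A student who knows the skill answers correctly unless they slip;
            # a student who does not know it answers correctly only by guessing.
            p_correct = (1.0 - slip) if known else guess
            seq.append(1 if rng.random() < p_correct else 0)
            # Learning transition after each practice opportunity.
            if not known and rng.random() < learn:
                known = True
        data.append(seq)
    return data

# Ground truth values taken from the Figure 3 caption (dataset 25).
responses = simulate_bkt(prior=0.564, learn=0.8, guess=0.35, slip=0.4)
```

Each inner list holds one simulated student's 0/1 responses. Fixing the seed
makes the dataset reproducible.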
                                                                 Correlations to the ground truth. For each dataset, we
1.   INTRODUCTION                                                evaluated LL, RMSE, and AUC values on all points over the
In Bayesian Knowledge Tracing (BKT), one of the essential        entire prior/learn/guess/slip parameter space with a 0.05
elements is the error metric that is used for learning model     interval. On each point, we calculated students’ predicted
parameters: prior, learn, guess, and slip. The choice of         responses (probability that students will answer questions
error metric is crucial because the metric guides the            correctly). We then used these predicted responses with the
search toward the best parameters. The BKT model can be          actual responses to calculate LL, RMSE, and AUC for all
fit to student performance data by using a method that           points. To determine which error metric is the best for this
finds the best value of an error metric, such as                 purpose, we looked at the correlations between values
log-likelihood (LL), root-mean-squared error                     calculated from error metrics (i.e., LL, RMSE, and AUC) and
(RMSE), or area under the ROC curve (AUC).                       the Euclidean distances from the points to the ground truth.
                                                                 We applied a logarithm to all error metrics other than LL in
As a modeling method, grid search/brute force [1] is often       order to compare everything on the same scale. Finally, we
used to find the set of parameters with optimal values of        tested whether the correlation between the values calculated
the error metric, and the Expectation Maximization (EM)          by any particular error metric and the distances is
algorithm [5] is also commonly used to choose parameters         significantly stronger than the others’ by running one-tailed
maximizing the LL fit to the data. Many studies have             paired t-tests comparing all three metrics against one another.
compared different modeling approaches [1, 4]. However, the
findings vary across the studies, and it remains unclear         Distributions of values. We visualized the values of LL
which method is the best at predicting student                   and -RMSE of all points over the two-dimensional guess/slip
performance [2].                                                 space with a 0.02 interval while fixing the prior and learn
                                                                 parameter values to the actual ground truth values. Using the
Pardos and Yudelson compare different error metrics to           guess and slip parameters as the axes, we visualized LL and
investigate which one most accurately estimates the              -RMSE values by color. The colors range from dark red to
moment of learning [6]. Our work extends this comparison         dark blue corresponding to the values ranging from low to
∗For more details of this work, please refer to the full tech-   high.
nical report [3].
†Asif Dhanani, Seung Yeon Lee, and Phitchaya Mangpo              Direct comparison: LL and RMSE. We plotted LL val-
Phothilimthana contributed equally to this work and are          ues and RMSE values of all points against each other in or-
listed alphabetically.                                           der to observe the behavior of the two metrics in detail. We
                                                                 then labeled each data point by its distance to the ground
                                                                 truth with a color. The range of colors is the same as used
                                                                 in the previous method.
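
As a rough illustration of the evaluation step in Section 2, the sketch below
computes predicted responses at one parameter point and then LL, RMSE, and
AUC from those predictions. It assumes the standard BKT forward update with
guess and slip strictly between 0 and 1. All function names are hypothetical,
and the small y/p vectors are made-up values for illustration, not data from
the paper.

```python
import math

def bkt_predictions(responses, prior, learn, guess, slip):
    """Predicted P(correct) before each response for one student."""
    p_known = prior
    preds = []
    for r in responses:
        p_correct = p_known * (1.0 - slip) + (1.0 - p_known) * guess
        preds.append(p_correct)
        # Bayes update of P(known) given the observed response.
        if r == 1:
            posterior = p_known * (1.0 - slip) / p_correct
        else:
            posterior = p_known * slip / (1.0 - p_correct)
        # Learning transition (assumed: no forgetting).
        p_known = posterior + (1.0 - posterior) * learn
    return preds

def log_likelihood(y, p, eps=1e-12):
    """Sum of log-probabilities assigned to the observed responses."""
    return sum(math.log(max(eps, pi if yi == 1 else 1.0 - pi)) for yi, pi in zip(y, p))

def rmse(y, p):
    """Root-mean-squared error between responses and predictions."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))

def auc(y, p):
    """Probability that a random correct response is ranked above a random
    incorrect one (ties count one half). Quadratic time, fine for small data."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up responses and predictions, for illustration only.
y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.6, 0.4]
ll, err, area = log_likelihood(y, p), rmse(y, p), auc(y, p)
```

Repeating this at every grid point, pooling the predictions of all students,
gives one LL, RMSE, and AUC value per point.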
     Comparison    ∆ of correlations      t      p-value
     RMSE > LL           0.0408        8.9900   << 0.0001
     RMSE > AUC          0.0844        2.7583     0.0054
     LL > AUC            0.0436        1.4511     0.0796

               Figure 1: T-test statistics
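
The t column in Figure 1 comes from one-tailed paired t-tests over the
per-dataset correlations. A minimal sketch of the t statistic follows. The
function name is hypothetical and the two input lists are made-up numbers for
illustration only, not the paper's 26 correlations. It computes only t and the
degrees of freedom. A p-value would additionally need the CDF of Student's t
distribution, which the Python standard library does not provide.

```python
import math
import statistics

def paired_t(a, b):
    """t statistic and degrees of freedom for the paired differences a[i] - b[i].

    For the one-tailed alternative "mean difference > 0", a large positive t
    is evidence that metric a correlates more strongly than metric b.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1

# Made-up per-dataset correlations, for illustration only.
corr_rmse = [0.48, 0.50, 0.52, 0.46, 0.49]
corr_ll = [0.46, 0.46, 0.46, 0.42, 0.45]
t, dof = paired_t(corr_rmse, corr_ll)
```

Pairing by dataset removes between-dataset variation, so only the
per-dataset difference in correlation drives the test.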


                                                                 Figure 3: LL vs -RMSE of dataset 25 when prior =
                                                                 0.564, learn = 0.8, guess = 0.35 , and slip = 0.4
       (a) LL Heatmap              (b) -RMSE Heatmap

Figure 2: LL and -RMSE values when fixing prior                  4.   CONCLUSION
and learn parameter values and varying guess and                 In our comparison of LL, RMSE, and AUC as metrics for
slip parameter values. Red represents low values,                evaluating the closeness of estimated parameters to the true
while blue represents high values. The white dots                parameters in the knowledge tracing model, we discovered
represent the ground truth.                                      that RMSE serves as the strongest indicator. RMSE has
                                                                 a significantly higher correlation to the distance from the
3.    RESULTS                                                    ground truth on average than both LL and AUC, and RMSE
                                                                 is notably better when the estimated parameter value is not
Correlations to the ground truth. The average LL, RMSE,          very close to the ground truth. The effectiveness of teach-
and AUC correlations were 0.4419, 0.4827, and 0.3983 re-         ing systems without human supervision relies on the ability
spectively. We say that an error metric A is better than         of the systems to predict the implicit knowledge states of
B if the correlation between the values calculated by A          students. We hope that our work can help advance the pa-
and the distances to the ground truth is higher than             rameter learning algorithms used in the knowledge tracing
that of B. By this definition, RMSE was better than LL on        model, which in turn can make these teaching systems more
all 26 datasets and better than AUC on 18 of 26 datasets.        effective.
This is validated by the one-tailed paired t-test shown in
Figure 1 revealing RMSE as statistically significantly better    5.   REFERENCES
than both LL and AUC.                                            [1] R. Baker, A. Corbett, S. Gowda, A. Wagner,
                                                                     B. MacLaren, L. Kauffman, A. Mitchell, and
Distributions of values. Figure 2 shows the heat maps of             S. Giguere. Contextual slip and prediction of student
LL and -RMSE on a representative dataset. If we follow the           performance after use of an intelligent tutor. In User
gradient from the lowest value to the highest value in the           Modeling, Adaptation, and Personalization, volume
LL heat map, we see that it is very high at the beginning            6075 of Lecture Notes in Computer Science. 2010.
(far from the ground truth) and is very low at the end (close
to the ground truth). Conversely, in the -RMSE heat map,         [2] R. S. Baker, Z. A. Pardos, S. M. Gowda, B. B. Nooraei,
the change in the gradient is low. Additionally, notice that         and N. T. Heffernan. Ensembling predictions of student
the darkest blue region in the -RMSE heat map is smaller             knowledge within intelligent tutoring systems. In
than that in the LL heat map. This suggests that RMSE may            Proceedings of the 19th International Conference on
narrow down the region near the ground truth more tightly.           User Modeling, Adaption, and Personalization, 2011.
                                                                 [3] A. Dhanani, S. Y. Lee, P. Phothilimthana, and
Direct comparison: LL and RMSE. Figure 3 shows an LL                 Z. Pardos. A comparison of error metrics for learning
vs -RMSE graph from the most representative dataset. As              model parameters in bayesian knowledge tracing.
expected, LL values and RMSE values correlate logarithmi-            Technical Report UCB/EECS-2014-131, EECS
cally. Additionally, a secondary curve, which we will refer          Department, University of California, Berkeley, May
to as the hook, is observed in varying sizes among datasets.         2014.
The hook converges with the main curve when the -RMSE            [4] Y. Gong, J. Beck, and N. Heffernan. Comparing
and LL values are both sufficiently high and the points are          knowledge tracing and performance factor analysis by
very close to the ground truth.                                      using multiple model fitting procedures. In Intelligent
                                                                     Tutoring Systems, volume 6094 of Lecture Notes in
Before this point, when we look at a fixed LL value with             Computer Science. Springer Berlin Heidelberg, 2010.
varied RMSE values, most points in the hook have higher          [5] Z. Pardos and N. Heffernan. Modeling individualization
-RMSE values and are closer to the ground truth than do the          in a bayesian networks implementation of knowledge
points in the main curve. However, this same pattern is not          tracing. In User Modeling, Adaptation, and
seen for a fixed RMSE value with varied LL values. After the         Personalization. 2010.
curve and hook converge, we can infer that both RMSE and         [6] Z. A. Pardos and M. V. Yudelson. Towards moment of
LL will give similar estimates of the ground truth. However,         learning accuracy. In Proceedings of the 1st AIED
for a portion of the graph before this point, RMSE is a better       Workshop on Simulated Learners, 2013.
predictor of ground truth values.
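
The correlation analysis behind these results can be sketched end to end on a
toy scale. The example below simulates one dataset, sweeps a coarse parameter
grid, and correlates log(RMSE) with the Euclidean distance to the ground
truth. It assumes standard BKT without forgetting and Pearson correlation.
The ground truth values, the grid (much coarser than the paper's 0.05
interval), the dataset size, and all function names are hypothetical choices
made to keep the run short.

```python
import itertools
import math
import random

def simulate(prior, learn, guess, slip, n_students, n_steps, rng):
    """Generate 0/1 response sequences from a BKT model (assumed: no forgetting)."""
    data = []
    for _ in range(n_students):
        known = rng.random() < prior
        seq = []
        for _ in range(n_steps):
            seq.append(1 if rng.random() < ((1.0 - slip) if known else guess) else 0)
            if not known and rng.random() < learn:
                known = True
        data.append(seq)
    return data

def predict(seq, prior, learn, guess, slip):
    """Predicted P(correct) before each response, via the BKT forward update."""
    p_known, preds = prior, []
    for r in seq:
        p_correct = p_known * (1.0 - slip) + (1.0 - p_known) * guess
        preds.append(p_correct)
        if r == 1:
            post = p_known * (1.0 - slip) / p_correct
        else:
            post = p_known * slip / (1.0 - p_correct)
        p_known = post + (1.0 - post) * learn
    return preds

def rmse_at(data, params):
    """RMSE over all students' responses at one (prior, learn, guess, slip) point."""
    sq_sum, n = 0.0, 0
    for seq in data:
        for r, p in zip(seq, predict(seq, *params)):
            sq_sum += (r - p) ** 2
            n += 1
    return math.sqrt(sq_sum / n)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

truth = (0.5, 0.2, 0.2, 0.1)  # hypothetical ground truth parameters
data = simulate(*truth, n_students=100, n_steps=8, rng=random.Random(0))

grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # coarse grid; the paper uses a 0.05 interval
points = list(itertools.product(grid, repeat=4))
log_rmse = [math.log(rmse_at(data, pt)) for pt in points]
dists = [math.dist(pt, truth) for pt in points]
r = pearson(log_rmse, dists)
```

Swapping rmse_at for an LL or AUC computation and repeating over many
datasets would give the per-metric correlations that the t-tests compare.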

</pre>