                    A Brief Overview of Metrics for Evaluation of Student Models

                                                      Radek Pelánek
                                                   Masaryk University Brno
                                                   pelanek@fi.muni.cz


ABSTRACT
Many different metrics are used to evaluate and compare the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss the properties, advantages, and disadvantages of different metrics, and to summarize current practice in research papers. The paper should serve as a starting point for a workshop discussion about the use of metrics in student modeling.
1.   INTRODUCTION
A key part of intelligent tutoring systems are models that estimate the knowledge of students. To compare and improve these models we use metrics that measure the quality of model predictions. Metrics are also used (sometimes implicitly) for parameter fitting, since many fitting procedures try to optimize parameters with respect to some metric.

At the moment there is no standard metric for model evaluation, and thus researchers have to decide which metric to use. The choice of metric is an important step in the research process. Differences in predictions between competing models are often small, and the choice of metric can influence the results more than the choice of a parameter fitting procedure. Moreover, fitted model parameters are often used in subsequent steps in educational data mining, and thus the choice of metric can indirectly influence many other aspects of the research.

However, despite the fact that the choice of metric is important and that there is no clear consensus on the usage of performance metrics, the topic gets very little attention in most research papers. Most authors do not provide any rationale for their choice of metric. Sometimes it is not even clear exactly which metric is used, so it may even be difficult to use the same metric as previous authors. The main aim of this paper is to give an overview of performance metrics relevant for the evaluation of student models and to explicitly discuss points that are omitted in most papers.

2.    OVERVIEW OF METRICS
To attain a clear focus we discuss only models that predict the probability of a correct answer. We assume that we have data about n answers, numbered i ∈ {1, …, n}; the correctness of answers is given by c_i ∈ {0, 1}, and a student model provides predictions p_i ∈ [0, 1]. A model performance metric is a function f(p⃗, c⃗). Note that the word "metric" is used here in the sense of "any function that is used to make comparisons", not in the mathematical sense of a distance function. Since we are interested in using the metrics for comparison, monotone transformations (square root, logarithm, multiplication by a constant) are inconsequential and are used mainly for better interpretability (or sometimes rather for traditional reasons).

2.1    Mean Absolute Error
This basic metric considers the absolute differences between predictions and answers: MAE = (1/n) ∑_{i=1}^{n} |c_i − p_i|. This is not a suitable performance metric, because it prefers models which are biased towards the majority result. As a simple illustration, consider a simulated student who answers correctly with constant probability 0.7. If we compare different constant predictors with respect to this metric, we get that the best model is the one which predicts the probability of a correct answer to be 1. This is clearly not a desirable result. As this example illustrates, the use of MAE can lead to rather misleading conclusions. Despite this clear disadvantage, MAE is sometimes used for evaluation (although mostly in combination with other metrics, which reduces the risk of misleading conclusions in published papers).
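To make this illustration concrete, the following short Python sketch (not part of the original paper; the simulation setup simply follows the example above) compares constant predictors under MAE on data from a simulated student who answers correctly with probability 0.7:

    import random

    random.seed(0)
    # Simulated student: each answer is correct with constant probability 0.7.
    answers = [1 if random.random() < 0.7 else 0 for _ in range(10000)]

    def mae(predictions, correctness):
        """Mean absolute error between predictions and 0/1 correctness values."""
        return sum(abs(c - p) for c, p in zip(correctness, predictions)) / len(correctness)

    # Compare constant predictors p = 0.0, 0.1, ..., 1.0.
    for p in [x / 10 for x in range(11)]:
        print(f"constant prediction {p:.1f}: MAE = {mae([p] * len(answers), answers):.3f}")
    # MAE keeps decreasing up to p = 1.0, so the "best" constant predictor is the
    # degenerate one, not the true probability 0.7 -- the bias described above.

On this data the sketch reports an MAE of roughly 0.30 for the constant prediction 1.0 versus roughly 0.42 for the calibrated prediction 0.7, so the degenerate predictor wins.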
2.2    Root Mean Square Error
A similar metric is obtained by using squared values instead of absolute values: RMSE = √((1/n) ∑_{i=1}^{n} (c_i − p_i)²). Note that from the perspective of model comparison, the important part is only the sum of squared errors (SSE). The square root in RMSE is traditionally used to get the result in the same units as the original "measurements" and thus to improve the interpretability of the resulting number. In the particular context of student modeling and evaluation of probabilities, this is not particularly useful, since the resulting numbers are hard to interpret anyway. In order to get better interpretability researchers sometimes use the R² metric: R² = 1 − ∑_{i=1}^{n} (c_i − p_i)² / ∑_{i=1}^{n} (c_i − c̄)². With respect to comparison of models, R² is equivalent to RMSE, since here again the only model dependent part is the sum of squared errors. In the context of standard linear regression (where it is most commonly used) R² has a nice interpretation as "explained variability". In the case of logistic regression (which is more similar to student models) this interpretation does not hold and different "pseudo R²" metrics are used (e.g., Cox and Snell, McFadden, Nagelkerke). Thus a disadvantage of R² is that unless the authors are explicit about which version of R² they use (usually they are not), a reader cannot know for sure which metric is being reported.

In educational data mining the use of the RMSE metric is very common (it was also used as the metric in the KDD Cup 2010 focused on student performance evaluation). In other areas, particularly in meteorology, the mean square error (RMSE without the square root) is called the Brier score [1]. The Brier score is often decomposed into additive components (e.g., reliability and refinement) which provide further insight into the behaviour of the predictor. Moreover, in an analogy to the AUC metric and the ROC curve (described below), this metric can be interpreted as the area under Brier curves. These methods may provide interesting inspirations for student modeling.
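As a sketch of the equivalence noted above (again not from the paper; the simulated data mirrors the MAE example), the following Python snippet computes RMSE and R² for constant predictors and shows that both depend on the model only through SSE, so they rank models identically; unlike MAE, they favour the well-calibrated prediction 0.7:

    import math
    import random

    random.seed(0)
    answers = [1 if random.random() < 0.7 else 0 for _ in range(10000)]

    def sse(predictions, correctness):
        """Sum of squared errors -- the only model-dependent part of RMSE and R^2."""
        return sum((c - p) ** 2 for c, p in zip(correctness, predictions))

    def rmse(predictions, correctness):
        return math.sqrt(sse(predictions, correctness) / len(correctness))

    def r_squared(predictions, correctness):
        mean_c = sum(correctness) / len(correctness)
        total = sum((c - mean_c) ** 2 for c in correctness)
        return 1 - sse(predictions, correctness) / total

    # RMSE is minimized (and R^2 maximized) by the constant prediction closest
    # to the empirical success rate (about 0.7); both orderings follow SSE alone.
    for p in [x / 10 for x in range(11)]:
        preds = [p] * len(answers)
        print(f"p={p:.1f}  RMSE={rmse(preds, answers):.3f}  R^2={r_squared(preds, answers):.3f}")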
2.3    Metrics Based on Likelihood
The likelihood of the data (the answers) given a model (the predicted probabilities) is L = ∏_{i=1}^{n} p_i^{c_i} (1 − p_i)^{1−c_i}. Since we are indifferent to monotonic transformations, we typically work with the numerically more stable logarithm of the likelihood: LL = ∑_{i=1}^{n} c_i log(p_i) + (1 − c_i) log(1 − p_i). This metric can also be interpreted from an information theoretic perspective as a measure of the data compression provided by a model [4]. The log-likelihood metric can be further extended into metrics like the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These metrics penalize a large number of model parameters and thus aim to avoid overfitting. In the context of student modeling it is typically much better to address the issue of overfitting by cross-validation. Since AIC and BIC provide a faster way to assess models than cross-validation, they may be useful as heuristics in some algorithms (e.g., learning factor analysis), but they are not serious contenders for proper model comparison.

MAE, RMSE and LL all have the form of a "sum of penalties for individual errors" and differ only in the function which specifies the penalty. For RMSE and LL the penalty functions have quite similar values; the main difference is in the interval [0.95, 1], i.e., in cases where the predictor is confident and wrong. Such cases are penalized very prohibitively by LL, whereas RMSE is relatively benevolent. In fact the LL metric is unbounded, so a single wrong prediction (if it is too confident) can ruin the performance of a model. This property is usually undesirable and an artificial bound is used, which basically corresponds to forcing the possibility of slip and guess behaviour into the model. After this modification the penalties for RMSE and LL are rather similar. Nevertheless, the LL approach of "penalizing mainly predictions which are confident and wrong" is reasonable; it is thus rather surprising that this metric is used only marginally in the evaluation of student models (mostly in connection with AIC or BIC).
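A minimal Python sketch of the bounded log-likelihood described above (not taken from the paper; the bound of 0.01 is an arbitrary illustrative choice) also shows how much more harshly LL penalizes a confident wrong prediction than squared error does:

    import math

    def log_likelihood(predictions, correctness, bound=0.01):
        """Log-likelihood of the observed answers. Predictions are clipped to
        [bound, 1 - bound], which keeps the metric bounded and corresponds to
        forcing a minimal slip/guess probability into the model."""
        total = 0.0
        for c, p in zip(correctness, predictions):
            p = min(max(p, bound), 1 - bound)
            total += c * math.log(p) + (1 - c) * math.log(1 - p)
        return total

    # Penalty for a single answer that is incorrect but predicted with p = 0.99:
    print(-log_likelihood([0.99], [0]))  # about 4.6 (negative log-likelihood)
    print((0 - 0.99) ** 2)               # 0.9801 (squared-error penalty)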
2.4    Area Under an ROC Curve
Another popular metric is based on the receiver operating characteristic (ROC) curve. If we want to classify predictions into just two discrete classes (correct, incorrect), we need to select a threshold for the classification. For a fixed threshold we can compute standard metrics like precision, recall, and accuracy. If we do not want to use a fixed threshold, we can use the ROC curve, which summarises the behaviour of the prediction model over all possible thresholds. The curve has the "false positive rate" on the x-axis and the "true positive rate" on the y-axis; each point of the curve corresponds to a choice of a threshold. The area under the ROC curve (AUC) provides a summary performance measure across all possible thresholds. It is equal to the probability that a randomly selected correct answer has a higher predicted score than a randomly selected incorrect answer. The area under the curve can be approximated using the A' metric, which is equivalent to the well-studied Wilcoxon statistic [2]. This connection provides ways to study the statistical significance of results (but requires attention to the assumptions of the tests, e.g., independence).

The ROC curve and the AUC metric are successfully used in many different research areas, but their use is sometimes also criticised [3], e.g., because the metric summarises performance over all possible thresholds, even over those for which the classifier would never be used in practice. From the perspective of student modeling, the main reservation seems to be that this approach focuses on classification and considers predictions only in a relative way: note that if all predictions are divided by 2, the AUC metric stays the same.

In the context of student modeling we are usually not interested in classification; we are often interested directly in the absolute values of probabilities, and we need these values to be properly calibrated. The probabilities are often compared to a fixed constant (typically 0.95) as an indication of a mastered skill, and the specific value is meant to carry a certain meaning. Probabilistic estimates can also be used to guide the behaviour of a system to achieve a suitable challenge for students, e.g., by choosing questions of the right difficulty or by modifying difficulty through the number of options in multiple choice questions.

Nevertheless, despite this disadvantage, AUC is widely used for the evaluation of student models, often as the only metric. It seems that in some cases AUC is used as the only metric for the final evaluation, while the parameter fitting procedure (implicitly) uses a different metric (RMSE or LL). Particularly in cases of brute force fitting this approach seems strange and should be at least explicitly mentioned.
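To illustrate the probabilistic interpretation and the scale-insensitivity noted above, here is a small Python sketch (an assumed illustration, not code from the paper) that computes AUC directly as the probability that a correct answer receives a higher prediction than an incorrect one:

    def auc(predictions, correctness):
        """AUC via its probabilistic interpretation: the probability that a randomly
        chosen correct answer has a higher predicted score than a randomly chosen
        incorrect one (ties count as 0.5). Equivalent to the Wilcoxon statistic."""
        pos = [p for p, c in zip(predictions, correctness) if c == 1]
        neg = [p for p, c in zip(predictions, correctness) if c == 0]
        wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
                   for pp in pos for pn in neg)
        return wins / (len(pos) * len(neg))

    preds = [0.9, 0.8, 0.6, 0.4, 0.3]
    correct = [1, 1, 0, 1, 0]
    print(auc(preds, correct))                   # 0.8333...
    # Halving every prediction ruins calibration but preserves the ranking,
    # so the AUC value does not change:
    print(auc([p / 2 for p in preds], correct))  # 0.8333...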
3.   REFERENCES
[1] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
[2] J. Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proc. of Graphics Interface 2005, pages 129–136, 2005.
[3] J. M. Lobo, A. Jiménez-Valverde, and R. Real. AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2):145–151, 2008.
[4] M. S. Roulston and L. A. Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6), 2002.