A Brief Overview of Metrics for Evaluation of Student Models

Radek Pelánek
Masaryk University Brno
pelanek@fi.muni.cz

ABSTRACT
Many different metrics are used to evaluate and compare the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss properties, advantages, and disadvantages of different metrics, and to summarize current practice in research papers. The paper should serve as a starting point for a workshop discussion about the use of metrics in student modeling.

1. INTRODUCTION
Models that estimate the knowledge of students are a key part of intelligent tutoring systems. To compare and improve these models we use metrics that measure the quality of model predictions. Metrics are also used (sometimes implicitly) for parameter fitting, since many fitting procedures try to optimize parameters with respect to some metric.

At the moment there is no standard metric for model evaluation and thus researchers have to decide which metric to use. The choice of metric is an important step in the research process. Differences in predictions between competing models are often small, and the choice of metric can influence the results more than the choice of a parameter fitting procedure. Moreover, fitted model parameters are often used in subsequent steps in educational data mining, and thus the choice of metric can indirectly influence many other aspects of the research.

However, despite the fact that the choice of metric is important and that there is no clear consensus on the usage of performance metrics, the topic gets very little attention in most research papers. Most authors do not provide any rationale for their choice of metric. Sometimes it is not even clear which metric is used exactly, so it may even be difficult to use the same metric as previous authors. The main aim of this paper is to give an overview of performance metrics relevant for the evaluation of student models and to explicitly discuss points that are omitted in most papers.

2. OVERVIEW OF METRICS
To attain a clear focus we discuss only models that predict the probability of a correct answer. We assume that we have data about $n$ answers, numbered $i \in \{1, \ldots, n\}$; the correctness of answers is given by $c_i \in \{0, 1\}$, and a student model provides predictions $p_i \in [0, 1]$. A model performance metric is a function $f(\vec{p}, \vec{c})$. Note that the word "metric" is used here in the sense of "any function that is used to make comparisons", not in the mathematical sense of a distance function. Since we are interested in using the metrics for comparison, monotone transformations (square root, logarithm, multiplication by a constant) are inconsequential and are used mainly for better interpretability (or sometimes rather for traditional reasons).

2.1 Mean Absolute Error
This basic metric considers the absolute differences between predictions and answers: $\mathit{MAE} = \frac{1}{n} \sum_{i=1}^{n} |c_i - p_i|$. This is not a suitable performance metric, because it prefers models which are biased towards the majority results. As a simple illustration, consider a simulated student who answers correctly with a constant probability of 0.7. If we compare different constant predictors with respect to this metric, we get that the best model is the one which predicts the probability of a correct answer to be 1. This is clearly not a desirable result. As this example illustrates, the use of MAE can lead to rather misleading conclusions. Despite this clear disadvantage, MAE is sometimes used for evaluation (although mostly in combination with other metrics, which reduces the risk of misleading conclusions in published papers).
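For concreteness, the constant-predictor example can be reproduced with the following short simulation (Python; the simulation setup and function names are purely illustrative, not taken from any particular system):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated student: answers correctly with constant probability 0.7.
    answers = rng.binomial(n=1, p=0.7, size=10_000)

    def mae(predictions, correct):
        """Mean absolute error between predicted probabilities and 0/1 answers."""
        return np.mean(np.abs(correct - predictions))

    # Compare constant predictors p = 0.0, 0.1, ..., 1.0.
    for p in np.linspace(0, 1, 11):
        preds = np.full(answers.shape, p)
        print(f"constant prediction {p:.1f}: MAE = {mae(preds, answers):.3f}")

    # The MAE-optimal constant prediction is 1.0 (expected MAE 0.3),
    # not the true probability 0.7 (expected MAE 0.42): the metric is
    # biased towards the majority result, as discussed above.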
2.2 Root Mean Square Error
A similar metric is obtained by using squared values instead of absolute values: $\mathit{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (c_i - p_i)^2}$. Note that from the perspective of model comparison, the only important part is the sum of squared errors (SSE). The square root in RMSE is traditionally used to get the result in the same units as the original "measurements" and thus to improve the interpretability of the resulting number. In the particular context of student modeling and the evaluation of probabilities, this is not particularly useful, since the resulting numbers are hard to interpret anyway. In order to get better interpretability, researchers sometimes use the $R^2$ metric: $R^2 = 1 - \sum_{i=1}^{n} (c_i - p_i)^2 / \sum_{i=1}^{n} (c_i - \bar{c})^2$. With respect to the comparison of models, $R^2$ is equivalent to RMSE, since here again the only model-dependent part is the sum of squared errors. In the context of standard linear regression (where it is most commonly used), $R^2$ has a nice interpretation as "explained variability". In the case of logistic regression (which is more similar to student models) this interpretation does not hold and different "pseudo $R^2$" metrics are used (e.g., Cox and Snell, McFadden, Nagelkerke). Thus a disadvantage of $R^2$ is that unless the authors are explicit about which version of $R^2$ they use (usually they are not), a reader cannot know for sure which metric is reported.

In educational data mining the use of the RMSE metric is very common (it was also used as a metric in KDD Cup 2010, focused on student performance evaluation). In other areas, particularly in meteorology, the mean square error (RMSE without the square root) is called the Brier score [1]. The Brier score is often decomposed into additive components (e.g., reliability and refinement) which provide further insight into the behaviour of the predictor. Moreover, in an analogy to the AUC metric and the ROC curve (described below), this metric can be interpreted as the area under Brier curves. These methods may provide interesting inspiration for student modeling.

2.3 Metrics Based on Likelihood
The likelihood of the data (the answers) given a model (predicted probabilities) is $L = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$. Since we are indifferent to monotone transformations, we typically work with the numerically more stable logarithm of the likelihood: $\mathit{LL} = \sum_{i=1}^{n} c_i \log(p_i) + (1 - c_i) \log(1 - p_i)$. This metric can also be interpreted from an information-theoretic perspective as a measure of the data compression provided by a model [4].

The log-likelihood metric can be further extended into metrics like the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These metrics penalize a large number of model parameters and thus aim to avoid overfitting. In the context of student modeling it is typically much better to address the issue of overfitting by cross-validation. Since AIC and BIC provide a faster way to assess models than cross-validation, they may be useful as heuristics in some algorithms (e.g., learning factor analysis), but they are not serious contenders for proper model comparison.

MAE, RMSE, and LL all have the form of a "sum of penalties for individual errors" and differ only in the function which specifies the penalty. For RMSE and LL the values of the penalty functions are quite similar; the main difference is in the interval [0.95, 1], i.e., in cases where the predictor is confident and wrong. These cases are penalized very prohibitively by LL, whereas RMSE is relatively benevolent. In fact, the LL metric is unbounded, so a single wrong prediction (if it is too confident) can ruin the performance of a model. This property is usually undesirable and an artificial bound is used. This corresponds to basically forcing the possibility of slip and guess behaviour into the model. After this modification the penalties for RMSE and LL are rather similar. Nevertheless, the LL approach of "penalizing mainly predictions which are confident and wrong" is reasonable, and thus it is rather surprising that this metric is used only marginally in the evaluation of student models (it is used mostly in connection with AIC or BIC).
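The metrics discussed in this and the previous subsection can be summarised in a few lines of code. The following Python sketch is illustrative only (the function names, variable names, and the value of the bound are our choices); it includes the artificial bound on predictions that keeps LL finite:

    import numpy as np

    def rmse(predictions, correct):
        # For model comparison only the sum of squared errors matters;
        # the square root and the 1/n factor are monotone transformations.
        return np.sqrt(np.mean((correct - predictions) ** 2))

    def r_squared(predictions, correct):
        # Equivalent to RMSE for ranking models: the only model-dependent
        # part is again the sum of squared errors.
        sse = np.sum((correct - predictions) ** 2)
        sst = np.sum((correct - np.mean(correct)) ** 2)
        return 1 - sse / sst

    def log_likelihood(predictions, correct, bound=0.01):
        # Clipping predictions to [bound, 1 - bound] implements the artificial
        # bound discussed above: a single confident wrong prediction can no
        # longer drive the metric to minus infinity.
        p = np.clip(predictions, bound, 1 - bound)
        return np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))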
2.4 Area Under an ROC Curve
Another popular metric is based on the receiver operating characteristic (ROC) curve. If we want to classify predictions into just two discrete classes (correct, incorrect), we need to select a threshold for the classification. For a fixed threshold we can compute standard metrics like precision, recall, and accuracy. If we do not want to use a fixed threshold, we can use the ROC curve, which summarises the behaviour of the prediction model over all possible thresholds. The curve has the "false positive rate" on the x-axis and the "true positive rate" on the y-axis; each point of the curve corresponds to a choice of threshold. The area under the ROC curve (AUC) provides a summary performance measure across all possible thresholds. It is equal to the probability that a randomly selected correct answer has a higher predicted score than a randomly selected incorrect answer. The area under the curve can be approximated using the A' metric, which is equivalent to the well-studied Wilcoxon statistic [2]. This connection provides ways to study the statistical significance of results (but requires attention to the assumptions of the tests, e.g., independence).

The ROC curve and the AUC metric are successfully used in many different research areas, but their use is sometimes also criticised [3], e.g., because the metric summarises performance over all possible thresholds, even over those for which the classifier would never be used in practice. From the perspective of student modeling, the main reservation seems to be that this approach focuses on classification and considers predictions only in a relative way; note that if all predictions are divided by 2, the AUC metric stays the same.
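Both the probabilistic interpretation of AUC and its indifference to the absolute scale of the predictions can be seen in a short computation. The following Python sketch is illustrative only (it uses a quadratic pairwise comparison rather than the rank-based formula one would use for large data):

    import numpy as np

    def auc(predictions, correct):
        # Probability that a randomly chosen correct answer gets a higher
        # prediction than a randomly chosen incorrect answer, with ties
        # counted as 1/2; this equals the normalized Wilcoxon statistic.
        pos = predictions[correct == 1]
        neg = predictions[correct == 0]
        greater = np.sum(pos[:, None] > neg[None, :])
        ties = np.sum(pos[:, None] == neg[None, :])
        return (greater + 0.5 * ties) / (len(pos) * len(neg))

    rng = np.random.default_rng(0)
    c = rng.binomial(1, 0.6, size=1000)
    p = np.clip(0.4 * c + 0.6 * rng.random(1000), 0, 1)  # noisy predictions
    print(auc(p, c), auc(p / 2, c))  # identical: AUC ignores the absolute scale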
In the context of student modeling we are usually not interested in classification; we are often interested directly in the absolute values of probabilities, and we need these values to be properly calibrated. The probabilities are often compared to a fixed constant (typically 0.95) as an indication of a mastered skill, and the specific value is meant to carry a certain meaning. Probabilistic estimates can also be used to guide the behaviour of a system to achieve a suitable challenge for students, e.g., by choosing questions of the right difficulty or by modifying difficulty through the number of options in multiple choice questions.

Nevertheless, despite this disadvantage, AUC is widely used for the evaluation of student models, often as the only metric. It seems that in some cases AUC is used as the only metric for the final evaluation, while the parameter fitting procedure (implicitly) uses a different metric (RMSE or LL). Particularly in cases of brute force fitting this approach seems strange and should at least be explicitly mentioned.

3. REFERENCES
[1] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
[2] J. Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proc. of Graphics Interface 2005, pages 129–136, 2005.
[3] J. M. Lobo, A. Jiménez-Valverde, and R. Real. AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2):145–151, 2008.
[4] M. S. Roulston and L. A. Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6), 2002.