<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Actionable Explanations for Student Success Prediction Models: A Benchmark Study on the Quality of Counterfactual Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mustafa Cavus</string-name>
          <email>mustafacavus@eskisehir.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Kuzilek</string-name>
          <email>jakub.kuzilek@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eskisehir Technical University, Department of Statistics</institution>
          ,
          <addr-line>Eskisehir</addr-line>
          ,
          <country country="TR">Turkiye</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt University of Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>14</volume>
      <issue>2024</issue>
      <abstract>
        <p>Digital transformation in higher education has resulted in a surge of information technology solutions suited to the needs of academia. The massive use of digital technology in education leads to the production of vast amounts of education and learner-related data, enabling advanced data analysis methods to explore and support the learning processes. When focusing on supporting at-risk students, the dominant research focuses on predicting student success. Enabling prediction models to help at-risk students requires both a reliable technical solution and a transparent, explainable solution to build trust among the target learners and educators. Counterfactual explanations (aka counterfactuals) from explainable machine learning tools promise to enable trustworthy explainable models, provided the features are actionable and causal. However, determining the most suitable counterfactual generation method for student success prediction models remains unexplored. This study evaluates standard counterfactual methods: Multi-Objective Counterfactual Explanations, Nearest Instance Counterfactual Explanations, and What-If Counterfactual Explanations. The methods are evaluated using a black-box machine learning model trained on the Open University Learning Analytics Dataset, demonstrating their practical usefulness and suggesting concrete steps for altering model predictions. Our results indicate that the Nearest Instance Counterfactual Explanation method based on the sparsity metric provides the best results regarding several quality criteria. Detailed statistical analysis finds statistically significant differences between all methods except between the Nearest Instance Counterfactual Explanation and the Multi-Objective Counterfactual Explanation methods, which suggests that these methods might be interchangeable in the context of the given dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>counterfactual explanations</kwd>
        <kwd>explainable artificial intelligence</kwd>
        <kwd>contrastive explanations</kwd>
        <kwd>learning analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The pace of digital transformation in higher education
has increased over the past decade. With this increase, the data
generated by learners, lecturers, and educational institutions
have multiplied. This data growth enabled the use of advanced
Data Science methods for analysis within the field of
Learning Analytics [<xref ref-type="bibr" rid="ref1">1</xref>]. With the extensive use of analytical
tools in all areas of human life, concerns about security and
privacy emerged, resulting in new data protection
regulations (e.g., the GDPR in the EU) [<xref ref-type="bibr" rid="ref2">2</xref>]. Consequently, trust in advanced
analytical tools and Machine Learning methods in higher
education has been reduced. To overcome this distrust, a
new approach called Trusted Learning Analytics (TLA) emerged
[<xref ref-type="bibr" rid="ref3">3</xref>]. The TLA approach emphasizes using &#8216;white-box&#8217;
Machine Learning (ML) methods and systems. Within this
focus, Explainable Artificial Intelligence (XAI) methods
play a crucial role because they unlock the potential of
&#8216;black-box&#8217; models for use within TLA systems [<xref ref-type="bibr" rid="ref3">3</xref>].
      </p>
      <p>
        A typical task in Learning Analytics (LA) is the predictive
modelling of learner success, which enables identifying the
learners needing help with their studies [<xref ref-type="bibr" rid="ref4">4</xref>]. The ML model
is trained on historical data collected within the same
educational context. This model is then used as a trigger
for educational interventions to support learners in need (e.g.,
[<xref ref-type="bibr" rid="ref5">5</xref>], [<xref ref-type="bibr" rid="ref6">6</xref>], or [<xref ref-type="bibr" rid="ref7">7</xref>]).
      </p>
      <p>
        In the ML modelling process, black-box models, known
for their high predictive accuracy, are often preferred over
interpretable models [<xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>]. XAI tools are
primarily used to make such black-box models explainable.
      </p>
      <p>
        The use of counterfactual explanations in LA has been
explored in several studies [<xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>]. Yet the focus of
these studies is on delivering actionable insights to the
relevant stakeholders; none of them has investigated the quality
of the generated counterfactual explanations. Facing numerous
counterfactual explanations, due to the nature of the underlying
optimization problems, requires selecting those explanations
that fulfil specific criteria beneficial for the stakeholder.
Because of differences in their backgrounds, challenges, and
needs, each learner requires a personalized counterfactual
[<xref ref-type="bibr" rid="ref18">18</xref>]. Thus, there are several desired quality
measures that a counterfactual explanation must satisfy.
      </p>
      <p>
        To explore how a typical ML black-box model trained
for the predictive modelling of student success can be explained
within the frame of TLA, we employed the open-access Open
University Learning Analytics Dataset (OULAD) [<xref ref-type="bibr" rid="ref19">19</xref>] to
answer the following research questions:
      </p>
      <p>RQ1: What is the most appropriate method for generating
the counterfactual explanations?</p>
      <p>RQ2: What is the most relevant quality measure of the
methods for generating counterfactual explanations?</p>
      <p>
        This study compares the quality of different
counterfactual generation methods for students whom the success
prediction model developed on the OULAD predicts to fail. It
is essential in two ways: (1) a missing evaluation
of counterfactual quality can lead to inefficient
explanations, which may compromise their trustworthiness [<xref ref-type="bibr" rid="ref20">20</xref>],
and (2) there is no uniformly better method across domains
[<xref ref-type="bibr" rid="ref21">21</xref>], and this is the first such benchmark in the domain of LA.
      </p>
      <p>The remainder of the paper introduces our approach to
analysing and selecting the most appropriate counterfactual
generation method, followed by the results and their
discussion. Finally, the conclusions are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>Dataset. We employed the OULAD dataset released by the
Open University, the largest distance learning institution in
the United Kingdom, to analyse counterfactual generation
methods. Typical courses at the OU take approximately
nine months and consist of multiple assignments and a
final exam. The most crucial assignments are Tutor Marked
Assignments (TMAs), which represent milestones in the
course schedule. The dataset contains data about
learners&#8217; demographics, assessment results, and interactions with a
Moodle-like Learning Management System (LMS). For the
analysis, we selected the STEM course FFF and its presentation
2013J, studied by 2283 students. The course contains five
TMAs in weeks 2, 5, 13, 18, and 24. The last TMA was used
as the target variable for model training. Learners can achieve
scores from 0 to 100; we set the threshold for passing to 40
points. The following groups of students were excluded
from the dataset: actively withdrawn students (n = 675)
and students who did not submit all TMAs (n = 500). The
resulting dataset contains the data of 1108 students. It
consists of 14 predictors, of which 6 are categorical variables
encoded numerically. The online interactions of
learners with the LMS (i.e., the &#8216;n_clicks_xy&#8217; variables) have been
computed for the top five most common activity types in
the VLE, and they represent 95% of all student click-stream
data. Table 1 presents the details of the selected variables.</p>
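        <p>To make the cohort construction concrete, the following is a minimal,
hypothetical R sketch of the filtering described above; the column names
(withdrawn, n_submitted_tmas, last_tma_score) are illustrative placeholders,
not the actual OULAD schema.</p>
        <preformat>
library(dplyr)

# Hypothetical cohort filtering; column names are illustrative only.
students = students_raw |>
  filter(!withdrawn,                # drop actively withdrawn students (n = 675)
         n_submitted_tmas == 5) |>  # drop students who did not submit all TMAs (n = 500)
  mutate(target = ifelse(last_tma_score >= 40, "pass", "fail"))
        </preformat>
      </sec>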
      <sec id="sec-2-1">
        <title>2.2. Counterfactual Explanations</title>
        <p>Let X = [x_1, x_2, ..., x_n] be a data matrix of n observations
of p variables, and let y be the response vector. The goal in
predictive modelling is to find a model f : X &#8594; y that minimizes
the expected value of a loss function L. A counterfactual
x&#8242; &#8712; &#8477;^p of an observation x &#8712; &#8477;^p is calculated through an
optimization problem:
x&#8242; = argmin_{x&#8242; &#8712; &#8477;^p} L(f(x&#8242;), y&#8242;) + d(x, x&#8242;)   (1)
where &#8477;^p denotes the p-dimensional real space, L denotes
a loss function that penalizes the deviation of the prediction
f(x&#8242;) from the desired outcome y&#8242;, and d represents a
distance function between the observation and its
counterfactual. A counterfactual explanation can be briefly defined
as the necessary changes in one or more variables to
flip the model prediction. The distance function d controls
the distance between the target observation and the
counterfactual. Figure 1 illustrates a counterfactual generation
example: the value of the variable x_3 must be changed to
x&#8242;_3 to flip the model&#8217;s prediction y to y&#8242;. To illustrate this in
the context of the OULAD dataset, an at-risk student can
pass the course if the student increases assessment results or
the total number of clicks in the discussion forum before the
final exam.</p>
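        <p>As an illustration of Eq. (1), the following minimal R sketch evaluates
the counterfactual objective for a candidate x&#8242; using a squared loss and the
Gower distance; the function name, the choice of loss, and the weighting term
lambda are our own illustrative assumptions, not the implementation used in
this paper.</p>
        <preformat>
library(cluster)  # daisy() provides the Gower distance

# Illustrative evaluation of Eq. (1): L(f(x'), y') + d(x, x').
# `x` and `x_prime` are single-row data frames; `f` is any prediction
# function returning a numeric score.
counterfactual_objective = function(x_prime, x, f, y_target, lambda = 1) {
  pred_loss = (f(x_prime) - y_target)^2                                # L(f(x'), y')
  dist_term = as.numeric(daisy(rbind(x, x_prime), metric = "gower"))   # d(x, x')
  pred_loss + lambda * dist_term
}
        </preformat>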
        <p>
          Counterfactuals aim to minimize the distance between
the target observation and the counterfactual; however,
there are further desired properties for a counterfactual explanation
[<xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>]. Sparsity advocates for a minimal number of
variable alterations, thereby maintaining simplicity.
Minimality focuses on the smallest possible changes in
variable values. Validity is maintained by minimizing the
disparity between the counterfactual instance, denoted
as x&#8242;, and the observation x while ensuring the model
output aligns with the desired label y&#8242;. Proximity denotes
the necessity of a slight divergence between the factual
and counterfactual features. Plausibility mandates that
counterfactual explanations remain realistic and adhere
closely to the underlying data distribution. There are more
than 120 known counterfactual generation methods; see
[<xref ref-type="bibr" rid="ref24">24</xref>] for details. However, we considered three commonly
used counterfactual methods to make comparing the quality
of counterfactuals feasible.
        </p>
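        <p>For intuition, simplified versions of some of these properties can be
written as small scoring functions; the definitions below are our own
simplifications for numeric feature vectors, not the exact metrics computed
by the evaluation package used later.</p>
        <preformat>
# Illustrative, simplified quality measures for a counterfactual x' of x
# (numeric feature vectors); lower is better for each score.
sparsity  = function(x, x_prime) sum(x != x_prime)       # number of changed variables
proximity = function(x, x_prime) sum(abs(x - x_prime))   # L1 distance between x and x'
validity  = function(f, x_prime, y_target) as.integer(f(x_prime) != y_target)  # 0 if valid
        </preformat>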
        <p>
          What-if counterfactual explanations. The what-if method
(WhatIf) finds the observation closest to the observation
x among the other observations in terms of the Gower
distance d, solving the following optimization problem [<xref ref-type="bibr" rid="ref25">25</xref>]:
x&#8242; = argmin_{x&#771; &#8712; X} d(x, x&#771;)   (2)
        </p>
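        <p>A minimal sketch of this idea follows, assuming the candidate pool is
restricted to observed rows already predicted as the desired class; all
function and object names are illustrative.</p>
        <preformat>
library(cluster)

# WhatIf-style search: among observed rows predicted as the desired
# class, return the single row closest to x in Gower distance.
whatif_cf = function(x, X_obs, preds, desired) {
  candidates = X_obs[preds == desired, , drop = FALSE]
  d = as.matrix(daisy(rbind(x, candidates), metric = "gower"))[1, -1]
  candidates[which.min(d), , drop = FALSE]
}
        </preformat>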
        <p>
          Multi-objective counterfactual explanations. The
multi-objective counterfactual explanations method (MOC)
aims to find counterfactuals that jointly satisfy
validity, proximity, sparsity, and plausibility by solving a
multi-objective optimization problem [<xref ref-type="bibr" rid="ref26">26</xref>]:
x&#8242; = argmin_{x&#8242;} [o_1(f&#770;(x&#8242;), y&#8242;), o_2(x, x&#8242;), o_3(x, x&#8242;), o_4(x&#8242;, X)]   (3)
where the objectives o_1, ..., o_4 correspond to the desired
properties validity, proximity, sparsity, and plausibility,
respectively. Thus, it generates valid, proximal, sparse, and
plausible counterfactuals.
        </p>
        <p>
          Nearest instance counterfactual explanations. The
nearest instance counterfactual explanations method (NICE)
finds the observations most similar to the observation in
terms of the Heterogeneous Euclidean-Overlap Metric [<xref ref-type="bibr" rid="ref27">27</xref>].
The NICE method offers two options for its objective
function, based on the properties proximity and sparsity,
and we use it in both of these ways.
        </p>
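        <p>As a rough illustration, the Heterogeneous Euclidean-Overlap Metric
combines a 0/1 overlap for categorical features with a range-normalized
absolute difference for numeric ones; the sketch below is our own simplified
rendering, not the package implementation.</p>
        <preformat>
# Simplified HEOM: range-normalized absolute difference for numeric
# features, 0/1 overlap for categorical ones, combined Euclidean-style.
# `a` and `b` are lists of feature values; `ranges` gives numeric ranges.
heom = function(a, b, ranges) {
  per_feature = mapply(function(ai, bi, ri) {
    if (is.numeric(ai)) abs(ai - bi) / ri else as.numeric(ai != bi)
  }, a, b, ranges)
  sqrt(sum(per_feature^2))
}
        </preformat>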
        <p>
          The WhatIf method generates valid, proximal, and
plausible counterfactuals. It has been shown that the MOC method
generates more counterfactuals than other counterfactual
methods, counterfactuals that are closer to the training data
and require fewer feature changes [<xref ref-type="bibr" rid="ref26">26</xref>]. Moreover, NICE generates
proximal counterfactuals. However, there is no uniformly
better method across datasets from different domains [<xref ref-type="bibr" rid="ref21">21</xref>].
Thus, evaluating the quality of the generated
counterfactuals is necessary, and we conduct the experiments in the
following section.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Experiment design</title>
        <p>
          This study focuses on which method provides the highest
quality counterfactual explanations for the student success
prediction model trained using the OULAD dataset. Thus,
our approach is (1) selecting the most appropriate ML model,
(2) generating the counterfactuals, and (3) producing the
evaluation criteria. Modeling. We used forester [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] for
model selection and hyperparameter optimization. It is an
AutoML tool that adjusts the hyperparameters of tree-based
models using Bayesian optimization. The reason for
using this tool instead of manual modelling is its ability to
make Bayesian optimization highly practical with its
relevant parameters. Additionally, the fact that tree-based
models exhibit lower prediction performance than alternative
complex models in classifying tabular datasets [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] supports
the idea that using this tool does not limit model selection.
The number of optimization rounds bayes_iter is taken
as 5, and the number of trained models random_evals is
taken as 10 in the AutoML tool, respectively. forester
returns 28 models, including decision trees, random forests,
XGBoost, LightGBM, and their fine-tuned versions with
Bayesian optimization and random search in Table 2.
Because the best-performing one is a fine-tuned random forest
model with random search —accuracy 0.900, AUC 0.771, and
F1 0.946— the counterfactuals are generated on it.
        </p>
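        <p>A minimal sketch of this AutoML step is shown below; the bayes_iter
and random_evals values follow the text, while the data object and target
column name are hypothetical placeholders, and the exact forester interface
may differ across package versions.</p>
        <preformat>
library(forester)

# Hedged sketch of the model-selection step; `oulad_data` and the
# target column name are hypothetical placeholders.
output = train(data         = oulad_data,
               y            = "last_tma_pass",
               bayes_iter   = 5,    # number of Bayesian optimization rounds
               random_evals = 10)   # number of randomly evaluated models
        </preformat>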
        <p>
          Counterfactual generation. We used the
counterfactuals package [<xref ref-type="bibr" rid="ref21">21</xref>] to generate the
counterfactual explanations for the at-risk students using the
counterfactual generation methods WhatIf,
proximity-based NICE (NICE_pr), sparsity-based NICE (NICE_sp), and
MOC. The non-actionable variables that are impossible to
change are kept constant, namely gender, disability,
region, age_band, education, imd_band,
num_of_prev_attempts, and cummulative_assessment_results.
The MOC, NICE_pr, NICE_sp, and WhatIf methods generate
191, 39, 19, and 120 counterfactuals, respectively, for the
12 students predicted to fail by the student success prediction
model. It is essential to compare the counterfactual generation
methods in terms of the number of generated counterfactuals
because it shows the diversity of alternative ways to flip the
model decision; a higher number of counterfactuals is better.
The materials for reproducing the experiments
and the dataset are accessible in the following
repository: https://github.com/mcavs/HEXED2024_paper.
        </p>
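        <p>The generation step can be sketched with the counterfactuals package
as follows; the interface shown (an iml Predictor wrapped by WhatIfClassif,
NICEClassif, and MOCClassif objects with a find_counterfactuals() method)
follows the package vignette, but the argument values and the objects model,
train_data, and x_interest are placeholders.</p>
        <preformat>
library(counterfactuals)
library(iml)

# Hedged sketch following the counterfactuals-package vignette;
# `model`, `train_data`, and `x_interest` are placeholders.
predictor = Predictor$new(model, data = train_data, y = "target")

whatif  = WhatIfClassif$new(predictor, n_counterfactuals = 5)
nice_sp = NICEClassif$new(predictor, optimization = "sparsity")
moc     = MOCClassif$new(predictor)

cfs = nice_sp$find_counterfactuals(x_interest,
                                   desired_class = "pass",
                                   desired_prob  = c(0.5, 1))
cfs$data  # candidate counterfactuals for the at-risk student
        </preformat>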
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and discussion</title>
      <p>
        The quality metrics minimality, plausibility, proximity,
sparsity, validity are calculated to evaluate the generated
counterfactuals by the methods WhatIf, NICE_pr, NICE_sp, and
MOC. It should be highlighted that the lower values are
better for each metric. Some user studies have shown that
the users prefer to use the counterfactuals, which perform
well on the criteria in [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ]. Thus, we compared their
qualities in two steps. First, we used the average values and
the standard deviations of these metrics given in Table 3,
and second, we compared the distribution of the results in
Figure 2.
      </p>
      <p>It can be seen that the quality of the counterfactuals is quite good
in terms of proximity, plausibility, and validity. However,
the results are not promising for WhatIf in minimality and
sparsity. This is expected because the WhatIf
method is known to generate valid, proximal, and plausible
counterfactuals. Therefore, we do not recommend using this method
in this domain. On the other hand, the counterfactuals
generated by the NICE method that optimizes based on sparsity
showed better results in sparsity and the other quality metrics
than the one that optimizes based on proximity. There are
differences between NICE_pr and NICE_sp in terms of
minimality and sparsity. NICE_sp shows better performance
because it optimizes based on sparsity, and the metrics
sparsity and minimality are closely related: sparsity refers
to the number of changed variables, while
minimality refers to the smallest possible changes in the variable
values. Therefore, using the NICE_sp method may be
preferred to obtain better-quality explanations in this domain.
Although the MOC method shows results competing with
NICE_sp, it performs worse on average.</p>
      <p>Figure 2 shows the distributions of the quality metrics of
the counterfactuals, providing deeper insights. The WhatIf
method appears to produce explanations that are not
minimal compared to the others. Although NICE_pr was
better than the WhatIf method in this regard, it performed
worse than the other methods. When the methods are
compared in terms of plausibility, WhatIf
is better than the others, but the difference is small. While
the WhatIf method produced less proximal explanations,
the other methods produced proximal explanations at a similar
level. A similar pattern against WhatIf is also
observed for sparsity. As expected, the NICE_sp method
shows the best performance in terms of sparsity.
Surprisingly, no method other than MOC produced non-valid
explanations; this is the most problematic quality feature
for MOC. An intriguing observation is that the quality of the
counterfactuals generated by MOC is better than that of
NICE_pr in terms of proximity, even though the NICE_pr
method aims to create proximal counterfactuals.</p>
      <p>In summary, the qualities of the explanations produced
by the methods compete with each other in terms of both
average and distributional properties, and it is not possible
to say that the NICE_sp method produces the best quality
explanations based on visual outputs alone. Therefore,
using the Kruskal-Wallis test and pairwise Wilcoxon tests,
we statistically test whether the explanations produced by the
methods differ. A Kruskal-Wallis test was performed on the
quality metric values of the four methods (MOC, NICE_pr,
NICE_sp, and WhatIf). The differences between the rank
totals of the methods were significant, &#967;&#178;(24) = 48.823, p &lt; .001.
Post hoc comparisons were conducted using Wilcoxon tests
with a Benjamini-Hochberg adjusted alpha level of .016. The
difference between MOC and NICE_pr was not
statistically significant (p = .115). The other comparisons were
significant. The results of the statistical tests support the
previous findings.</p>
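      <p>In R, this testing procedure can be sketched as follows, assuming a
hypothetical long-format data frame res with one row per generated
counterfactual and columns value (the quality-metric value) and method.</p>
      <preformat>
# Omnibus test across the four methods.
kruskal.test(value ~ method, data = res)

# Post hoc pairwise Wilcoxon tests with Benjamini-Hochberg adjustment.
pairwise.wilcox.test(res$value, res$method, p.adjust.method = "BH")
      </preformat>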
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this study, we explored the possibilities of using XAI tools
in the frame of TLA research. Our research focused on
deploying counterfactual explanation methods on the
OULAD dataset, containing the demographics, results, and
learner interactions with the LMS, to answer the following
research questions: 1) What is the most appropriate method
for generating the counterfactual explanations? The selection
of the most suitable method depends on the stakeholder
requirements and the educational context. However,
selecting the most appropriate method is generally guided
by evaluating standard counterfactual properties: sparsity,
validity, proximity, and plausibility. The evaluation of our
approach on the OULAD dataset resulted in the finding that
explanations generated using the NICE method based on
sparsity are of higher quality in terms of all considered
metrics than explanations generated through the other methods
(Table 3). 2) What is the most relevant quality measure of
the methods for generating counterfactual explanations? As
mentioned before, selecting a method depends highly on the
educational setting. Yet, it might be defined by the relevant
stakeholder as the most essential criterion chosen from those
used as standard evaluation measures. In addition, the
statistical hypothesis testing results indicate no statistically
significant difference between the Nearest Instance
Counterfactual Explanation and the Multi-Objective Counterfactual
Explanation methods, which indicates the need for
deep validation of the generated counterfactual
explanations for at-risk students to avoid misconceptions. This
suggests that a human-in-the-loop is needed even when
selecting the optimal method in technical validation.
In addition, the counterfactuals provide a simple way to
understand and uncover issues in learners&#8217; learning and
open the path to recommendations for possible educational
interventions. Finally, the study has some limitations. Due
to the focus of the study, data drift was not considered, and
only the most common counterfactual explanation methods
were used. Furthermore, we believe that conducting
qualitative studies, rather than evaluating the explanations solely
on quality metrics, would provide further validation for the
findings.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work in this paper is supported by the German
Federal Ministry of Education and Research (BMBF), grant no.
16DHBKI045.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Siemens</surname>
          </string-name>
          , R. S. d. Baker,
          <article-title>Learning analytics and educational data mining: towards communication and collaboration</article-title>
          ,
          <source>in: Proceedings of the 2nd international conference on learning analytics and knowledge</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          , W. Chen,
          <article-title>The influence of data protection and privacy frameworks on the design of learning analytics systems</article-title>
          ,
          <source>in: Proceedings of the seventh international learning analytics &amp; knowledge conference</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Drachsler</surname>
          </string-name>
          ,
          <article-title>Trusted learning analytics</article-title>
          ,
          <source>Universität Hamburg</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Papamitsiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Economides</surname>
          </string-name>
          ,
          <article-title>Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence</article-title>
          ,
          <source>Journal of Educational Technology &amp; Society</source>
          <volume>17</volume>
          (
          <year>2014</year>
          )
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Pistilli</surname>
          </string-name>
          ,
          <article-title>Course signals at purdue: Using learning analytics to increase student success</article-title>
          ,
          <source>in: Proceedings of the 2nd international conference on learning analytics and knowledge</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Waheed</surname>
          </string-name>
          , S.-U. Hassan,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Aljohani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hardman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alelyani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nawaz</surname>
          </string-name>
          ,
          <article-title>Predicting the academic performance of students from vle big data using deep learning models</article-title>
          ,
          <source>Computers in Human behavior 104</source>
          (
          <year>2020</year>
          )
          <fpage>106189</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Adnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mussadiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bashir</surname>
          </string-name>
          , S. U. Khan,
          <article-title>Predicting at-risk students at different percentages of course length for early intervention using machine learning models</article-title>
          ,
          <source>Ieee Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>7519</fpage>
          -
          <lpage>7539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 51</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Biecek</surname>
          </string-name>
          , T. Burzykowski,
          <article-title>Explanatory model analysis: explore, explain, and examine predictive models</article-title>
          ,
          <source>Chapman and Hall/CRC</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saranti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biecek</surname>
          </string-name>
          , W. Samek,
          <article-title>Explainable ai methods-a brief overview</article-title>
          , in: International workshop on extending explainable
          <source>AI beyond deep models and classifiers</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <article-title>Interpretable machine learning</article-title>
          ,
          <source>Lulu.com</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>Applied Machine Learning Explainability Techniques: Make ML models explainable and trustworthy for practical applications using LIME, SHAP, and more</article-title>
          ,
          <source>Packt Publishing Ltd</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cavus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biecek</surname>
          </string-name>
          ,
          <article-title>Glocal explanations of expected goal models in soccer</article-title>
          ,
          <source>arXiv preprint arXiv:2308.15559</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Artelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <article-title>On the computation of counterfactual explanations-a survey</article-title>
          ,
          <source>arXiv preprint arXiv:1911.07749</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsiakmaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ragos</surname>
          </string-name>
          ,
          <article-title>A case study of interpretable counterfactual explanations for the task of predicting student academic performance</article-title>
          ,
          <source>in: 2021 25th International Conference on Circuits, Systems, Communications, and Computers (CSCC)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>120</fpage>
          -
          <lpage>125</lpage>
          . doi:10.1109/CSCC53858.2021.00029.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Visual analytics of potential dropout behavior patterns in online learning based on counterfactual explanation</article-title>
          ,
          <source>Journal of Visualization</source>
          <volume>26</volume>
          (
          <year>2023</year>
          )
          <fpage>723</fpage>
          -
          <lpage>741</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Afrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thevathyan</surname>
          </string-name>
          ,
          <article-title>Exploring counterfactual explanations for predicting student success</article-title>
          ,
          <source>in: International Conference on Computational Science</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. I.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chimedza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Bührmann</surname>
          </string-name>
          ,
          <article-title>Individualized help for at-risk students using model-agnostic and counterfactual explanations</article-title>
          ,
          <source>Education and Information Technologies</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuzilek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hlosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zdrahal</surname>
          </string-name>
          , Open university learning analytics dataset,
          <source>Scientific data 4</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Artelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vaquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Velioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brinkrolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schilling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <article-title>Evaluating robustness of counterfactual explanations</article-title>
          ,
          <source>in: 2021 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>01</fpage>
          -
          <lpage>09</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hofheinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Binder</surname>
          </string-name>
          , G. Casalicchio,
          <source>counterfactuals: An R Package for Counterfactual Explanation Methods</source>
          ,
          <year>2023</year>
          .
          <source>R package version 0.1.2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations without opening the black box: Automated decisions and the gdpr</article-title>
          ,
          <source>Harv. JL &amp; Tech. 31</source>
          (
          <year>2017</year>
          )
          <fpage>841</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.-H.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Barthe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Balle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Valera</surname>
          </string-name>
          ,
          <article-title>Model-agnostic counterfactual explanations for consequential decisions</article-title>
          ,
          <source>in: International conference on artificial intelligence and statistics</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>895</fpage>
          -
          <lpage>905</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Warren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gueret</surname>
          </string-name>
          , E. Delaney,
          <article-title>Explaining groups of instances counterfactually for xai: a use case, algorithm and user study for group counterfactuals</article-title>
          ,
          <source>arXiv preprint arXiv:2303.09297</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wexler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pushkarna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolukbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          , J. Wilson,
          <article-title>The what-if tool: Interactive probing of machine learning models</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>26</volume>
          (
          <year>2019</year>
          )
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          ,
          <article-title>Multiobjective counterfactual explanations</article-title>
          ,
          <source>in: International Conference on Parallel Problem Solving from Nature</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>448</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brughmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Leyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <article-title>NICE: an algorithm for nearest instance counterfactual explanations</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kozak</surname>
          </string-name>
          , H. Ruczyński,
          <article-title>forester: A novel approach to accessible and interpretable automl for tree-based modeling</article-title>
          ,
          <source>in: AutoML Conference 2023 (ABCD Track)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grinsztajn</surname>
          </string-name>
          , E. Oyallon, G. Varoquaux,
          <article-title>Why do tree-based models still outperform deep learning on typical tabular data?</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>507</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Spreitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Haned</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. van der Linden</surname>
          </string-name>
          ,
          <article-title>Evaluating the practicality of counterfactual explanations</article-title>
          ,
          <source>in: Workshop on Trustworthy and Socially Responsible Machine Learning</source>
          ,
          <source>NeurIPS</source>
          <year>2022</year>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Förster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hühn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kluge</surname>
          </string-name>
          ,
          <article-title>Capturing users' reality: A novel approach to generate coherent counterfactual explanations</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>