Explore the Explanation and Consistency of Explainable
                         AI in the LBLS Data Set
                         Tiffany T.Y Hsu, Owen H.T. Lu

                         International College of Innovation, National Chengchi University, Taiwan


                                         Abstract
                                         Learning Analytics (LA) is a field focusing on analyzing educational data, utilizing machine learning. One
                                         of the most discussed topics is at-risk student prediction. However, the application of these methods for
                                         predicting students' academic behaviors has faced criticism due to concerns about context insensitivity,
                                         potentially leading to prejudice and discrimination against students. While some methods in explainable
                                         AI (xAI) have been proposed to address these issues, there remains uncertainty regarding the
                                         consistency of their results. In response, we apply two popular xAI methods, SHAP
                                         (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), to
                                         interpret the prediction models. These methods attribute the output of these models to individual
                                         features, providing a clearer understanding of how each feature contributes to the overall prediction.
                                         This approach is exemplified in the LBLS467 dataset, which includes data on 467 students’ academic
                                         performance and learning behaviors in computer programming courses, encompassing a range of
                                         metrics from programming behavior to self-regulated learning and language learning strategies.
                                         Concerning the consistency of interpretations derived from SHAP and LIME, analysis via Kendall’s tau
                                         coefficients reveals a moderate alignment in their feature weight rankings. This alignment is
                                         statistically significant (p < 0.001), indicating that it is unlikely to be a mere coincidence.

                                         Keywords
                                         Learning Analytics, Explainable AI, SHAP, LIME


                         1. Introduction
                         Learning Analytics (LA) is a research field centered on measuring, collecting, analyzing,
                         and reporting data about learners and their contexts [1]. Within this field, predicting student
                         academic achievement is a foundational and significant topic [2]. At-risk student prediction involves
                         identifying students at risk of academic failure using data-driven insights and has been used to
                         enhance web-based learning environments [3]. This process is not about labelling or categorizing
                         students; rather, it aims to foresee students’ performance in classes in advance. This foresight
                         enables educators to offer timely assistance and intervention, tailored to each student’s needs,
                         thereby enhancing their academic outcomes and experiences.
                            Machine learning is often criticized for over-generalizing and overlooking the context
                         of the individual. Reflecting on the limitations of generalizations in understanding human
                         behavior, anthropologist Clifford Geertz suggests that theories and generalizations inevitably
                         lack deep and contextual understanding of human thought. ‘Theoretical disquisitions stand far
                         from the immediacies of social life,’ he notes. ‘Any generalization or theory constructed in the
                         absence of deep understanding, not grounded in the concrete and particular, is vacuous.’ [4]. The
                         approach of at-risk student prediction has faced similar criticism of over-generalizing. Because
                         machine learning models do not reveal causal relationships between features and predictions, they
                         overlook the individuality of students. In machine learning predictions, we are confronted
                         solely with dichotomous outcomes: students are classified as either ‘at risk’ or ‘not at risk.’
                         While the purpose of such predictions is not to categorize students, the absence of interpretability
                         in these outcomes can inadvertently lead to a failure to recognize individuality and to a risk of
                         discrimination and stereotyping [5]. Explainable Artificial Intelligence (xAI) appears to be a

                         LAK-WS 2024: Joint Proceedings of LAK 2024 Workshops, March 18–19, Kyoto, Japan
                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
solution to address these concerns, helping educators understand the differences among
individual students. xAI refers to methods to explain and interpret predictions made by machine
learning models [6]. Recently, artificial intelligence has been integrated into many areas of society.
At the same time, debates surrounding AI, particularly in the context of ethics, remain active. One
of the most popular topics is transparency. In discussions about transparency, besides disclosing
training data and sources, another prevalent approach is the application of xAI to render the
decision-making processes transparent. When a model's decision process is transparent, it
becomes simpler to monitor and assess its accuracy, thereby enhancing the model's
accountability. Moreover, the comprehensible predictions offered by interpretable models play a
vital role in fostering people's acceptance and trust in the decisions made by the model [7]. In this
study, we will answer two research questions:

   RQ1: Which factors in the LBLS dataset contribute to student success, as identified by SHAP and LIME?
   RQ2: How consistent are SHAP and LIME in interpreting a student’s learning performance?

2. Literature Review
The global community has developed an extensive variety of xAI approaches, which have been
applied across various domains to interpret a wide range of machine learning models, including
several complex models that were previously considered too intricate to interpret [8]. These
advancements in xAI have enabled a deeper understanding of machine learning outputs,
enhancing transparency and trust, especially in critical sectors. In line with these developments,
a systematic review of xAI applications reveals a concentrated focus in specific sectors, notably
healthcare, industry, and transportation [9]. As for the field of education, despite the relatively
lower number of scholarly articles compared to other domains, the application of xAI has been
noted in the review. It is noteworthy that 27% of xAI applications in these articles are utilized for
decision support, which is the highest proportion of application in this context. Therefore,
employing xAI as a tool for decision support in predicting whether students are at risk is justified.
   The application of xAI in education manifests primarily in two aspects: data usage and
stakeholder engagement [10]. Application in data usage enables the explanatory models to
improve prediction models after identifying the characteristics of student success in the
classroom. In terms of stakeholder engagement, it allows teachers to adjust their teaching
methods based on the results provided by the explanations.
   Among previous studies, some research has focused on the automatic generation of
explanations in virtual learning environments. In [11], a tool was developed to generate
multimodal explanations regarding predictions of whether a student will pass or fail. The study
compared the accuracy of various classifiers. Given that most models demonstrated
high accuracy, it opted for simpler models, including J48, Rep-Tree, and RandomTree, over complex
ones like SVM to achieve a balance between accuracy and interpretability. [12] also indicates that
when models achieve high predictive accuracy, simpler models may yield higher quality
explanations. Therefore, this study follows this direction by comparing the predictive accuracy of
multiple models and selecting a simpler model for explanation, provided that its accuracy remains
high.
   According to [6], the most prominent repositories on GitHub in 2022 for xAI, as measured by
the number of stars, were slundberg/shap (SHapley Additive exPlanations) and marcotcr/lime
(Local Interpretable Model-agnostic Explanations). SHAP operates on game theory principles,
attributing a machine learning model’s output to the contributions of individual features [13].
LIME, in contrast, explains the predictions of a classifier or regressor by locally
approximating it with an interpretable model [8]. Both methods are adept at explaining
machine learning models, regardless of their complexity. Given the active community
engagement on GitHub, the high level of attention these methods have garnered, and their open-
source status, this study will incorporate both SHAP and LIME. Utilizing these approaches, we aim
to pinpoint key features that determine the classification of individual students as at-risk. We will
compare the outcomes from each method and assess the consistency between the two.

3. Methods

    3.1. LBLS Dataset
LBLS467 is a dataset that collects data on 467 students’ academic performance and learning
behaviors in computer programming courses. It encompasses students’ programming editing
behaviors, questionnaire survey results on Self-regulated Learning (SRL) and the Strategy
Inventory for Language Learning (SILL). This dataset includes a total of 208 features, covering a
wide range of learning behaviors and performance indicators. The dataset was introduced together
with a series of suggested challenges and was used in a data challenge
workshop organized by the Society for Learning Analytics Research (SoLAR) [14] [15]. 'At-risk'
has diverse definitions; in this study, we define at-risk students as those who fail or are on the
verge of failing the course. Specifically, at-risk students are those whose performance
is worse than that of at least 75% of the students in their class.
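
The labelling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_at_risk`, the input layout (a score per student, grouped by class), the use of a single final score as the performance measure, and the choice to count the student in the class size are all hypothetical and are not taken from the LBLS documentation.

```python
def label_at_risk(scores_by_class, threshold=0.75):
    """Flag students outperformed by at least `threshold` of their class.

    scores_by_class maps a class id to a {student_id: score} dict.
    All names and the input layout are illustrative assumptions.
    """
    labels = {}
    for scores in scores_by_class.values():
        class_size = len(scores)
        for student, score in scores.items():
            # Number of classmates with a strictly higher score.
            outperformed_by = sum(1 for other in scores.values() if other > score)
            labels[student] = outperformed_by / class_size >= threshold
    return labels


# Hypothetical class of eight students with made-up scores.
example = {"class_1": {f"s{i}": 10 * i for i in range(1, 9)}}
at_risk = sorted(s for s, flag in label_at_risk(example).items() if flag)
print(at_risk)
```

With these made-up scores, the two lowest-scoring students are flagged, which corresponds to the bottom quarter of the class.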

    3.2. Feature Extraction and Classification

In this study, we employ Principal Component Analysis (PCA) as our primary tool for feature extraction.
PCA, a common preprocessing step for machine learning algorithms [16], is followed by the application
of three different models, ranging from the most explainable to the least: Decision Tree, Logistic
Regression, and Support Vector Machine (SVM). We create a graph to demonstrate how model
accuracy relates to the number of PCA components, aiming to find the most accurate model for a
given component count. The most accurate model is then further analyzed using SHAP and LIME.

                                       Accuracy = \frac{TP + TN}{TP + FP + TN + FN}                                 (1)

    •   TP (True Positives): The number of correct predictions that an instance is positive.
    •   TN (True Negatives): The number of correct predictions that an instance is negative.
    •   FP (False Positives): The number of incorrect predictions that an instance is positive (actually
        negative).
    •   FN (False Negatives): The number of incorrect predictions that an instance is negative
        (actually positive).
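
As a small worked example of Equation (1), the following Python sketch computes accuracy from the four confusion-matrix counts. The function name and the counts are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as in Eq. (1): correct predictions over all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 3 true positives, 5 true negatives,
# 1 false positive, 1 false negative.
print(accuracy(tp=3, tn=5, fp=1, fn=1))
```

Here 8 of the 10 hypothetical predictions are correct, so the accuracy is 0.8.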

    3.3. SHAP and LIME
SHAP is an xAI method grounded in game theory, designed to interpret the predictions of complex
machine learning models. It employs the Shapley value to calculate the contribution of each feature
to the model’s output. This approach facilitates a comprehensive understanding of how different
features influence the model’s predictions. The weights of the features are derived from the following
[13]:

                    \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]                    (2)

        •   F : All features.
        •   |F| : The total number of features.
        •   \phi_i : The SHAP value of feature i.
        •   S : A subset of all features set F excluding the feature i.
        •   |S| : The size of subset S.
        •   f_{S \cup \{i\}}(x_{S \cup \{i\}}) : The prediction of model f when the feature set S includes the feature i.
        •   f_S(x_S) : The prediction of the model f with only the feature set S.
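
To make Equation (2) concrete, the sketch below evaluates it exactly for a toy model with three features. It is not the algorithm used by the shap library, which relies on approximations for realistic feature counts; the names `shapley_values` and `toy_value`, the feature labels, and the toy scores are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values following Eq. (2).

    value_fn(subset) returns the model output when only the features in
    `subset` (a frozenset) are present. Enumerating all subsets is only
    feasible for a handful of features.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_value(present):
    """Hypothetical model: additive effects plus one a-b interaction."""
    score = 0.2
    if "a" in present:
        score += 0.3
    if "b" in present:
        score += 0.1
    if "a" in present and "b" in present:
        score += 0.2
    if "c" in present:
        score -= 0.1
    return score


phi = shapley_values(toy_value, ["a", "b", "c"])
print(phi)
```

In this toy case the interaction term is split evenly between features a and b, and the three values sum to the difference between the output with all features and the output with none, which is the additivity property that the waterfall plots rely on.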

    SHAP provides a method to quantify the contribution to the change in prediction when feature
i is added to the model for every possible feature set S. The idea of LIME, on the other hand, is to
approximate the behavior of a complex model near the prediction of a specific instance using a
simpler model. The formula of LIME is as follows [8]:

                                         \xi(x) = \operatorname{argmin}_{g \in G} \; L(f, g, \pi_x) + \Omega(g)                              (3)

    •   g : A simple model used to approximate the behavior of the complex model f near the
        instance x.
    •   G : The set of all possible simple models.
    •   \pi_x : A weighting function that assigns higher weights to points closer to the instance x.
    •   Ω(g): A complexity measure that penalizes the model g.

    LIME begins by selecting a specific instance x (in this case, an individual student), already predicted
by a complex model. It generates a series of perturbed samples around this instance to explore the
model’s behavior locally. To approximate the behavior of the complex model in this localized region,
a simpler model, such as linear regression, is employed. The key objective is to assess the alignment
between the outputs of this simpler model, denoted as g, and the original complex model, denoted
as f , within the local context. This process is represented in the formulation by minimizing the loss
function between g and f, complemented by the minimization of g’s complexity measure.
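
The procedure described above can be sketched as follows. This is a simplified illustration and not the implementation in the marcotcr/lime package: it assumes Gaussian perturbations, an exponential distance kernel for the weighting function, and a plain weighted least-squares fit, so the complexity penalty Ω(g) is omitted. The names `lime_sketch` and `solve_linear_system` and all parameter defaults are hypothetical.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lime_sketch(f, x, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a locally weighted linear surrogate g around the instance x.

    Returns (intercept, coefficients); the coefficients play the role of
    local feature weights.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb the instance and query the complex model f.
        z = [xi + rng.gauss(0.0, scale) for xi in x]
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # proximity weight
        rows.append([1.0] + z)
        targets.append(f(z))
    # Weighted least squares via the normal equations.
    k = len(x) + 1
    ata = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
            for j in range(k)] for i in range(k)]
    atb = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
           for i in range(k)]
    coef = solve_linear_system(ata, atb)
    return coef[0], coef[1:]


# Hypothetical black-box model that happens to be linear.
black_box = lambda z: 2.0 * z[0] - 3.0 * z[1] + 1.0
intercept, local_weights = lime_sketch(black_box, [0.2, -0.4])
print(intercept, local_weights)
```

Because the hypothetical black box is itself linear, the surrogate recovers its coefficients up to floating-point error; for a nonlinear model the fitted weights would only describe its behavior near x.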
    In this study, the Logistic Regression model was trained using data transformed through
PCA. Consequently, to maintain consistency with the training data, the data samples generated by
LIME need to be transformed into the same dimensional space. A wrapper function is implemented
to facilitate this process, transforming the LIME-generated data via PCA to ensure that the data is in
the appropriate form for the trained model to process effectively. The same wrapper function is
applied to SHAP.
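
A wrapper of this kind might look like the sketch below. It assumes objects that follow the scikit-learn naming convention (`transform` on the fitted PCA, `predict_proba` on the classifier); the two stand-in classes and their numbers exist only to keep the example self-contained and are hypothetical, as is the name `make_pca_wrapped_predictor`.

```python
import math


def make_pca_wrapped_predictor(pca, model):
    """Return a function mapping original-feature samples to probabilities.

    The explainer perturbs samples in the original feature space; the
    wrapper projects them into PCA space before calling the model.
    """
    def predict_fn(samples):
        return model.predict_proba(pca.transform(samples))
    return predict_fn


class StandInProjection:
    """Hypothetical stand-in for a fitted PCA (fixed linear projection)."""

    def __init__(self, components):
        self.components = components

    def transform(self, samples):
        return [[sum(c * v for c, v in zip(comp, row)) for comp in self.components]
                for row in samples]


class StandInClassifier:
    """Hypothetical stand-in for a fitted logistic regression."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, samples):
        out = []
        for row in samples:
            z = self.bias + sum(w * v for w, v in zip(self.weights, row))
            p = 1.0 / (1.0 + math.exp(-z))
            out.append([1.0 - p, p])
        return out


predict_fn = make_pca_wrapped_predictor(
    StandInProjection([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]),
    StandInClassifier(weights=[1.0, -1.0], bias=0.0),
)
print(predict_fn([[1.0, 3.0, 2.0]]))
```

The returned `predict_fn` accepts samples in the original feature space, so the same function can be handed to either explainer as its prediction function.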

    3.4. Consistency Evaluation

SHAP and LIME operate on distinct principles to determine the contribution of each feature to the
outcome. The critical question lies in the extent of the differences in the explanations derived from
these two methods. To evaluate their consistency, our approach involved identifying features that
show statistical correlation with the predicted results, as assessed by Spearman’s correlation with a
significance threshold set at α = 0.05. This threshold was chosen to discern features significantly
correlated with the outcomes. The next step is to compare the ranks of contributions as provided by
SHAP and LIME, utilizing the Kendall’s tau for this comparative analysis.
   The Kendall correlation coefficient measures the degree of similarity between two sets of
rankings assigned to the same group of objects [18]. First, we rank the selected features based
on their influence on the prediction outcome. This process results in two sets of ranking data,
each ordering the same set of features. For each pair of features, we compare their relative
order in both ranking sets. If one feature ranks higher than the other in both sets, the pair
is deemed concordant ('consistent'); if the order is reversed between the sets, the pair is
discordant ('inconsistent'). Considering all pairs of features, the coefficient is the difference
between the number of concordant pairs and the number of discordant pairs, divided by the total
number of pairs. The formula of the Kendall correlation coefficient is as follows:
                                                \tau = \frac{n_c - n_d}{\frac{1}{2} \, n \, (n - 1)}                                                   (4)

        •   n_c : The number of concordant pairs.
        •   n_d : The number of discordant pairs.
        •   n : The sample size (here, the number of ranked features).

   The value of this coefficient ranges from -1 to 1. A value approaching 1 indicates a high level of
consistency in the rankings, while a value approaching -1 signifies a substantial degree of inconsistency
[19].
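
Equation (4) can be computed directly by counting pairs, as in the following sketch. It assumes rankings without ties, and the two example rankings are hypothetical rather than taken from the SHAP and LIME outputs of this study.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau as in Eq. (4), assuming no tied ranks.

    rank_a[k] and rank_b[k] are the ranks of the same feature k under the
    two methods being compared.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same features")
    n = len(rank_a)
    if n < 2:
        raise ValueError("at least two features are required")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks of five features under two explanation methods.
ranks_method_1 = [1, 2, 3, 4, 5]
ranks_method_2 = [2, 1, 3, 5, 4]
print(kendall_tau(ranks_method_1, ranks_method_2))
```

In this example, 8 of the 10 feature pairs are concordant and 2 are discordant, giving tau = (8 - 2) / 10 = 0.6. Library routines such as `scipy.stats.kendalltau` additionally handle ties and report a p-value.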

4. Results and Discussion

    4.1. Reply to RQ1

As illustrated in Figure 1, the accuracy assessments demonstrate that all models achieved accuracy
rates around 80%. Notably, both Logistic Regression and Decision Tree models showed remarkable
performance. Logistic Regression achieved an 84.6% accuracy rate with 16 components, while the
Decision Tree reached the same level of accuracy with 58 components.
     The final decision to focus on Logistic Regression for in-depth analysis stems from a crucial
observation. Under the premise of using PCA as a method for feature extraction, the Decision Tree
model becomes less interpretable. Initially, the Decision Tree was a preferred choice due to its well-
known ease of interpretability. However, it was crucial to assess whether its performance was
sufficiently superior to warrant detailed explanation. Upon further analysis, it was found that its
accuracy was comparable to that of the Logistic Regression model. Therefore, we decided to apply
SHAP and LIME to the Logistic Regression model.


Figure 1: Accuracy vs. Number of PCA Components

    In the results, we present SHAP’s explanation of the prediction for each individual instance in the form
of a waterfall plot. This mode of presentation is very similar to the way data is represented in LIME
results, which aids our comparison of each instance.


Figure 2: SHAP interpretation of student A, B (left to right)
     This SHAP waterfall plot illustrates how feature contributions (red and blue bars) move the model
prediction from a baseline value E[f(x)] (the average output of the model) to the final prediction f(x).
Blue bars represent features that decrease the prediction probability, while red bars indicate those
that increase it. The gray text in front of each feature name shows the value of that feature.
     In Figure 2, the model predicts Student A as at-risk with a probability value of 0.665, surpassing
the threshold for risk. Key features like ‘ADD_MEMO’, ‘srl_s_28’, and ‘srl_s_29’ positively influence
this outcome. In contrast, ‘srl_m_18’ and 192 other features collectively decrease the prediction
probability by about 0.16. Figure 2 also shows Student B as not at-risk with a predictive value of 0.448,
influenced by features like ‘SEARCH’, ‘SEARCH_JUMP’, and ‘srl_m_30’, which lower the risk probability.
     LIME’s plot in Figure 3 indicates Student A as at-risk with a 0.66 predictive probability, consistent
with the value shown in the SHAP analysis. Influential features include ‘LINK_CLICK’,
‘SEARCH_JUMP’, and ‘srl_s_3’. Conversely, features like ‘CLEAR_HW_MEMO’ and
‘OPEN_RECOMMENDATION’ contribute to a lower risk prediction. Figure 3 predicts Student B as not
at-risk at 0.55 probability, significantly influenced by ‘RuntimeError’ and ‘HTTPError’.


Figure 3: LIME interpretation of student A, B (left to right)

     To identify the key features contributing to success in the LBLS dataset, we assess the contribution
of features to the prediction. For the SHAP analysis, we employed the global explanation to find the top
five features with the highest contribution values. Since LIME lacks a global explanation mechanism,
we aggregated the top five features with the most significant impact from each prediction. The five
most influential features in the global explanation of SHAP were ADD_RECOMMENDATION,
ADD_HW_MEMO, s_41, s_26, and TabError, whereas for LIME they were LINK_CLICK, HTTPError,
ZeroDivisionError, RecursionError, and s_32.
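
One possible way to aggregate per-prediction top features into a global list is sketched below. The text does not specify the aggregation rule, so counting how often each feature appears in a per-instance top-k list is an assumption, as are the function name `aggregate_top_features`, the feature names, and the weights.

```python
from collections import Counter


def aggregate_top_features(explanations, k=5):
    """Return the k features that most often appear in per-instance top-k lists.

    explanations is a list of {feature_name: local_weight} dicts, one per
    explained prediction. Frequency counting is an assumed aggregation rule.
    """
    counts = Counter()
    for weights in explanations:
        # Rank features of this instance by absolute local weight.
        top = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)[:k]
        counts.update(top)
    return [name for name, _ in counts.most_common(k)]


# Hypothetical local explanations for three predictions.
explanations = [
    {"f1": 0.5, "f2": -0.4, "f3": 0.1},
    {"f1": 0.3, "f2": 0.05, "f3": -0.2},
    {"f1": -0.6, "f3": 0.3, "f2": 0.1},
]
print(aggregate_top_features(explanations, k=2))
```

Other aggregation rules, such as averaging the absolute weights across instances, are equally plausible and may produce a different global list.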

    4.2. Reply to RQ2
We select 93 features that are statistically correlated with the prediction result using Spearman’s correlation
coefficient. In the next step of the analysis, we employ SHAP and LIME to explain the prediction
for Student A. Our focus is on capturing the rankings of all 93 features. Following this,
we use Kendall’s tau coefficient to assess the similarity between these rankings.

Table 1
Features that are statistically correlated with the prediction result
 Features                      Description                                              Spearman’s ρ
 CLOSE                         Closed the book.                                             -0.11*
 OPEN                          Opened the book.                                            -0.12**
 PAGE_JUMP                     Jumped to a particular page.                                0.17***
 CODE_COPY                     Number of times a student copied code.                      0.33***
 CODE_EXECUTION                Number of times a student executed code.                    0.28***
 Other 88 features
*p < .05 **p < .01 ***p < .001
    In Figure 4, each blue dot represents a unique feature. When a dot lies on the diagonal, it
signifies that SHAP and LIME assign the same rank to that feature’s weight. For the interpretative
analysis of Student A and Student B, the Kendall’s tau values are 0.66 and 0.64, respectively, suggesting a
moderate but noticeable positive correlation between the two rankings. This implies that a feature
ranked higher by one method is generally ranked higher by the other, although the relationship is
not exceptionally strong. The p-values for the weight rankings are remarkably low, all
below 0.001, which reinforces the significance of this correlation.
    The graph reveals a tendency for the features’ weight rankings, as determined by both
interpretation methods, to cluster near the diagonal, particularly those with higher (towards the start)
and lower weights (towards the end). This pattern suggests a greater consistency in how both
methods evaluate these features. Conversely, the rankings of features in the central region of the
graph tend to be more dispersed.
    For the two prediction points, SHAP and LIME show a moderate level of consistency in assessing
feature importance, with feature rankings tending to cluster near the diagonal line, indicating
higher consistency in evaluating the most and least important features. The dispersion of feature
rankings in the central area of the graph suggests greater variability in interpreting features of medium
importance. The low p-values enhance the credibility of the results, suggesting that the observed
correlations are not random but reflect underlying patterns in the data.

Table 2
Top five features by weight ranking according to SHAP and LIME
     Rank          Feature Weight Ranking by SHAP               Feature Weight Ranking by LIME
       1                          srl_s_9                                 srl_m_17
       2                   FileNotFoundError                              srl_m_15
       3                         srl_m_3                                     PREV
       4                         srl_m_9                                   srl_s_23
       5                    ConversionError                                  s_10
Other 88 features


Figure 4: Kendall’s Tau Rank Correlation of student A and B

     Finally, we compared the feature weight rankings explained by SHAP and LIME for each prediction
point pairwise, calculating the average of Kendall's tau and p-value. We obtained an average Kendall's
tau of 0.623 and an average p-value of 0.000979. This suggests that there is also a moderate to strong
correlation in the feature importance rankings between the two methods for each prediction point.
In other words, the rankings of feature importance are relatively consistent between the two methods,
and the p-value being far below 0.05 shows that the correlation in rankings between SHAP and LIME
is statistically significant.

5. Conclusion
In this study, we emphasize the importance of xAI in preventing over-generalization of machine
learning algorithms, especially in the field of learning analytics. We used PCA for feature extraction,
compared the accuracies of multiple models, and selected one that is both simple and highly
accurate. We then combined various statistical methods to check whether SHAP and LIME explanations of
feature weight rankings are consistent. The results show moderate consistency in SHAP and LIME
rankings among 93 selected features related to prediction outcomes, with high confidence. In learning
analytics, divergent results from xAI in predicting at-risk students can complicate strategy formulation
for stakeholders. Our study has analyzed explanations for two students predicted with different labels.
Future research could explore which explanation is more trustworthy when consistency is lacking,
whether to sacrifice model accuracy for higher consistency, or whether to involve more human intuition in
assessing the reasonableness of explanations. As for identifying key features of student learning
performance and formulating strategies for adaptive development, this undoubtedly requires the
involvement of school teachers, educators, and psychologists.


Acknowledgments
This study is supported in part by the National Science and Technology Council of Taiwan under
contract number NSTC 112-2410-H-004-063-.

References
[1] Siemens, G., & Baker, R. S. d. (2012). Learning analytics and educational data mining: towards
     communication and collaboration. In Proceedings of the 2nd international conference on
     learning analytics and knowledge (pp. 252–254).
[2] Akçapınar, G., Altun, A., & Aşkar, P. (2019). Using learning analytics to develop early-
     warning system for at-risk students. International Journal of Educational Technology in
     Higher Education, 16 (1), 1–20.
[3] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art.
     IEEE Transactions on Systems, Man, and Cybernetics, Part C (applications and reviews), 40
     (6), 601–618.
[4] Birhane, A. (2021). Algorithmic injustice: a relational ethics approach. Patterns, 2 (2).
[5] Scholes, V. (2016). The ethics of using learning analytics to categorize students on risk.
     Educational Technology Research and Development, 64 (5), 939–955.
[6] Holzinger, A., Saranti, A., Molnar, C., Biecek, P., & Samek, W. (2020). Explainable ai methods-
     a brief overview. In International workshop on extending explainable ai beyond deep models
     and classifiers (pp. 13–38).
[7] Rudin, C. (2019). Stop explaining black box machine learning models for high stakes
     decisions and use interpretable models instead. Nature machine intelligence, 1 (5), 206–215.
[8] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the
     predictions of any classifier. In Proceedings of the 22nd acm sigkdd international conference
     on knowledge discovery and data mining (pp. 1135–1144).
[9] Islam, M. R., Ahmed, M. U., Barua, S., & Begum, S. (2022). A systematic review of explainable
     artificial intelligence in terms of different application domains and tasks. Applied Sciences,
     12 (3), 1353.
[10] De Laet, T., Millecamp, M., Broos, T., De Croon, R., Verbert, K., & Duorado, R. (2020).
     Explainable learning analytics: challenges and opportunities. In Companion proceedings of
     the 10th international conference on learning analytics & knowledge lak20 society for
     learning analytics research (solar) (pp. 500–510).
[11] Alonso, J. M., & Bugarín, A. (2019). Expliclas: automatic generation of explanations in natural
     language for weka classifiers. In 2019 ieee international conference on fuzzy systems (fuzz-
     ieee) (pp. 1–6).
[12] Liu, B., & Udell, M. (2020). Impact of accuracy on model interpretations. arXiv preprint
     arXiv:2011.09903 .
[13] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions.
     Advances in neural information processing systems, 30.
[14] Lu, O. H., Huang, A. Y., Flang, B., Ogata, H., & Yang, S. J. (2022). A quality data set for data
     challenge: Featuring 160 students’ learning behaviors and learning strategies in a
     programming course. In 2022 30th International Conference on Computers in Education.
     ICCE.
[15] Flanagan, B., Ogata, H. (2018). Learning Analytics Platform in Higher Education in Japan,
     Knowledge Management & E-Learning (KM&EL), Vol.10, No.4, pp.469-484.
[16] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and
     intelligent laboratory systems, 2 (1-3), 37–52.
[17] Gunning, D., & Aha, D. W. (2019). DARPA’s explainable artificial intelligence program. AI Mag, 40 (2), 44.
[18] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30 (1/2), 81–93.
[19] Abdi, H. (2007). The kendall rank correlation coefficient. Encyclopedia of Measurement and
     Statistics. Sage, Thousand Oaks, CA, 508–510.