Enhancing Trust in Generative AI: Investigating
                         Explainability of LLMs to Analyse Confusion in MOOC
                         Discussions
                         Yuanyuan Hu1, Nasser Giacaman1 and Claire Donald1
                         1 The University of Auckland, 5 Grafton Rd, Auckland, 1010, New Zealand


                                            Abstract
                                            Providing feedback to address learners’ confusion in a personalised and timely manner can enhance learning
                                            engagement and deeper understanding in large-scale online courses, particularly Massive Open Online Courses
                                            (MOOCs). This goal aligns with a key objective within the Learning Analytics (LA) community. The advent
                                            of Generative Artificial Intelligence (GenAI) tools presents the potential to identify learners’ confusion in vast
                                            numbers of discussion texts and provide automatically-generated and adaptive feedback to learners rapidly.
                                            However, a lack of trust in AI-generated content among educators and learners is an obstacle to building
                                            effective GenAI-based LA solutions. This paper discusses the potential of enhancing trust in GenAI tools by
                                            improving the transparency and explainability of the large language models (LLMs) — a foundation of GenAI.
                                            We illustrate this approach through a pilot study where we apply an explainable AI (XAI) method — the
                                            Integrated Gradients — to decipher LLM-based predictions regarding learners’ confusion in MOOC
                                            discussions. The findings suggest promising reliability in the XAI method’s ability to identify word-level
                                            indicators of confusion in MOOC messages. The paper concludes by advocating the integration of XAI
                                            methods in GenAI applications, aiming to foster wider acceptance and efficacy of future GenAI-based LA
                                            solutions.

                                            Keywords
                                            Generative AI, Trust, Explainable AI, Learning Analytics, AIED, XAI, XAI-ED1


                         1. INTRODUCTION
                             Confusion, a common emotion during learning, is often an obstacle for learners to move forward
                         [1]. While a certain level of confusion can encourage learning engagement [2], this confusion may also
                         evolve into frustration and finally lead to boredom without timely interventions [3]. In distance learning
                         contexts, particularly Massive Open Online Courses (MOOCs), low participation and drop-out rates
                         may increase due to the impact of learners’ emotions, such as confusion [4].
                             MOOCs offer high-quality, open-access, rich, online learning resources, and micro-credentials
                         regardless of university-entry barriers, empowering a diverse range of learners to study at their own
                         pace. As learning in MOOCs is entirely virtual and asynchronous, discussion forums become key
                         venues for interaction and communication between learners and instructors. In MOOCs, resolving
                         numerous queries and confusion raised by a huge number of learners in discussion forums is a
                         significant challenge due to the limited availability of educators [5, 6, 7]. Behavioural and physiological
                         measures, such as facial expressions and skin conductance, have successfully discerned learners’
                         confusion in traditional small to medium classrooms [8]. However, these measures are impractical to
                         be implemented in MOOCs. Researchers explore solutions to provide adaptive, immediate responses
                         to address learners’ confusion and improve learning engagement in MOOC discussion forums [9]. This
                         objective is also a crucial goal of the Learning Analytics (LA) community [10].
                             The increasing availability of generative artificial intelligence (GenAI) tools, such as ChatGPT [11]
                         and Gemini [12], has opened fresh possibilities for LA research. Investigating the feasibility of applying
                         GenAI tools in higher education practices has displayed promising outcomes, such as automatic
                         generation of academic writing [13], adaptive responses to discussion-forum posts [9], automated code
                         review [14], and personalised summary and feedback on students’ writing assignments [15]. Despite
                         the opportunities in education, the breakthroughs of GenAI techniques have also sparked debates on
                         their ethical concerns, such as biases and reliability concerns about the texts generated, namely trust in

                         Joint Proceedings of LAK 2024 Workshops, co-located with 14th International Conference on Learning Analytics and
                         Knowledge (LAK 2024), Kyoto, Japan, March 18-22, 2024.
                            y.hu@auckland.ac.nz (Y. Hu); n.giacaman@auckland.ac.nz (N. Giacaman); c.donald@auckland.ac.nz (C. Donald)
                                0000-0003-2592-8473 (Y. Hu); 0000-0001-6885-1571 (N. Giacaman); 0000-0003-2803-7537 (C. Donald)
                                       © 2023 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
GenAI [16, 17, 18]. Also, the EU Parliament calls for safety, transparency, accountability, equality, and
eco-friendliness in AI techniques to avoid harmful effects [19].
    Deficiency of trust in AI-generated content among educators and learners is an obstacle to
developing reliable and effective GenAI-based LA (Gen-LA) solutions for teaching and learning [16,
20, 21]. This issue of deficiency stems from the fundamental architecture of GenAI, specifically the
transformer-based large language models (LLMs). A notable challenge faced by LLMs and inherited
by GenAI is deep learning models’ inability to explain the mechanisms and reasoning in their decision-
making processes [22]. Explainable artificial intelligence (XAI) methods can contribute to interpreting
obscured ‘black-box’ mechanisms hiding behind deep learning models [22, 23]. Applying XAI
techniques to provide clear rationales in AI-generated content is required for designing and developing
trustworthy educational AI systems [24].
    Based on these studies, our research interest focuses on investigating the potential of employing XAI
methods to decode word-level indicators in the prediction made by LLMs, particularly in identifying
learners’ confusion in MOOC discussions. Detecting learners’ confusion timely and accurately is a
prerequisite for providing them with adaptive feedback in MOOC discussions. Earlier studies decoded
linguistic cues as indicators of confusion in MOOC discussions using different machine learning and
deep learning models [1, 25, 26, 27]. As a preliminary step of an ongoing project, this small project
attempts to provide proof of concept for enhancing trust in future GenAI applications by improving the
transparency and explainability of LLMs. Thus, the research question in this paper is “What can we
gain from using XAI methods to interpret LLMs’ predictions of learners’ confusion in MOOC
discussions?” We assume that XAI methods can discern positive and negative indicators behind LLMs’
processes for identifying confusion in MOOC discussions. We conduct a test case in this paper to
examine this assumption.
    In the following section, we will review the related work that shaped our study: learners’ confusion
identification in MOOC discussions and applications of XAI methods to interpret important features of
confusion predictions. Subsequently, we will illustrate a pilot study to address our research question.
Finally, suggestions for improving trust in future GenAI-LA solutions based on the implications of the
pilot study will be expounded at the end of this paper.

2. RELATED WORK
    A pioneering study on the detection of learners’ confusion in MOOC discussions is Agrawal et al.’s
research [28] that developed a system using bag-of-words, the conversational position of discussion
messages, the number of likes, and learners’ grades to identify confusion and recommended minute-
resolution video clips to learners accordingly. This study also developed the Stanford MOOC discussion
data sets, which are used in our pilot study explained in Section 3.1. These data sets were also used in
most of the previous studies on analysing learners’ confusion in MOOC discussions [1, 26, 27, 29, 30,
31].
    After Agrawal et al.’s work [28], Bakharia [30] applied a Support-vector-machine classifier to detect
confusion, urgency, and sentiments in MOOC discussions, achieving over 70% accuracy rates in
domain-specific courses. Zeng et al. [32] trained an Elastic-net model using content-related features
(e.g., readability index, the number of words in a post, topicality, etc.) and community-related features
(e.g., the number of likes and reads, etc.) to identify confusion and urgency in MOOC discussions,
reaching an over 80% accuracy in specific domain data sets.
    Building on the previous work, Atapattu et al. [1, 26] applied a random forest classifier with solely
linguistic features to identify learners’ confusion in MOOC discussions, improving the accuracy to over
83% F1 score in all domain-specific data sets and to F1 scores between 70.7% to 84.5% in cross-domain
data sets. These approaches not only underscore the significant role of linguistic features in identifying
learners’ confusion within MOOC discussions, but also imply the nuanced language cues that can
distinguish confusion messages from non-confusion ones.
    Other trials that explore machine learning and deep learning methods to detect confusion in MOOC
discussions after Atapattu et al.’s [1, 26] work focused merely on enhancing classification performance
rather than deciphering indicators of learners’ confusion, such as applying a Transformers classification
model in Chanaa and El Faddouli [31] and comparing different machine learning methods in
Bhumireddy and Anala [33].
    Alrajhi et al. [25] offer a preliminary example of employing XAI methods to interpret the prediction
reasons of a Transformers model with ontology methods to detect urgent MOOC discussion messages.
While this study offers valuable insights for investigating XAI methods, providing more details about
the prediction and interpretation processes would be more helpful for further studies in this area. Du
and Xing [27] developed an explainable text classifier framework to identify confusion in MOOC
discussions based on a legal services model. However, they mentioned that their work might have
limitations in interpreting negative indicators for different levels of confusion.

3. A PILOT STUDY
   In this section, we demonstrate a pilot study to investigate the research question proposed in Section
1. We will explain the data sets used in this study, the architecture of LLMs-based classifiers for
confusion detection, an XAI method employed to interpret word-level indicators for model predictions,
and experimental results.

    3.1. Datasets Description
   Data sets used in this study came from discussion posts and replies in the Stanford University public
MOOCs, which contain archived runs of eleven courses [28]. These courses involve multiple topics
mainly from three disciplines: Education, Medicine, and Humanities. Each message — a discussion
post or a reply — was classified by three expert coders independently into degrees of confusion from 1
(extremely knowledgeable) to 7 (extremely confused). In our pilot study, the degree of 4 was regarded
as neutral, 1 to 3 as non-confusion, and 5 to 7 as confusion, which was the same categorisation way in
Atapattu et al.’s work [1].
   Following Atapattu et al.’s [1] work, we trained and tested our binary classification model, which
will be explained in the next section, through two experiments. One included the neutral messages in
the confusion ones as a ‘broader’ confusion class. The other excluded these neutral messages in training
and testing processes. Both experiments’ outcomes will be illustrated in Section 3.3. We pre-processed
the text data by expanding abbreviations, eliminating repeated characters and extraneous spaces, and
removing messages with less than three words. Table 1 demonstrates the distribution of messages
classified in each category across three domain-specific data sets after the data cleaning.

Table 1
Number of messages classified as Non-Confusion, Confusion, and Neutral in Education, Medicine,
Humanities, and all three data sets.
 Set                      Non-Confusion         Confusion               Neutral
 Education                6650                  638                     2446
 Medicine                 1577                  1587                    6339
 Humanities               1533                  2252                    5872
 Total                    9760                  4477                    14657


    3.2. Classifier Architecture and the XAI method
    We applied and fine-tuned a DistilBERT model — a faster and lightweight transformer-based deep
learning model [34] — to predict confusion or non-confusion of messages in MOOC discussions. This
pre-trained model has been applied to a broader range of natural language processing solutions,
particularly where the implementation of LLMs is not feasible due to hardware resource limitations.
The DistilBERT model has achieved excellent performance in sentiment analysis tasks [35]. As a faster
and smaller LLM but maintaining a competitive level of accuracy, the DistilBERT model will be a
better option for our pilot study on the model explanation process of the automatic confusion analysis
to provide concept of concept rather than very large and computationally expensive LLMs, such as
GPT-4 [36].
    Based on Alrajhi et al.’s work [25], we employed the Integrated Gradients method from the Captum
library for PyTorch [37] to gain a deeper understanding of the decision-making processes (i.e., positive,
or negative indicators) within our DistilBERT-based confusion classifier. The Integrated Gradients
method computes the prediction feature importance by integrating gradients of the deep learning
model’s outputs (e.g., classes) regarding the inputs (e.g., words and sentences), from non-informative
baseline inputs to actual inputs, evaluating each feature’s contribution to the prediction output.

    3.3. Results

        3.3.1. Classification performance
   In training and testing processes of domain-specific sets, our fine-tuned DistilBERT model achieved
the best-performing weighted-averaged F1 scores of 0.74, 0.90, 0.87 and 0.83 in the Education,
Medicine, Humanities, and all three data sets, respectively, where we regarded neutral messages as
confusion messages based on Atapattu et al.’s work [1]. When we excluded these neutral messages in
the training step, weighted-averaged F1 scores increased to 0.95, 0.90, 0.90, and 0.92, as summarised
in Table 2. Our models reached an average higher performance than random forest classifiers applied
in a previous study with and without neutral messages [1]. These results suggest that the neutral
messages, which were classified between confusion and non-confusion messages by expert coders,
affect the model performance, particularly in the Education set where neutral messages contributed the
major percentage.

Table 2
Fine-tuned model classification performance in Education, Medicine, Humanities, and all three data
sets.
                                  Weighted average F1             Weighted average F1
 Set
                                  (including neutral data)        (excluding neutral data)
 Education                        0.74                            0.94
 Medicine                         0.90                            0.90
 Humanities                       0.87                            0.90


        3.3.2. Word-level indicators for confusion identification
    This section presents results from experiments designed to predict confusion and non-confusion in
the MOOC messages, where neutral messages were excluded. The reliability of these experiments is
underscored by the model’s high performance, achieving over 0.90 F1 scores. Due to the page limits of
this workshop paper, interpretation samples from the best-performing Education set are displayed in
Figure 1 and Figure 2 as examples. We highlight negative indicators in red and positive ones in green.
The intensity of the green correlates with the strength of the positive attribution.
    While the paper only showcases examples from the Education dataset, we provide a summary of the
findings from experiments conducted on domain-specific and all three datasets as follows. Strong word-
level indicators to predict MOOC learners’ confused messages positively are 1) first-person singular
and plural, 2) question stems, 3) question bigrams, 4) confusion expressions, and 5) the question mark.
Strong indicators that can predict non-confused messages positively are 1) second-person pronouns and
2) academic writing expressions. These interpretation outcomes by the XAI method strongly align with
the indicators found in previous studies [1, 26].

4. IMPLICATIONS AND FUTURE WORK
    4.1. Implications
   We can answer our research questions as follows. Outcomes of our pilot study demonstrate
promising reliability of using the Integrated Gradients method with the fine-tuned DistilBERT model
to discern word-level predictors in the MOOC discussions. This is because indicators of confusion
detected in our study are in line with the linguistic indicators identified by the previous studies using
tree-based machine learning classifiers [1, 26]. Unlike hidden computation of deep learning models,
tree-based machine learning algorithms are often regarded as “white-box” models due to their clear,
transparent decision-making rules and easy, straightforward tracking paths of every-step impacts of
input features on outputs. This is also the main reason that white-box algorithms can be preferences for
educational studies [38]. Robustness of a certain degree can be implied if indicators from the XAI
method are similar to the important features from white-box algorithms. Future research can employ
XAI methods in tandem with LLMs to enhance the transparency and trustworthiness of deep learning
mechanisms, leveraging GenAI-LA solutions to be more accessible and understandable to non-
technical audiences.
    A possible application of XAI techniques in GenAI-LA solutions is offering clear and UX-friendly
designed rationales, along with automatically generated and personalised feedback to urgent MOOC
posts. A previous study suggests that GPT2-generated replies to MOOC posts can reach a similar degree
of emotional and community support as human tutors although a lower extent of informality [39]. This
promising result encourages further studies to investigate the potential of applying GenAI techniques
to provide learners with automatic responses in large-scale online learning scenarios. We recommend
employing XAI methods to highlight words or phrases that attribute high importance to GenAI’s
decision-making processes for each part of the responses generated. In this way, learners and educators
can gain insights into how AI tools approach their queries, improving their trust in AI-generated content.
Also, learners can refine question-posing strategies in discussion forums to elicit accurate responses
from GenAI agents according to rationales provided by XAI methods.
    XAI methods can also be integrated into other GenAI-LA applications such as AI-assisted writing
assessment. A writing analytics tool, AcaWriter, applies XAI designs to offer sentence-level and
document-level feedback in learners’ academic writing assessments [40]. A recent study indicates that
ChatGPT can generate high-quality feedback on summarising topics of students’ assignments and
providing process-focus suggestions [15]. We assume XAI methods also have the potential to offer
distributed rationales at word, sentence, concept, and organisation levels with grading rubrics during
GenAI-assisted writing processes. In this way, scores and advice provided by GenAI tools would
become more transparent and credible to both learners and educators.
    The LA community calls for redefining our perception of learners in the AI era [41, 42]. Learners
can gain personalised feedback from GenAI as a new way of learning. At the same time, learners can
iteratively coach a GenAI tool to align its responses with their expectations. GenAI is regarded as a full
participant in conversational education systems now [43]. With the improvement of transparency and
explainability by providing learners with rationales in AI’s decision-making mechanisms, they will
coach GenAI more easily and effectively for personalised learning demands. This reciprocal learning
model, akin to the ‘Ako’ concept from Māori culture where roles between educator and learner are
interchangeable, may offer innovative ways to enhance skills such as problem-solving, collaboration,
and self-regulated learning in the AI era.

      4.2. Limitations and Future Work
   This study has two main limitations. Firstly, the pilot study only provides a trial of using an XAI
method to explain the positive and negative indicators in confusion predictions of the LLMs-based
classifier, which is a vital foundation of GenAI. This XAI method may not be directly extended to
GenAI models. Secondly, the LLMs-based classifier for identifying confusion messages was trained
and fine-tuned by using discussion data from three domains (i.e., education, medicine, and humanities),
which still needs further refinement on MOOC discussions from other domains to improve the model’s
generalisability.
   Our future work will investigate the feasibility of using XAI methods to detect key indicators within
learners’ queries that result in content generated by GenAI. We will also explore methods to visualise
these indicators at a word level in a way that is intuitive and readable for learners and educators. This
future research will enhance the feasibility and user-friendliness of GenAI-LA solutions towards
human-AI collaboration on teaching and learning processes in the age of AI.

References
[1]     T. Atapattu, K. Falkner, M. Thilakaratne, L. Sivaneasharajah, and R. Jayashanka, “An
        Identification of Learners’ Confusion through Language and Discourse Analysis,” arXiv
        preprint arXiv:1903.03286, 2019.
[2]     S. D’Mello and A. Graesser, “Confusion and its dynamics during device comprehension with
        breakdown scenarios,” Acta Psychol (Amst), vol. 151, pp. 106–116, 2014, doi:
        10.1016/j.actpsy.2014.06.005.
[3]    M. K. Chandrasekaran, M.-Y. Kan, B. C. Y. Tan, and K. Ragupathi, “Learning Instructor
       Intervention from MOOC Forums: Early Results and Issues,” Apr. 2015, [Online]. Available:
       http://arxiv.org/abs/1504.07206
[4]    P. M. Moreno-Marcos, C. Alario-Hoyos, P. J. Munoz-Merino, and C. D. Kloos, “Prediction in
       MOOCs: A Review and Future Research Directions,” IEEE Transactions on Learning
       Technologies, vol. 12, no. 3, pp. 384–401, Jul. 2019, doi: 10.1109/TLT.2018.2856808.
[5]    I. Buchem et al., “Integrating Mini-Moocs into Study Programs in Higher Education During
       Covid-19. Five Pilot Case Studies in Context of the Open Virtual Mobility Project,” Human and
       Artificial Intelligence for the Society of the Future, pp. 299–310, 2020, doi: 10.38069/edenconf-
       2020-ac0028.
[6]    H. Cha and H. J. So, Integration of Formal, Non-formal and Informal Learning Through
       MOOCs. Springer Singapore, 2020. doi: 10.1007/978-981-15-4276-3_9.
[7]    Y. Hu, C. Donald, and N. Giacaman, “Cross Validating a Rubric for Automatic Classification
       of Cognitive Presence in MOOC Discussions,” International Review of Research in Open and
       Distributed       Learning,      vol.    23,     no.     2,    pp.    242–260,      2021,    doi:
       https://doi.org/10.19173/irrodl.v23i3.5994.
[8]    A. Arguel, L. Lockyer, O. V. Lipp, J. M. Lodge, and G. Kennedy, “Inside Out: Detecting
       Learners’ Confusion to Improve Interactive Digital Learning Environments,” Journal of
       Educational Computing Research, vol. 55, no. 4, pp. 526–551, 2017, doi:
       10.1177/0735633116674732.
[9]    C. Li and W. Xing, “Natural Language Generation Using Deep Learning to Support MOOC
       Learners,” Int J Artif Intell Educ, vol. 31, no. 2, pp. 186–214, Jun. 2021, doi: 10.1007/s40593-
       020-00235-x.
[10]   “What is Learning Analytics?” Accessed: Jan. 23, 2024. [Online]. Available:
       https://www.solaresearch.org/about/what-is-learning-analytics/
[11]   “ChatGPT.” Accessed: Jan. 23, 2024. [Online]. Available: https://chat.openai.com
[12]   “Gemini.”           Accessed:         Jan.       23,        2024.      [Online].       Available:
       https://deepmind.google/technologies/gemini
[13]   Ö. Aydin and E. Karaarslan, OpenAI ChatGPT Generated Literature Review: Digital Twin in
       Healthcare, vol. 2. İzmir Akademi Dernegi, 2022. [Online]. Available:
       https://ssrn.com/abstract=4308687
[14]   E. A. Oliveira, S. Rios, and Z. Jiang, “AI-powered peer review process: An approach to enhance
       computer science students’ engagement with code review in industry-based subjects,” in
       ASCILITE 2023 Conference Proceedings: People, Partnerships and Pedagogies, Christchurch,
       New Zealand, 2023.
[15]   W. Dai et al., “Can Large Language Models Provide Feedback to Students? A Case Study on
       ChatGPT,” in 2023 IEEE International Conference on Advanced Learning Technologies
       (ICALT), IEEE, 2023, pp. 323–325. [Online]. Available: https://chat.openai.com/
[16]   A. Matin et al., “Trust in Generative AI among students: An Exploratory Study,” in IEEE
       International Conference on Program Comprehension, IEEE Computer Society, 2022, pp. 36–
       47. doi: 10.1145/nnnnnnn.nnnnnnn.
[17]   M. Amoozadeh et al., “Towards Characterizing Trust in Generative Artificial Intelligence
       among Students,” ICER ’23: Proceedings of the 2023 ACM Conference on International
       Computing Education Research, vol. 2, Aug. 2023, doi: 10.1145/3617367.
[18]   V. , Charisi et al., “Artificial Intelligence and the Rights of the Child Towards an Integrated
       Agenda for Research and Policy,” Luxembourg, 2022. doi: 10.2760/012329.
[19]   “AI Act: a step closer to the first rules on Artificial Intelligence,” European Parliament.
       Accessed: Jan. 23, 2024. [Online]. Available: https://www.europarl.europa.eu/news/en/press-
       room/20230505IPR84904/ai-act-a-step-closer-to-the-first-rules-on-artificial-intelligence
[20]   S. Hashim, M. K. Omar, H. Ab Jalil, and N. Mohd Sharef, “Trends on Technologies and
       Artificial Intelligence in Education for Personalized Learning: Systematic Literature Review,”
       International Journal of Academic Research in Progressive Education and Development, vol.
       11, no. 1, Feb. 2022, doi: 10.6007/ijarped/v11-i1/12230.
[21]   C. Ling Thong, R. Butson, and L. WeiLee, “Understanding the impact of ChatGPT in education:
       Exploratory study on students’ attitudes, perception and ethics,” in ASCILITE 2023, 2023.
       [Online]. Available: https://www.aleks.com
[22]   A. Barredo Arrieta et al., “Explainable Explainable Artificial Intelligence (XAI): Concepts,
       taxonomies, opportunities and challenges toward responsible AI,” Information Fusion, vol. 58,
       no. December 2019, pp. 82–115, 2020, doi: 10.1016/j.inffus.2019.12.012.
[23]   D. Gunning, “Explainable artificial intelligence (xai),” Defense Advanced Research Projects
       Agency (DARPA), nd Web, vol. 2, no. 2, 2017, [Online]. Available:
       https://www.cc.gatech.edu/~alanwags/DLAI2016/(Gunning) IJCAI-16 DLAI WS.pdf
[24]   H. Khosravi et al., “Explainable Artificial Intelligence in education,” Computers and Education:
       Artificial Intelligence, vol. 3, no. May, 2022, doi: 10.1016/j.caeai.2022.100074.
[25]   L. Alrajhi, F. D. Pereira, A. I. Cristea, and T. Aljohani, “A Good Classifier is Not Enough: A
       XAI Approach for Urgent Instructor-Intervention Models in MOOCs,” in Artificial Intelligence
       in Education (AIED 2022), 2022, pp. 424–427. doi: 10.1007/978-3-031-11647-6_84.
[26]   T. Atapattu, K. Falkner, M. Thilakaratne, L. Sivaneasharajah, and R. Jayashanka, “What Do
       Linguistic Expressions Tell Us about Learners’ Confusion? A Domain-Independent Analysis in
       MOOCs,” IEEE Transactions on Learning Technologies, vol. 13, no. 4, pp. 878–888, Oct. 2020,
       doi: 10.1109/TLT.2020.3027661.
[27]   H. Du and W. Xing, “Leveraging explainability for discussion forum classification: Using
       confusion detection as an example,” Distance Education, vol. 44, no. 1, pp. 190–205, 2023, doi:
       10.1080/01587919.2022.2150145.
[28]   A. Agrawal, J. Venkatraman, S. Leonard, and A. Paepcke, “YouEDU: Addressing confusion in
       MOOC discussion forums by recommending instructional video clips,” Proceedings of the 8th
       International Conference on Educational Data Mining, pp. 297–304, 2015, [Online]. Available:
       http://ilpubs.stanford.edu:8090/1125/1/you_edu.pdf
[29]   O. Almatrafi, A. Johri, and H. Rangwala, “Needle in a haystack: Identifying learner posts that
       require urgent response in MOOC discussion forums,” Comput Educ, vol. 118, pp. 1–9, Mar.
       2018, doi: 10.1016/J.COMPEDU.2017.11.002.
[30]   A. Bakharia, “Towards cross-domain MOOC forum post classification,” in L@S 2016 -
       Proceedings of the 3rd 2016 ACM Conference on Learning at Scale, Association for Computing
       Machinery, Inc, Apr. 2016, pp. 253–256. doi: 10.1145/2876034.2893427.
[31]   A. Chanaa and N. E. El Faddouli, “BERT and Prerequisite Based Ontology for Predicting
       Learner’s Confusion in MOOCs Discussion Forums,” in Lecture Notes in Computer Science
       (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
       Bioinformatics), Springer, 2020, pp. 54–58. doi: 10.1007/978-3-030-52240-7_10.
[32]   Z. Zeng, S. Chaturvedi, and S. Bhat, “Learner Affect Through the Looking Glass:
       Characterization and Detection of Confusion in Online Courses,” in the 10th International
       Conference on Educational Data Mining, 2017, pp. 272–277.
[33]   G. Bhumireddy and V. A. S. M. Anala, “Comparison of Machine Learning algorithms on
       detecting the confusion of students while watching MOOCs,” Master of Science, Blekinge
       Institute of Technology, Karlskrona, Sweden, 2022. Accessed: Dec. 14, 2022. [Online].
       Available: www.bth.se
[34]   V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT:
       smaller,     faster,    cheaper      and    lighter,”    Oct.    2019,    [Online].    Available:
       http://arxiv.org/abs/1910.01108
[35]   M. Jojoa, P. Eftekhar, B. Nowrouzi-Kia, and B. Garcia-Zapirain, “Natural language processing
       analysis applied to COVID-19 open-text opinions using a distilBERT model for sentiment
       categorization,” AI Soc, 2022, doi: 10.1007/s00146-022-01594-w.
[36]   OpenAI et al., “GPT-4 Technical Report,” Mar. 2023, [Online]. Available:
       http://arxiv.org/abs/2303.08774
[37]   N. Kokhlikyan et al., “Captum: A unified and generic model interpretability library for
       PyTorch,” pp. 1–11, 2020, [Online]. Available: http://arxiv.org/abs/2009.07896
[38]   Y. Hu, R. F. Mello, and D. Gašević, “Automatic analysis of cognitive presence in online
       discussions: An approach using deep learning and explainable artificial intelligence,”
       Computers and Education: Artificial Intelligence, vol. 2, p. 100037, 2021, doi:
       10.1016/j.caeai.2021.100037.
[39]   C. Li and W. Xing, “Natural Language Generation Using Deep Learning to Support MOOC
       Learners,” Int J Artif Intell Educ, vol. 31, no. 2, pp. 186–214, Jun. 2021, doi: 10.1007/S40593-
       020-00235-X/FIGURES/6.
[40]   S. Knight et al., “AcaWriter A learning analytics tool for formative feedback on academic
       writing,” J Writ Res, vol. 12, no. 1, pp. 141–186, 2020, doi: 10.17239/JOWR-2020.12.01.06.
[41]   D. Clow, “The learning analytics cycle: Closing the loop effectively,” in the 2nd International
       Conference on Learning Analytics and Knowledge - LAK ’12, Vancouver, BC., 2012, pp. 134–
       138. doi: 10.1145/2330601.2330636.
[42]   L. Yan, R. Martinez-Maldonado, and D. Gašević, “Generative Artificial Intelligence in Learning
       Analytics: Contextualising Opportunities and Challenges through the Learning Analytics
       Cycle,” Nov. 2023. [Online]. Available: http://arxiv.org/abs/2312.00087
[43]   M. Sharples, “Towards social generative AI for education: theory, practices and ethics,”
       Learning: Research and Practice, vol. 9, no. 2, pp. 159–167, Jun. 2023, doi:
       10.1080/23735082.2023.2261131.
Appendix


Figure 1: Samples of confusion messages in the Education Dataset
Figure 2: Samples of non-confusion messages in the Education Data set