<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>User-centric Evaluation of GenAI Alignment and Recommendations based on Predictive Learning Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hesham Ahmed</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Halil Kayaduman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonsoles López-Pernas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markku Tukiainen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Saqr</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Distance Education Application and Research Center, Inonu University</institution>
          ,
          <addr-line>Malatya</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Eastern Finland</institution>
          ,
          <addr-line>Yliopistokatu 2, 80100 Joensuu</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Predictive models are one of the hallmarks of learning analytics research, relying on learner data to predict academic achievement and dropouts, enabling targeted interventions. Using a user-centric evaluation framework, we assessed the recommendations generated by ChatGPT based on the results of 136 studies that used student data for predictive modeling. The evaluation considered general attributes (accuracy, coherence, justification) as well as education-specific criteria (alignment with learning theories, ethics, learner-centeredness). The results indicate that, while LLM-generated recommendations are generally accurate, coherent and useful, they often lack alignment with diverse learning theories and fail to address inclusivity and higher-order cognitive skills effectively. Therefore, to operationalize LLMs to provide automated feedback to students, these aspects should be explicitly considered in the prompt design.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative AI</kwd>
        <kwd>Predictive Learning Analytics</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>User-centric Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Generative Artificial Intelligence describes a set of computational techniques that can generate
mostly comprehensible content, in the form of text, images, and video, that is new with respect to the training data
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Large Language Models (LLMs) are a subset of such techniques based on neural networks
trained on hundreds of terabytes of textual data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An example of such models is ChatGPT,
which has demonstrated human-like performance on a wide range of natural-language tasks,
from translation to writing intelligible essays and producing functional code [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This,
in turn, has encouraged many to explore its possible benefits in the realm of education [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Some studies have explored the utility of LLM-powered recommendation systems in
education-related contexts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Although adoption is still in its early stages, it is necessary
to understand their impact not only from a functional standpoint [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], but also in terms of user
experience and alignment with pedagogical objectives. Teachers, students, and other stakeholders
will use these tools in varied ways, underlining the need for user-centric evaluation to ensure that
the generated recommendations are of high quality across aspects such as implementability,
alignment with learning theories, and ethicality.
      </p>
      <p>
        Such recommendation systems can be built to support specific objectives; for example, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] aimed
to support student learning recommendations using, among other techniques, Knowledge Graph
Contextualization. In our case, the recommendations are based on predictive Learning Analytics (LA)
models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Predictive LA is concerned with using learner-related data to predict
possible future scenarios so that interventions can be made to avoid the negative ones. Such
predictions result from revealing statistical correlations between features such as
previous academic performance, current credit load, and behavior on learning management systems.
      </p>
      <p>
        LLMs have generally been evaluated through rigorous frameworks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that focus on
assessing their performance automatically using standardized datasets and benchmarks. In a study
similar to ours, [10] evaluated an LLM-powered recommendation system using both objective
measures and user-centric subjective criteria based on a revised version of the ResQue framework
[11]. However, that recommendation system was concerned with leisure-related events and activities
and did not relate to education. Recently, we evaluated recommendations from a Retrieval-Augmented
Generation (RAG) system. RAG is a set of methods that enhances the quality of LLM
responses by supplying them with additional external knowledge [12]. Our RAG system relied on
predictive LA models extracted from state-of-the-art LA research. Its responses were generally
more specific than those of a typical LLM; however, in many cases they were not very
accurate and lacked precision.
      </p>
      <p>In this study, we aim to comprehensively evaluate the quality of recommendations provided by
LLMs based on their interpretation of learning analytics research findings. The ability to offer
recommendations rests on the ability to digest and translate research findings into practical
recommendations that satisfy different criteria; in some instances, the resulting recommendations
could be not only unintelligible but also potentially harmful.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this study, we follow the steps shown in Figure 1 to answer the following questions:
1. How accurate are LLMs in interpreting predictive learning analytics results and providing
useful recommendations to students?
2. How aligned are the recommendations with learning theories, learner-centeredness, ethics,
and engagement with higher-order cognitive skills?</p>
      <sec id="sec-2-1">
        <title>2.1. Studies collection</title>
        <p>The first step of this process was to collect predictive LA research through snowballing from
existing systematic literature reviews of studies that used student data to build predictive models of
student achievement, retention, success, and other student outcomes. First, we identified a total of 13
relevant systematic literature reviews (see Figure 2). The second step was mining the references of
these systematic reviews and compiling them into a list of 1,517 references. After eliminating duplicate
entries, non-English articles, and those published before 2011 by examining the title, abstract, and
keywords, a total of 476 articles remained. Next, each article was manually inspected to verify
whether it used student data (such as learning management system activity) to predict a target
(such as grades) and reported results that displayed the relation between the predictors/features
and the target in an interpretable format (such as a piece of text, a table, or a figure). To account
for possible mistakes and misses, the inclusion and exclusion procedure was validated by a second
researcher. In the end, a total of 136 articles fulfilled all the criteria and were passed to later stages.
The flow of this process is illustrated in Figure 2.</p>
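        <p>As an illustration of the screening procedure above, the following minimal sketch expresses the automatable part of the filtering (duplicate removal, language, and publication year) as a short script. The file name and column names are hypothetical; the actual screening combined metadata examination with manual inspection.</p>
        <preformat>
# Illustrative screening sketch (not the study's actual tooling).
# Assumes a hypothetical references.csv with columns: title, year, language.
import pandas as pd

refs = pd.read_csv("references.csv")           # 1,517 mined references
refs = refs.drop_duplicates(subset="title")    # remove duplicate entries
refs = refs[refs["language"] == "en"]          # keep English-language articles
refs = refs[refs["year"] >= 2011]              # keep articles published in 2011 or later
refs.to_csv("screened_candidates.csv", index=False)  # candidates for manual inspection
</preformat>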
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data extraction</title>
        <p>The following data were collected manually from each study in a tabular format: study title, year of
publication, level of education, type of study (STEM/non-STEM), duration of the data
collection, number of students, data sources (features), description of the features (if available),
target variable, prediction method, and format of the results (table/graph/text). The target
variable, the features and their descriptions, and the statistical model that describes the relation
between the predictors and the predicted outcome were also captured as screenshots in JPEG and PNG
formats. The rationale is that the formats in which such data were presented varied widely across
studies, so the screenshots helped standardize the format of the input to the LLM. The
screenshots were taken at the highest possible resolution to avoid misinterpretation and
hallucination.</p>
        <p>The dominant level of education targeted in the studies was university level, representing
approximately 74%. STEM studies represented the largest portion with 55.1%, followed by
mixed types with 28.7%. The median duration of the data collection was around one year. The
number of students in the studies was mostly below 1,000. Figure 3 shows the years in which the
collected studies were published.</p>
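        <p>Purely for illustration, the tabular extraction format described above can be thought of as one record per study, along the following lines; the field names below are ours, not the study's instrument.</p>
        <preformat>
# Illustrative record structure for the manual data-extraction step.
# Field names are hypothetical; they mirror the items listed in Section 2.2.
from dataclasses import dataclass, field

@dataclass
class StudyRecord:
    title: str
    publication_year: int
    education_level: str            # e.g. "university"
    study_type: str                 # "STEM", "Non-STEM", or "mixed"
    data_collection_duration: str   # e.g. "one semester", "one year"
    n_students: int
    data_sources: list[str] = field(default_factory=list)  # features used as predictors
    feature_descriptions: str = ""
    target_variable: str = ""       # e.g. "final grade", "dropout"
    prediction_method: str = ""     # e.g. "logistic regression"
    results_format: str = ""        # "table", "graph", or "text"
    result_screenshots: list[str] = field(default_factory=list)  # paths to JPEG/PNG captures
</preformat>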
      </sec>
      <sec id="sec-2-3">
        <title>2.3. LLM prompting and response collection</title>
        <p>A prompt was created to cover each of the 136 studies. We followed the Basic Prompting technique
[13], that is, supplying the LLM with the input and a request without giving it examples of the
expected output as a guide. The prompt was kept general and free of suggestive language, so as
not to prime the output towards a specific shape or form or anchor it to any specific
criteria. ChatGPT-4 was used to create the recommendations not only because it is one of the most
advanced and most commonly used models but also because it accepts multiple images as input. After creating the
prompts, each prompt was supplied to ChatGPT-4 in a separate conversation to avoid
contextual overlap. Afterwards, the responses were collected as text to facilitate their evaluation. The
format of the prompt is shown in Table 1.</p>
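        <p>The prompts were issued through the ChatGPT interface; purely as a hedged sketch, a comparable request with a result screenshot attached could be sent programmatically as shown below. The model name, file path, and prompt wording are assumptions and do not reproduce the exact prompt from Table 1.</p>
        <preformat>
# Illustrative sketch of sending one prompt with a result screenshot attached.
# The study used the ChatGPT interface; model name, path, and wording are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

prompt_text = (
    "Based on the attached predictive learning analytics results, "
    "provide recommendations to students."
)
screenshot = "study_results_table.png"  # hypothetical screenshot of one study's results

# Each study is sent in a fresh request, mirroring the separate conversations
# used to avoid contextual overlap.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(screenshot)}"}},
        ],
    }],
)
print(response.choices[0].message.content)
</preformat>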
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Evaluation framework development</title>
        <p>To evaluate the responses, a questionnaire named the Learning Analytics Recommendations
Alignment Questionnaire (LARAQ) was developed, relying on criteria that address general
attributes of recommendation systems and criteria that address learner-related attributes. For the
general criteria, the ResQue framework has been widely used to allow the users of a recommendation
system to subjectively assess it holistically [10]. This framework has been utilized to assess
recommendation systems in domains such as music [14] and movies [15]. Similarly, [16]
developed a framework to evaluate conversational recommendation systems specifically. The
general criteria that were chosen are: Recommendation Accuracy, Presentability (a subjective measure
of how well the recommendation is presented), Justification, Perceived Usefulness, and Consistency &amp; Coherence.</p>
        <p>Learner-related attributes assess not only whether the recommendations are applicable but also how
much they adhere to pedagogical and ethical principles. The criteria are: Implementability &amp;
Practicality, Privacy &amp; Ethicality, Alignment with Learning Theories, Diversity, Equity, Inclusion,
Learner-Centeredness, and Engagement with Higher-Order Cognitive Skills. The criteria in LARAQ were
extracted from different frameworks and evaluations and adapted to the context of education. Two
evaluators independently assessed the responses. For quantitative questions, the average score was
calculated, whereas qualitative questions were evaluated through consensus. Tables 2 and 3 show
each criterion and its respective questions alongside its sources.</p>
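        <p>For the quantitative LARAQ items, the per-criterion score is simply the mean of the two evaluators' ratings; the sketch below shows that aggregation with placeholder criterion names and ratings rather than the study's data.</p>
        <preformat>
# Minimal sketch of averaging the two evaluators' ratings per criterion.
# Criterion names and ratings are placeholders, not the study's data.
ratings_evaluator_1 = {"Recommendation Accuracy": 4, "Justification": 5, "Learner-Centeredness": 3}
ratings_evaluator_2 = {"Recommendation Accuracy": 5, "Justification": 4, "Learner-Centeredness": 3}

averaged = {
    criterion: (ratings_evaluator_1[criterion] + ratings_evaluator_2[criterion]) / 2
    for criterion in ratings_evaluator_1
}
print(averaged)  # {'Recommendation Accuracy': 4.5, 'Justification': 4.5, 'Learner-Centeredness': 3.0}
</preformat>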
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The box plot in Figure 4 shows the evaluations of Questions 1-6 and Questions 9-13. It reveals that
Accuracy, Presentability, Justification, Usefulness, Consistency &amp; Coherence, and Practicality
receive higher median ratings (around 4.0 to 4.5), indicating generally positive assessments. On the
other hand, Diversity, Equity, Inclusion, and Engagement with Higher-Order Cognitive Skills have
lower median ratings (around 2.0 to 3.0). Lastly, Learner-Centeredness shows moderate ratings with
some variability.</p>
      <p>Figure 5 illustrates the frequency of the learning theories with which the recommendations were
aligned. The X-axis lists the learning theories, while the Y-axis shows their frequency, ranging from 0
to around 90. The graph reveals notable variations in the prominence of these theories.
Constructivism, Cognitivism, and Behaviorism are, in that order, the most frequently mentioned
theories, with frequencies of almost 90, around 75, and almost 60, respectively. Motivation theories
follow with slightly above 40. A mid-range cluster includes Social Learning, Humanism, Situated
Learning, Metacognition, and Self-Regulated Learning, all hovering around 20. In contrast, several
learning theories appear much less frequently: Connectivism and Transformative Learning register
frequencies below 5 each. Overall, the data suggest that traditional theories like Behaviorism and
Cognitivism dominate the landscape of learning recommendations, while emerging or specialized
theories are referenced far less often. Lastly, Question 8 showed that an overwhelming majority of
recommendations (97.1%) did not suggest using information protected under the GDPR, while a
very small minority (2.9%) did.</p>
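      <p>The summaries reported in Figures 4 and 5 amount to per-question medians and simple frequency counts; a minimal sketch of how they could be computed is given below, with hypothetical file and column names.</p>
      <preformat>
# Illustrative computation of the per-question medians (Figure 4) and the
# learning-theory frequencies (Figure 5). File and column names are assumptions.
import pandas as pd

scores = pd.read_csv("laraq_scores.csv")       # one row per response, one column per question
print(scores.median(numeric_only=True))        # median rating per question

alignments = pd.read_csv("theory_alignment.csv")  # one row per (response, aligned theory) pair
print(alignments["theory"].value_counts())        # frequency of each learning theory
</preformat>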
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and conclusion</title>
      <p>The recommendations were generally perceived as relevant to the supplied papers, succeeding in
mentioning the most important feature. However, in cases where the number of supplied features
was high, many of the features were neglected due to the LLM's output size limit. The
recommendations were well-structured and easy to read, as well as beneficial and essential for
improving outcomes. The clarity of reasoning was generally good, as a rationale for the
recommendations was mostly present. The suggestions were largely practical, although some
recommendations were vague and hard to implement. The arguments were understandable and the
logic holding the recommendations together was consistent. A noticeable decline appears when
examining the learning-related criteria. These results suggest that the recommendations inadequately
consider the diverse backgrounds and needs of disadvantaged groups and do not foster inclusivity,
as these criteria hold the two lowest medians. Furthermore, the promotion of higher-order cognitive
processes such as synthesis and evaluation was insufficient. Additionally, the recommendations did
not suggest using any sensitive data (according to the GDPR) of the learners if it was not included in
the data of the supplied paper. Finally, the recommendations are generally well-grounded in learning
theories, although, as the results show, a few traditional theories dominate.</p>
      <p>The results suggest that the prompt should be crafted with explicit emphasis on the learner-related
criteria, while the LLM seems to perform well in understanding the task and in formatting the
recommendations logically and aesthetically. Furthermore, in the absence of a description of the
features, the LLM struggled to infer the meaning of some features from their names alone. Instead,
it attempted to guess their meaning from the context, and in many cases it either failed in its
interpretation or took the safe route and omitted such ambiguous features from the
recommendations.</p>
      <p>For future work, we plan to evaluate LLMs fine-tuned with educational datasets. Moreover, we
plan to use raw results and individual predictions for each student, combined with eXplainable AI
methods that provide explanations for the predictions. This approach aims to offer personalized
insights, addressing some of the gaps identified in our current study.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The paper is co-funded by the Academy of Finland (Suomen Akatemia) Research Council for the
project "Towards precision education: Idiographic learning analytics (TOPEILA)", Decision Number
350560 which was received by the last author, and the project "Optimizing Clinical Reasoning in
Time-Critical Scenarios: A data-driven multimodal approach (CRETIC)", Decision Number 360746,
which was received by the third author.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not leveraged Generative AI tools in preparing this manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, “Generative AI,” Bus. Inf. Syst. Eng., vol. 66, no. 1, pp. 111–126, Feb. 2024, doi: 10.1007/s12599-023-00834-7.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Shanahan, “Talking about Large Language Models,” Commun. ACM, vol. 67, no. 2, pp. 68–79, Feb. 2024, doi: 10.1145/3624724.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] C. K. Lo, “What is the impact of ChatGPT on education? A rapid review of the literature,” Educ. Sci., vol. 13, no. 4, p. 410, 2023.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] E. Kasneci et al., “ChatGPT for good? On opportunities and challenges of large language models for education,” Learn. Individ. Differ., vol. 103, p. 102274, Apr. 2023, doi: 10.1016/j.lindif.2023.102274.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. Abu-Rasheed, M. H. Abdulsalam, C. Weber, and M. Fathi, “Supporting Student Decisions on Learning Recommendations: An LLM-Based Chatbot with Knowledge Graph Contextualization for Conversational Explainability and Mentoring,” Jan. 24, 2024, arXiv: arXiv:2401.08517. doi: 10.48550/arXiv.2401.08517.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Liao et al., “LLaRA: Large Language-Recommendation Assistant,” May 04, 2024, arXiv: arXiv:2312.02445. doi: 10.48550/arXiv.2312.02445.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. D. Palma, G. M. Biancofiore, V. W. Anelli, F. Narducci, T. D. Noia, and E. D. Sciascio, “Evaluating ChatGPT as a Recommender System: A Rigorous Approach,” Jun. 04, 2024, arXiv: arXiv:2309.03613. doi: 10.48550/arXiv.2309.03613.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] C. Brooks and C. Thompson, “Predictive modelling in teaching and learning,” Handb. Learn. Anal., pp. 61–68, 2017.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Y. Chang et al., “A Survey on Evaluation of Large Language Models,” ACM Trans. Intell. Syst. Technol., vol. 15, no. 3, pp. 1–45, Jun. 2024, doi: 10.1145/3641289.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Kunstmann, J. Ollier, J. Persson, and F. von Wangenheim, “EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context,” Jul. 09, 2024, arXiv: arXiv:2407.04472. doi: 10.48550/arXiv.2407.04472.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Pu, L. Chen, and R. Hu, “A user-centric evaluation framework for recommender systems,” in Proceedings of the fifth ACM conference on Recommender systems, Chicago Illinois USA: ACM, Oct. 2011, pp. 157–164. doi: 10.1145/2043932.2043962.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” Mar. 27, 2024, arXiv: arXiv:2312.10997. doi: 10.48550/arXiv.2312.10997.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hemmati, “Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks,” Oct. 11, 2023, arXiv: arXiv:2310.10508. doi: 10.48550/arXiv.2310.10508.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Jin, W. Cai, L. Chen, N. N. Htun, and K. Verbert, “MusicBot: Evaluating Critiquing-Based Music Recommenders with Conversational Interaction,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing China: ACM, Nov. 2019, pp. 951–960. doi: 10.1145/3357384.3357923.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] F. Pecune, S. Murali, V. Tsai, Y. Matsuyama, and J. Cassell, “A Model of Social Explanations for a Conversational Movie Recommendation System,” in Proceedings of the 7th International Conference on Human-Agent Interaction, Kyoto Japan: ACM, Sep. 2019, pp. 135–143. doi: 10.1145/3349537.3351899.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Jin, L. Chen, W. Cai, and X. Zhao, “CRS-Que: A User-centric Evaluation Framework for Conversational Recommender Systems,” ACM Trans. Recomm. Syst., vol. 2, no. 1, pp. 1–34, Mar. 2024, doi: 10.1145/3631534.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] B. P. Knijnenburg, M. C. Willemsen, and A. Kobsa, “A pragmatic procedure to support the user-centric evaluation of recommender systems,” in Proceedings of the fifth ACM conference on Recommender systems, Chicago Illinois USA: ACM, Oct. 2011, pp. 321–324. doi: 10.1145/2043932.2043993.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell, “Explaining the user experience of recommender systems,” User Model. User-Adapt. Interact., vol. 22, no. 4–5, pp. 441–504, Oct. 2012, doi: 10.1007/s11257-011-9118-4.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Fazeli, H. Drachsler, M. Bitter-Rijpkema, F. Brouns, W. van der Vegt, and P. B. Sloep, “User-centric evaluation of recommender systems in social learning platforms: accuracy is just the tip of the iceberg,” IEEE Trans. Learn. Technol., vol. 11, no. 3, pp. 294–306, 2017.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. W. Dietz, S. Myftija, and W. Wörndl, “Designing a Conversational Travel Recommender System Based on Data-Driven Destination Characterization,” in RecTour@RecSys, 2019, pp. 17–21. Accessed: Dec. 02, 2024. [Online]. Available: http://www.ec.tuwien.ac.at/rectour2019/wpcontent/uploads/2019/09/RecTour2019_Proceedings.pdf#page=24</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. O. Álvarez Márquez and J. Ziegler, “Hootle+: A Group Recommender System Supporting Preference Negotiation,” in Collaboration and Technology, T. Yuizono, H. Ogata, U. Hoppe, and J. Vassileva, Eds., Lecture Notes in Computer Science, vol. 9848, Cham: Springer International Publishing, 2016, pp. 151–166. doi: 10.1007/978-3-319-44799-5_12.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] B. Loepp, T. Hussein, and J. Ziegler, “Choice-based preference elicitation for collaborative filtering recommender systems,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Toronto Ontario Canada: ACM, Apr. 2014, pp. 3085–3094. doi: 10.1145/2556288.2557069.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] E. C. Ling, I. Tussyadiah, A. Tuomi, J. Stienmetz, and A. Ioannou, “Factors influencing users’ adoption and use of conversational agents: A systematic review,” Psychol. Mark., vol. 38, no. 7, pp. 1031–1051, Jul. 2021, doi: 10.1002/mar.21491.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] D. Gašević, V. Kovanović, and S. Joksimović, “Piecing the learning analytics puzzle: a consolidated model of a field of research and practice,” Learn. Res. Pract., vol. 3, no. 1, pp. 63–78, Jan. 2017, doi: 10.1080/23735082.2017.1286142.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. Fincham, D. Gašević, J. Jovanović, and A. Pardo, “From study tactics to learning strategies: An analytical method for extracting interpretable representations,” IEEE Trans. Learn. Technol., vol. 12, no. 1, pp. 59–72, 2018.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] M. Richardson and M. Healy, “Examining the ethical environment in higher education,” Br. Educ. Res. J., vol. 45, no. 6, pp. 1089–1104, Dec. 2019, doi: 10.1002/berj.3552.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Cerratto Pargman and C. McGrath, “Mapping the ethics of learning analytics in higher education: A systematic literature review of empirical research,” J. Learn. Anal., vol. 8, no. 2, pp. 123–139, 2021.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] S. Knight and S. B. Shum, “Theory and learning analytics,” Handb. Learn. Anal., vol. 1, pp. 17–22, 2017.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] D. Gašević, S. Dawson, and G. Siemens, “Let’s not forget: Learning analytics are about learning,” TechTrends, vol. 59, no. 1, pp. 64–71, Jan. 2015, doi: 10.1007/s11528-014-0822-x.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] A. Woolfolk Hoy, H. A. Davis, and E. M. Anderman, “Theories of Learning and Teaching in TIP,” Theory Pract., vol. 52, no. sup1, pp. 9–21, Oct. 2013, doi: 10.1080/00405841.2013.795437.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] L. Corsino and A. T. Fuller, “Educating for diversity, equity, and inclusion: A review of commonly used educational approaches,” J. Clin. Transl. Sci., vol. 5, no. 1, p. e169, 2021.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] P. Jurado de los Santos, A.-J. Moreno-Guerrero, J.-A. Marín-Marín, and R. Soler Costa, “The Term Equity in Education: A Literature Review with Scientific Mapping in Web of Science,” Int. J. Environ. Res. Public. Health, vol. 17, no. 10, Art. no. 10, Jan. 2020, doi: 10.3390/ijerph17103526.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] M. Khalil, S. Slade, and P. Prinsloo, “Learning analytics in support of inclusiveness and disabled students: a systematic review,” J. Comput. High. Educ., vol. 36, no. 1, pp. 202–219, Apr. 2024, doi: 10.1007/s12528-023-09363-4.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] C. Magno and J. Sembrano, “Integrating Learner Centeredness and Teacher Performance in a Framework,” Int. J. Teach. Learn. High. Educ., vol. 21, no. 2, pp. 158–170, 2009.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] R. Collins, “Skills for the 21st Century: teaching higher-order thinking”.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>