<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Retrieval-Augmented Chatbots for Scalable Educational Support in Higher Education</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hassan Soliman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hitesh Kotte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miloš Kravčík</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norbert Pengel</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nghia Duong-Trung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
          ,
          <addr-line>Alt-Moabit 91C, 10559 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IU International University of Applied Sciences</institution>
          ,
          <addr-line>Frankfurter Allee 73A, 10247 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Dittrichring 5-7, 04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Students of educational sciences participate in learning activities, where appropriate support and timely feedback are crucial. However, providing scalable, personalized, and timely support becomes a major challenge. This work focuses on developing a didactic chatbot based on a Large Language Model (LLM) and enhancing its potential with existing learning materials. Retrieval Augmented Generation (RAG) allows the system to provide comprehensive, context-aware answers to specific course questions. Previous results suggested that it is possible to distinguish between different contexts in which students work and provide them with prompt responses that consider the relevant material. This paper presents insights from the technical implementation and the first results on the quality of LLM-based chatbot responses to content and organizational questions in an educational science module for student teachers. We compare previous automated evaluations using GPT-4 with newly conducted human evaluations of chatbot-generated results. Our experiments showed that the chatbot achieved a correct response rate of up to 87%. Furthermore, human evaluations conducted by five expert annotators assessed the chatbot's responses. The agreement between the majority vote of these human judges and the GPT-4 evaluation showed substantial alignment. This study helps to demonstrate the potential of generative AI in the delivery of digitally supported courses.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Chatbot</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Scalable Mentoring</kwd>
        <kwd>Higher Education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Providing individualised assistance and prompt feedback through scalable mentoring is a major
educational challenge. However, new opportunities in digital higher education are being made possible
by the rapid development of computing technologies, especially Artificial Intelligence (AI). Our goal
is to enhance the student learning experience by designing a chatbot that allows for more flexible
and adaptable responses. While previous rule-based systems struggled with adaptability and were
limited to template-driven interactions, LLMs offer the opportunity to generate more nuanced and
contextually aware responses, addressing the dynamic needs of students and accommodating a wide
range of inquiries, from course content to organizational matters.</p>
      <p>In educational science modules and teacher training programs, students benefit from receiving
context-aware responses that address their specific learning needs. LLMs can potentially analyze
existing learning and information materials and process descriptions (e.g., mentoring structure or
feedback systems) to generate responses that go beyond static, predefined answers. This leads to our
central research question: How can an LLM-enhanced chatbot be designed, implemented, and evaluated
to support scalable educational support in higher education? Our focus is on applying LLMs in a
didactically meaningful way to provide students with personalized and contextualized responses via a
web-based interface, thereby promoting self-regulated learning and facilitating mentoring experiences.</p>
      <p>Our paper presents the conceptual foundation, design, and implementation status of this iterative
process, with a particular focus on the technological aspects, such as chatbot design and LLM integration.</p>
      <p>In the subsequent sections, we first discuss related work and explain the pedagogical context. The
main section presents the technical background, including designing and implementing the LLM-based
chatbot prototype. The paper then moves on to the experimental results, discussing the chatbot’s
performance based on human evaluations and automated assessments. Finally, we conclude with
insights from these outcomes and propose future directions to enhance the chatbot’s capabilities further.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        AI has become integral to education, offering solutions for students, teachers, and administrators.
By analyzing extensive data from these groups, AI enhances personalized learning, optimizes
administrative processes, and provides insightful feedback [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among generative AI technologies, LLMs
are particularly impactful, enabling human-like text generation and interactive educational tools [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
These models underpin sophisticated educational chatbots that engage in meaningful conversations as
teachers, learners, guides, or mentors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Early educational chatbots relied on rule-based or template-driven systems, using predefined
responses and basic Natural Language Understanding (NLU) techniques to interact with users [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For
example, chatbots built with the RASA framework utilized NLU models to classify user intents and
recognize entities. However, these approaches suffered from limited flexibility and contextual awareness,
leading to rigid responses that struggled with dynamic or complex queries, resulting in user frustration
and reduced engagement [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To address these limitations, recent research has leveraged LLMs to
develop more adaptable and context-aware conversational agents. LLM-based chatbots generate
nuanced and relevant responses by understanding user context and intent, thereby enhancing the overall
user experience [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This shift represents a significant advancement in educational chatbots’ ability to
support deeper and more meaningful student interactions.
      </p>
      <p>
        RAG approaches have recently emerged as a promising solution to enhance educational chatbot
performance by combining traditional document retrieval with LLMs’ generative capabilities, resulting
in more informed and contextually relevant responses [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This hybrid method allows chatbots to
utilize extensive educational content repositories, ensuring coherent and accurate information. In
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], AI-powered chatbots were used to scale mentoring support in higher education, providing 24/7
assistance, answering FAQs, and offering personalized feedback. Further research implemented chatbots
in large-scale settings with over 700 students [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], showing that chatbots significantly supported
self-study and alleviated traditional mentoring resource constraints. Additionally, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced a RAG
approach for academic environments, demonstrating that integrating document retrieval with LLMs
enhances information access efficiency and relevance, thereby creating more effective educational
assistants. Similarly, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] developed MoodleBot, an LLM-driven chatbot integrated into the Moodle
Learning Management System (LMS) to support self-regulated learning. The study involving 46 students
revealed an 88% accuracy rate in course-related assistance and positive student acceptance, highlighting
LLM-based chatbots’ potential to enhance higher education despite challenges like bias, hallucinations,
and resistance to AI technologies.
      </p>
      <p>
        Despite advancements, deploying retrieval-augmented chatbots in education faces several challenges.
Organizational and pedagogical issues, such as ensuring data privacy, maintaining information quality,
and aligning chatbot responses with educational objectives, are critical [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Additionally, scaling these
systems across diverse educational environments and adapting to various instructional styles and
curricula remain ongoing challenges. Moreover, as highlighted in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], LLM-driven chatbots must ensure
response accuracy, manage potential biases and hallucinations, and overcome educators’ resistance
to new AI technologies. The study underscores the need for robust fact-checking mechanisms and
alignment of chatbot responses with course content to preserve educational integrity. Keeping indexed
materials current and reflective of course content is essential for maintaining chatbot reliability and
effectiveness.
      </p>
      <p>
        The efectiveness of RAG-based chatbots fundamentally depends on the quality of their retrieval
processes. In our earlier work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] we utilized basic RAG techniques to facilitate information retrieval.
Building upon this prototype, we further refined and expanded our approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], experimenting with
a curated evaluation dataset and introducing hybrid ensemble retrievers that combine different methods
(e.g., keyword-based and semantic similarity searches) to optimize the retrieval of relevant information
from large datasets. This enabled more accurate retrieval of relevant content from course materials,
improving the chatbot’s overall performance. However, we identified limitations in response relevance
and depth, which prompted us to explore more advanced methods to enhance performance.
      </p>
      <p>To overcome these challenges, we incorporated reranker models into the retrieval pipeline. Rerankers
analyze the initially retrieved chunks and reorder them based on their relevance to the user’s query,
significantly improving the precision of the retrieved context and enabling more accurate, contextually
relevant responses. Additionally, we conducted extensive evaluations to assess the impact of reranker
models. Moreover, we compared the chatbot’s performance using automated GPT-4 evaluations with
human evaluations by domain experts. These human evaluations provided insights into the agreement
and discrepancies between human judgment and machine assessments, essential for understanding the
potential and limitations of LLM-based chatbots in education. Overall, integrating reranker models
and a comprehensive evaluation approach represents a significant advancement in developing scalable,
intelligent chatbots. By enhancing retrieval precision and thoroughly assessing chatbot performance,
we advance the creation of more effective and reliable educational tools.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Design and Implementation</title>
      <p>
        From a didactic perspective, we focus on self-regulated learning, mentoring, and counseling, with
mentoring being an effective way to support learning [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For example, supporting students in an
advisory capacity helps clarify problematic situations. A dyadic mentor-mentee relationship [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is ideal
but often unattainable due to limited resources, posing the challenge of scaling mentoring processes.
This requires an integrated environment with various facilities, where the chatbot serves as a permanent
virtual contact [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], fulfilling dual roles as an expert and learning companion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. As an expert, the
chatbot answers questions about course content and organization. As a learning companion and mentor,
it supports the individual learning process with feedback on submitted writing tasks, encouraging
students to plan, monitor, and reflect on their learning. The BiWi (course acronym) AI Tutor addresses
the scalability challenge by primarily acting as an expert on course material and answering students’
questions about content and organizational information. The latest version of the chatbot includes the
mentoring module (psychosocial support), which was not available in our evaluations.
      </p>
      <sec id="sec-3-1">
        <title>3.1. LLM Based Prototype</title>
        <p>The BiWi AI Tutor, an LLM-based chatbot prototype, provides scalable learning support by
retrieving knowledge from lecture slides, seminar texts, and organizational materials for a German-taught
university-level Education Science course. Utilizing OpenAI's GPT-3.5-turbo model
(https://platform.openai.com/docs/models/gpt-3-5-turbo) and the LangChain library
(https://www.langchain.com/), the chatbot offers responsive, contextually aware dialogic interactions (see Figure
1). Based on LangChain’s Function Calling Agent, it dynamically selects relevant tools or data according
to contextual needs. It determines when to select tools and appropriate context materials for a query,
feeding the results back to the agent to determine subsequent steps. This iterative loop enables dynamic,
context-sensitive interactions and handles multi-question queries.</p>
        <p>The chatbot’s retriever mechanism is crucial for selecting the most relevant learning material for a
given query. For instance, the query "What are the main points of lecture 1 and lecture 3?" is split into
two sub-queries: "main points of lecture 1" and "main points of lecture 3" as shown in Figure 2 (originally
in German). This allows the retrieval of the most pertinent material chunks for each sub-query. Using
the LangSmith library for observability, the retrieved chunks for each sub-query can be displayed,
ranked by relevance. The agent then formulates a comprehensive final answer by combining
the retrieved materials. The retriever employs semantic similarity and keyword matching to locate
relevant content in real time, streamlining content selection for user queries. This integration of an
LLM's reasoning and generative capabilities with retrieval systems efficiently locates relevant learning
content, a process known as RAG in the literature.</p>
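<p>As a minimal illustration of this decomposition-and-retrieval idea (not the production agent: the naive "and"-splitter and the keyword-overlap retriever below are toy stand-ins for the LLM-driven decomposition and the hybrid retriever), the flow can be sketched as:</p>

```python
# Illustrative sketch: split a multi-part question into sub-queries, retrieve
# the best-matching chunk for each, and merge the results into one context.
import re

def split_into_subqueries(question: str) -> list:
    # Naive decomposition stand-in: split on "and" between enumerable items.
    parts = re.split(r"\band\b", question)
    return [p.strip(" ?") for p in parts if p.strip(" ?")]

def retrieve(subquery: str, corpus: dict, k: int = 1) -> list:
    # Toy retriever: score each document by shared terms with the sub-query.
    q_terms = set(subquery.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_terms.intersection(kv[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

corpus = {
    "lec1": "lecture 1 main points introduction to educational science",
    "lec3": "lecture 3 main points mentoring and feedback",
    "org":  "exam date and organizational information",
}

subqueries = split_into_subqueries("main points of lecture 1 and main points of lecture 3")
context = [chunk for sq in subqueries for chunk in retrieve(sq, corpus)]
```

<p>Each sub-query pulls its own top-ranked material, and the merged context is what the agent would pass to the LLM for the final answer.</p>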
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Learning Material Indexing and Chatbot Interaction Flow</title>
        <p>To enable the chatbot to provide accurate and contextually relevant answers, we implemented a
comprehensive indexing and retrieval process of the course materials, alongside a well-defined interaction
flow for the chatbot. These processes are illustrated in Figure 3 and Figure 4. The steps in Figure 3 serve
as the foundational processes that support the interaction flow in Figure 4. In addition to the indexing
and interaction processes, we employed a basic system prompt. This prompt defines the chatbot’s
role as a tutor for the course, describes the available course materials, and directs it to answer students'
questions in German. This setup ensures that the chatbot maintains a consistent and helpful persona
while interacting with students.</p>
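<p>The wording below is hypothetical (the paper does not reproduce the actual BiWi prompt), but it illustrates the kind of role-defining system prompt described above, placed at the head of a chat-completion message list:</p>

```python
# Hypothetical system prompt illustrating the tutor persona described in the
# text; the real prompt's wording differs.
SYSTEM_PROMPT = (
    "You are a tutor for an educational science course. "
    "You can consult lecture slides, seminar texts, and organizational "
    "materials through a course-material tool. "
    "Always answer students' questions in German, in a consistent, helpful tone."
)

# Chat-completion style message list: the system prompt precedes user turns.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Wann ist die Klausur?"},
]
```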
        <sec id="sec-3-2-1">
          <title>3.2.1. Learning Material Indexing and Retrieval Process</title>
          <p>The indexing and retrieval process involves the following steps:
1. Indexing Course Material: Collect and prepare course materials, including lecture,
seminar, and organizational PDFs, for indexing.
2. Text Parsing: Utilize the Llama Parser module from the LlamaIndex library
(https://www.llamaindex.ai/) to parse PDF files into structured formats like Markdown,
simplifying processing and enhancing compatibility with LLMs.
3. Text Chunking: The parsed text is divided into manageable chunks of 1024 tokens with an
overlap of 20 tokens. This choice was based on preliminary experiments showing that
1024 tokens provided an optimal balance between maintaining sufficient context and ensuring
processing efficiency. The 20-token overlap helps preserve continuity between chunks, reducing
the likelihood of losing critical contextual information that spans chunk boundaries.
4. Course Material Indexing: Organize chunks into:
a) Vector Index: Generate embeddings using OpenAI's "text-embedding-3-large" model
(https://platform.openai.com/docs/guides/embeddings/embedding-models) for
semantic retrieval, storing them in a vector database.
b) BM25 Index: Apply the BM25 algorithm for keyword-based retrieval based on term
frequency and inverse document frequency.
5. Query Translation: Preprocess and translate user queries into a suitable format for the retrieval
systems, potentially involving language translation or keyword extraction.
6. Semantic Retrieval: Embed the query using the same embedding model to create a vector
representation and retrieve semantically similar chunks from the Vector Index.
7. Keyword Retrieval: Use the BM25 Index to extract chunks containing relevant keywords from
the query.
8. Retrieve Top Similar Chunks: Combine the top 50 chunks from both semantic and keyword
retrievals, totaling 100 candidate chunks.
9. Context Reranking: To refine the retrieved context, we employ the Cohere reranker model
("cohere-rerank-v3.0"). The reranker re-evaluates the 100 candidate chunks based on their
relevance to the query and selects the top 5 most relevant chunks.
10. Relevant Context Extraction: Utilize the top-ranked chunks as the relevant context for
generating the chatbot's response.</p>
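<p>The pipeline above can be sketched in miniature. This is an illustrative stand-in, not the production code: a plain term-frequency score replaces both the embedding similarity and BM25, and a re-scoring pass replaces the Cohere reranker; only the pipeline shape (chunk with overlap, gather candidates, rerank, keep the top chunks) mirrors the steps listed:</p>

```python
# Toy version of the indexing-and-retrieval pipeline: fixed-size chunking with
# overlap, candidate gathering, and a reranking pass that keeps the top chunks.
from collections import Counter

def chunk(tokens: list, size: int = 8, overlap: int = 2) -> list:
    # Overlapping fixed-size chunks (the paper uses 1024 tokens, 20 overlap).
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def score(query: list, doc: list) -> float:
    # Term-frequency stand-in for the BM25 / embedding-similarity scores.
    tf = Counter(doc)
    return float(sum(tf[t] for t in query))

def rerank(query: list, candidates: list, top_n: int = 1) -> list:
    # Reranker stand-in: re-score every candidate and keep the best top_n.
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]

tokens = ("the exam takes place on tuesday july ninth at one pm "
          "lecture one introduces educational science foundations").split()
chunks = chunk(tokens)
query = "exam tuesday".split()
# Step 8: gather candidate chunks; step 9: rerank; step 10: extract context.
candidates = sorted(chunks, key=lambda d: score(query, d), reverse=True)[:4]
top_context = rerank(query, candidates, top_n=1)
```

<p>In the real system the candidate pool is the union of the top-50 semantic and top-50 BM25 hits, and the reranker keeps the top 5 chunks rather than one.</p>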
          <p>This indexing and retrieval process ensures that the chatbot accesses the most pertinent sections of
the course material, enabling accurate and context-aware answers. Statistics for the learning material
are shown in Table 1.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Chatbot Interaction Flow</title>
          <p>Building upon the indexing and retrieval process, the chatbot's interaction flow is designed for a
seamless and contextually rich user experience, as illustrated in Figure 4. The interaction flow involves
the following steps:
1. User's Question: The user submits a question to the chatbot, e.g., "Wann ist die Klausur?" (When
is the exam?).
2. Retrieve Conversation History: Retrieve the past 10 messages from the conversation
history to provide context.
3. Get Tool's Description: Consider descriptions of available tools (e.g., the course material tool)
to decide their usage in the response.
4. Decision to Use Tool: Determine whether to answer based on conversation history or utilize
the course material tool (indexed course materials):
a) Option A: If sufficient information exists in the conversation history, respond directly.
• Example: "Laut dem Gespräch ist der Klausurtermin am Dienstag, den 09.07.2024, um
13.00 Uhr." (Based on the conversation, the exam date is Tuesday, July 9, 2024, at 1:00 PM.)
b) Option B: If additional information is needed, use the course material tool.
5. Query Translation: Translate the user's query or extract relevant keywords to facilitate retrieval.
• Example: Extracting "Klausurtermin" (exam date) from the query.
6. Relevant Context Extraction: Invoke the retrieval process from the previous subsection to obtain
relevant context from the indexed materials. This involves semantic and keyword retrieval, context
reranking, and extracting the top chunks.
7. Chatbot's Answer: Generate a response using the retrieved context.
• Tool-based Response: "Der Klausurtermin ist am Dienstag, den 09.07.2024, um 13.00 Uhr."
(The exam date is Tuesday, July 9, 2024, at 1:00 PM.)
• Direct Response: If not using the tool, respond based on conversation history.
8. Store Conversation History: Update the conversation history with the user's query and the
chatbot's response to maintain context for future interactions.</p>
          <p>By integrating the indexing and retrieval process with the interaction flow, the chatbot effectively
serves as an expert on the course material, supporting students in their learning journey. The
decision-making process allows the chatbot to handle queries efficiently, providing direct answers when possible
and accessing the broader course material knowledge base when necessary.</p>
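<p>Assuming simplified helpers (the longest-word keyword extractor and the dictionary-backed course-material tool below are toy stand-ins for the real query translation and retrieval pipeline), the decision step of this flow can be sketched as:</p>

```python
def extract_keyword(question: str) -> str:
    # Step 5 stand-in: pretend the salient keyword is the longest word.
    return max(question.lower().rstrip("?").split(), key=len)

def answer(question: str, history: list, course_tool) -> str:
    recent = history[-10:]  # step 2: last 10 messages of conversation history
    hits = [m for m in recent if extract_keyword(question) in m.lower()]
    if hits:  # Option A: sufficient information already in the history
        return "Laut dem Gespräch: " + hits[-1]
    return course_tool(extract_keyword(question))  # Option B: use the tool

def course_tool(keyword: str) -> str:
    # Toy course-material tool standing in for the indexed-retrieval pipeline.
    facts = {"klausur": "Der Klausurtermin ist am Dienstag, den 09.07.2024, um 13.00 Uhr."}
    return facts.get(keyword, "Dazu liegen keine Informationen vor.")

history: list = []
first = answer("Wann ist die Klausur?", history, course_tool)   # uses the tool
history.append(first)
second = answer("Wann ist die Klausur?", history, course_tool)  # from history
```

<p>The first query falls through to the tool; once its answer is stored, the same question is served directly from the conversation history.</p>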
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>The evaluation of the BiWi AI Tutor chatbot utilized a dataset comprising questions derived from the
course materials, corresponding true answers, and the chatbot’s generated responses. These
question-answer pairs were developed by the instructors of the educational science course to assess the chatbot's
ability to generate accurate and relevant answers. In terms of evaluation methodology, we believe it is
important to take into account the learning objectives of the course authors, especially in a domain
where expert agreement is not always easy to obtain. The dataset was curated to reflect the diversity
of learning materials, including lecture slides, seminar readings, and organizational information, and
consisted of 60 questions evenly distributed across the three categories.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluating Chatbot Responses</title>
        <p>To assess the performance of the BiWi AI Tutor chatbot, two evaluation methods were employed:
manual evaluation using human annotators and automated evaluation using GPT-4 from OpenAI
(https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo). This
dual approach provided a comprehensive understanding of the chatbot's accuracy and reliability.
Manual Evaluation Using Human Annotators. Five human raters, all domain experts and
instructors of the course, independently evaluated the chatbot's responses. Each evaluator reviewed the same
set of 60 questions and scored the chatbot's answers as either correct (1) or incorrect (0). The majority
vote among the five raters was calculated for each response to provide a consensus judgment.
Automated Evaluation Using GPT-4. The second evaluation method utilized the GPT-4 model to
assess the correctness of the chatbot's answers. We employed the Question Answer (QA) evaluation
prompt from the LangChain library to judge the factual accuracy of the chatbot's responses, disregarding
differences in style, wording, and format. Each response was graded as either correct (1) if factually
accurate or incorrect (0) otherwise.</p>
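<p>The human side of this setup reduces to a per-question majority vote over five binary ratings; a minimal sketch (with invented ratings) is:</p>

```python
# Majority vote over five binary (0/1) ratings per question, as used for the
# human consensus judgment. The ratings below are invented for illustration.
def majority_vote(ratings: list) -> int:
    # With an odd number of raters a strict majority always exists.
    return 1 if sum(ratings) * 2 > len(ratings) else 0

human = {
    "q1": [1, 1, 1, 0, 1],  # four of five raters say correct
    "q2": [0, 1, 0, 0, 1],  # only two of five say correct
}
consensus = {q: majority_vote(r) for q, r in human.items()}
accuracy = sum(consensus.values()) / len(consensus)
```

<p>GPT-4's binary grades can then be compared question by question against this consensus.</p>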
        <p>Addressing Potential Bias. We acknowledge that having the course instructors both develop the
evaluation dataset and serve as evaluators may introduce potential bias, as they are familiar with the
expected answers and may have subconscious expectations about the chatbot's performance. However,
involving external domain experts was not feasible due to resource constraints and the specialized
nature of the course content. To mitigate bias:
• Independent Evaluations: Each evaluator assessed the responses independently to reduce
groupthink and collective bias.
• Clear Evaluation Criteria: Evaluators used a binary grading system focused solely on factual
accuracy, minimizing subjective interpretations.
• Inter-Rater Reliability Analysis: We calculated Fleiss' Kappa to assess agreement levels,
highlighting variability and reducing overconfidence in the results.
Evaluation Results. The results from both the human majority vote and GPT-4's judgments are
summarized in Table 2. The table presents the percentage of correct answers as determined by both
evaluations for each category.</p>
        <sec id="sec-4-1-1">
          <title>Inter-Rater Reliability and Agreement Analysis</title>
          <p>To evaluate the consistency among human raters, we calculated Fleiss' Kappa scores for each category:
• Lecture Questions: Fleiss’ Kappa = 0.37 (fair agreement)
• Seminar Questions: Fleiss’ Kappa = 0.47 (moderate agreement)
• Organizational Questions: Fleiss’ Kappa = 0.15 (slight agreement)</p>
          <p>We also calculated Cohen’s Kappa to assess the agreement between GPT-4’s judgments and each
human evaluator, as well as the majority vote. The results are presented in Table 3.</p>
          <p>The varying levels of agreement reflect individual differences among evaluators. The substantial to
perfect agreement between GPT-4 and the majority vote indicates that GPT-4’s assessments align well
with collective human judgment.</p>
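<p>For reference, the agreement between GPT-4 and the majority vote is a standard two-rater Cohen's kappa over binary labels; computed from scratch (with invented label vectors) it looks like:</p>

```python
# Cohen's kappa for two binary raters: observed agreement corrected for the
# agreement expected by chance. The label vectors are invented for illustration.
def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                  # marginal rates of "1"
    pe = pa * pb + (1 - pa) * (1 - pb)               # chance agreement
    return (po - pe) / (1 - pe)

gpt4_grades    = [1, 1, 0, 1, 0, 1, 1, 0]
majority_votes = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(gpt4_grades, majority_votes)
```

<p>The toy vectors give a kappa of about 0.71, which the conventional Landis-Koch scale (0.61 to 0.80) reads as substantial agreement.</p>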
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluating the Effect of Using Rerankers</title>
        <p>The previous evaluations were conducted without rerankers as the human annotations were gathered
before the reranking mechanism was adopted. To investigate the impact of using rerankers, we
conducted additional experiments comparing GPT-3.5-turbo’s performance with and without rerankers.
As seen in Table 4, the reranking mechanism provided a noticeable improvement, particularly in
the organizational questions, where accuracy reached 100%. The reranker-based approach leverages
semantic re-ranking to filter and improve the retrieval of the most contextually relevant text fragments,
leading to better overall answer quality. Although the overall correct response rate was high, certain
types of questions, especially open-ended ones, such as those related to seminars and lectures, exhibited
greater variability in responses. This reflects the inherent complexity of interpreting such data, leading
to lower scores compared to the organizational questions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In prior works, educational chatbots primarily utilized template or rule-based systems to address
students’ questions. While effective for Frequently Asked Questions (FAQs), these systems lacked
flexibility and adaptability, with pre-defined responses leading to static, context-insensitive interactions.
With evolving technology, LLMs offer a dynamic alternative, enabling chatbots to generate flexible and
deeply contextualized responses. This shift from rule-based to LLM-powered chatbots significantly
enhances personalized and nuanced student conversations, accommodating more complex queries. The
BiWi AI Tutor chatbot exemplifies an LLM-based system that efficiently retrieves information from
sources like lecture slides, seminar materials, and organizational documents. Utilizing a Function Calling
Agent from the LangChain library, it accesses specific tools to retrieve relevant material for any query.
The system combines generative capabilities with an advanced RAG approach by processing material
chunks into embeddings stored in a vector database, facilitating retrieval based on semantic similarity.
Additionally, a reranker model filters and prioritizes the most relevant chunks, enhancing information
precision and ensuring accurate, targeted responses. The chatbot iteratively refines its answers by
dynamically selecting appropriate material chunks, guaranteeing that students receive precise and
relevant information. Our experiments provide valuable insights into the chatbot’s performance.
Comparing GPT-4 with human evaluators, the chatbot consistently delivered correct answers, closely
aligning with most human judgments. For organizational questions, there was perfect agreement
between the chatbot and majority human evaluations. Introducing rerankers further enhanced accuracy,
achieving 100% for organizational content, which underscores the rerankers’ effectiveness in filtering
retrieved material and improving overall response quality.</p>
      <p>Future enhancements can explore multiple dimensions. From a use-case perspective, developing
mentoring-style chatbots that provide factual answers while responding to students’ emotional needs
could involve routing questions to different models based on query nature and support type. However,
addressing the psychological aspects and scalability introduces ethical considerations, recognizing that
machines may sometimes require human expert involvement. For evaluation, implementing a student
feedback mechanism with ratings on a 0-5 scale could assess mentoring effectiveness. Additionally,
future iterations could generate personalized learning pathways based on interaction history, allowing
the chatbot to adapt to individual learning preferences. From a safety standpoint, the chatbot must
include privacy guardrails to filter sensitive or inappropriate data before retrieval. As LLMs become
more integral to education, addressing potential biases and ensuring AI decisions are transparent and
explainable to both students and educators is crucial. Lastly, scalability and openness are essential for
wider adoption, enabling educators to deploy customized chatbot versions by uploading their materials
and configuring custom instructions. Future work will also benchmark state-of-the-art open-source
LLMs against proprietary models like those from OpenAI to determine the best fit for this use case.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research leading to these results has received funding from the German Federal Ministry of
Education and Research (BMBF) through the project “Personalisierte Kompetenzentwicklung und
hybrides KI-Mentoring” (tech4compKI) (grant no. 16DHB2206, 16DHB2208).
The authors declare that Generative AI tools have not been used during manuscript preparation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kizilcec</surname>
          </string-name>
          ,
          <article-title>To advance ai use in education, focus on understanding educators</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>F. H.</surname>
          </string-name>
          et al.,
          <article-title>The world of generative ai: Deepfakes and large language models</article-title>
          , arXiv preprint (
          <year>2024</year>
          ). URL: https://ar5iv.labs.arxiv.org/html/2402.04373v1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharples</surname>
          </string-name>
          ,
          <article-title>Towards social generative ai for education: theory, practices and ethics</article-title>
          ,
          <source>Learning: Research and Practice</source>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>159</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neumann</surname>
          </string-name>
          , P. de Lange, R. Klamma,
          <article-title>Collaborative creation and training of social bots in learning communities</article-title>
          ,
          <source>in: 2019 IEEE 5th International Conference on Collaboration and Internet Computing (CIC)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Köbis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meissner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martin</surname>
          </string-name>
          , P. de Lange,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pengel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Klamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wollersheim</surname>
          </string-name>
          ,
          <article-title>Chatbots as a tool to scale mentoring processes: Individually supporting self-study in higher education</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>64</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Maryamah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Irfani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B. T.</given-names>
            <surname>Raharjo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Rahmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Raharjana</surname>
          </string-name>
          ,
          <article-title>Chatbots in academia: a retrieval-augmented generation approach for improved efficient information access</article-title>
          ,
          <source>in: 16th Int. Conference on Knowledge and Smart Technology (KST)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Klamma</surname>
          </string-name>
          , P. de Lange,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kravcik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuzilek</surname>
          </string-name>
          ,
          <article-title>Scaling mentoring support with distributed artificial intelligence</article-title>
          ,
          <source>in: International Conference on Intelligent Tutoring Systems</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <article-title>An llm-driven chatbot in higher education for databases and information systems</article-title>
          ,
          <source>IEEE Transactions on Education</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Soliman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kravcik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pengel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haag</surname>
          </string-name>
          ,
          <article-title>Scalable mentoring support with a large language model chatbot</article-title>
          ,
          <source>in: Technology Enhanced Learning for Inclusive and Equitable Quality Education</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Soliman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kravcik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pengel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haag</surname>
          </string-name>
          , H.-W. Wollersheim,
          <article-title>Generative ki zur lernenbegleitung in den bildungswissenschaften: Implementierung eines llm-basierten chatbots im lehramtsstudium</article-title>
          ,
          <source>in: Proceedings of DELFI</source>
          <year>2024</year>
          , Gesellschaft für Informatik eV,
          <year>2024</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Eby</surname>
          </string-name>
          , E. Dolan,
          <article-title>Mentoring in postsecondary education and organizational settings</article-title>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <article-title>Mentoring: konzeptuelle grundlagen und wirksamkeitsanalyse</article-title>
          , in:
          <source>Mentoring: Theoretische hintergründe, empirische befunde und praktische anwendungen</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dibitonto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leszczynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Medaglia</surname>
          </string-name>
          ,
          <article-title>Chatbot in a campus environment: Design of lisa, a virtual assistant to help students in their university life</article-title>
          , in: 2018 International Conference on Human-Computer Interaction, volume
          <volume>10903</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>116</lpage>
          . doi:10.1007/978-3-319-91250-9_9.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dyrna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulze-Achatz</surname>
          </string-name>
          ,
          <article-title>Wann ist lernen mit digitalen medien (wirklich) selbstgesteuert? ansätze zur ermöglichung und förderung von selbststeuerung in technologieunterstützten lernprozessen</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>