<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Digital Textbook and Classroom Data to Explore Multimodal (Audio, Visual, &amp; Textual) LLM Retrieval Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brian Wright</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vishwanath Guruvayur</string-name>
          <email>vish@virginia.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luke Napolitano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Doruk Ozar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Rivera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ananya Sai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bereket Tafesse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Virginia, School of Data Science</institution>
          ,
          <addr-line>Charlottesville, VA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The use of digital content to support classroom learning is evolving rapidly. Retrieval Augmented Generation (RAG), as an approach to augmenting Large Language Models (LLMs), has emerged as a powerful framework for grounding generation in trusted content. In the educational context, these are materials sourced by professors/teachers for specific courses. Although RAG systems traditionally rely on textual input, modern digital textbooks often include a blend of modalities such as course slides, video lectures, and other interactive content containing both textual and visual information. In this project, we investigate the role of multimodal retrieval in an educational context, using digital textbooks and other multimodal course data to build an intelligent assistant. We embed and store textual and visual components from an undergraduate machine learning course in a vector database and use them to enhance chatbot responses. Through several versions of text-only and multimodal Large Language Models, and evaluation metrics such as Context Recall, Faithfulness, and Factual Correctness, we examine how supplementing text with images impacts retrieval and response quality. Our findings show that multimodal input significantly improves factual correctness for complex or specific questions, although excessive image inclusion may reduce performance. Conversely, image inclusion does not provide gains on more generic questions. We propose an agent-based RAG system that dynamically selects relevant vectors based on query specificity.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Chatbot</kwd>
        <kwd>RAG</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent advances in Large Language Models (LLMs) have catalyzed interest in the application of
generative models to educational tools. However, standard LLMs lack awareness of the multimodal data
prevalent in classrooms. This includes interactive elements of textbooks, slide visuals and lecture
recordings. One promising approach to addressing this limitation is Retrieval Augmented Generation
(RAG), which enables models to incorporate external knowledge into the generative process.</p>
      <p>Rather than relying solely on pre-trained outputs, RAG retrieves relevant documents from an external
corpus based on a user’s query. In an educational context, this includes material sourced or generated
by professors/teachers. These augmented documents can be used to aid in the response generation
process. This approach could prove to be particularly valuable in educational settings, where a chatbot
can generate responses that align closely with the specific content and instructional level of any given
course. Although RAG typically relies on text-based retrieval, we explore its extension to include
both text and image embeddings from digital educational materials, as seen in Appendix A.3 figure 6 and
Appendix A.4 figure 7.</p>
      <p>This is an exploratory study designed to build a deeper understanding of potential technical
approaches for developing an LLM-based Intelligent Assistant to support students, with a focus on
self-regulated learning. Our goal was first to understand the utility of a RAG-based approach
with data from open-source textbooks and lecture materials. This was followed by an exploration of
whether the incorporation of visual content enhances the generation of educational responses and under
what circumstances it may hinder it. The RAG-bot is specifically designed for an undergraduate course,
Foundations of Machine Learning, taught at the University of Virginia School of Data Science.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        During the last decade, the inclusion of AI-driven education tools has increased dramatically, resulting
in the maturation of a new field: Artificial Intelligence in Education (AIEd) [
        <xref ref-type="bibr" rid="ref11 ref5">10</xref>
        ]. The rise in the presence
of AI in global society and its emergence in our daily lives has not only produced the development of
additional educational tools but has also driven the need for the creation of a new literacy. Growing out
of Data Literacy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], AI Literacy is maturing to the point of being referenced in educational programs
and research [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ]. In addition, the belief that AI has the potential to continue to transform how we
communicate, consume information, learn, and interact in society seems like a foregone conclusion.
Consequently, the need to measure the effectiveness of teaching methods in pursuit of AI literacy is
currently high, with additional research still needed. Ouyang and co-authors noted this point
in their construction of an AI literacy framework through a meta-analysis of papers spanning several
disciplines [
        <xref ref-type="bibr" rid="ref11 ref5">10</xref>
        ]. The authors further suggest this is especially true of courses in Data Science oriented
programs designed to teach AI fundamentals, that often pull students from a variety of backgrounds
[
        <xref ref-type="bibr" rid="ref15">13</xref>
        ].
      </p>
      <p>
        AI-driven tools are incorporated into higher education at essentially three
levels: instruction/service, learning, and administration [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Instruction/service-oriented tools help
instructors grade assignments or help students choose courses and identify
university resources, but do not directly aid in knowledge growth. Learning-oriented tools focus mostly
on classroom applications, with the goal of helping students achieve learning outcomes. This may
include tutoring, providing learning materials, facilitating students’ self-guided learning, or intelligent
assistants tailored to course content [
        <xref ref-type="bibr" rid="ref2 ref6">2, 5</xref>
        ]. This category could also include general
Large Language Models that aid in answering student questions or, in the case of Data Science or
Computer Science courses, generating code. Administrative tools are geared toward educational staff or
professionals who function out of the direct line of sight of students. These could be anything from business
intelligence systems for financial analyses to application tools that aid in the admissions process.
      </p>
      <p>
        This project focuses on the learning level by exploring the creation of a multimodal chatbot to help
students in a specific course. The multimodal nature of the approach is a growing research area, but
one that requires more attention [9]. The follow-on work will not only present the tool but will also give
students an understanding of how it is trained, along with opportunities to augment it with new data
throughout the course, thus touching on the previously referenced ideas of facilitating AI literacy. This
also allows for an active learning approach known to be productive for learning in STEM environments [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The original RAG framework [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ] introduced a method of augmenting LLM output with external
documents. Follow-up research has explored knowledge-grounded dialogue, domain-specific retrieval,
and image-text fusion models like CLIP [11]. Our work draws from these threads but focuses on
integrating image and text embeddings within RAG for a specific instructional context, aligning with
efforts in educational NLP and multimodal LLMs (refer to Appendix A.3 and A.4 for model architectures).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data Sources</title>
        <p>
          We curated multimodal data from DS3001: Foundations of Machine Learning, including:
• Lecture slides (text + images)
• Lecture audio transcripts (text)
• Open Source ML Textbooks (text + images)
• Open Source ML papers (text + images)
Images and textual content were extracted and segregated from lecture slides, machine learning research
papers, and textbooks originally accessed in PDF format. Additionally, audio recordings from lecture
videos were transcribed into text using YouTube’s speech-to-text transcription tool and incorporated as
part of the textual dataset [
          <xref ref-type="bibr" rid="ref16">14</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Embedding Details</title>
        <p>Textual Data: As shown in Appendix A.3 figure 6, text chunks (1500 tokens, 100-token overlap)
were embedded using SentenceTransformer all-mpnet-base-v2 and stored in a text-only Pinecone
Database (dim=768).</p>
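        <p>The chunking scheme above (1500-token windows with a 100-token overlap) can be sketched as follows. This is a minimal illustration, not the project’s actual pipeline: it approximates tokens with whitespace-split words, and the commented lines show an assumed sentence-transformers call for the embedding step.</p>

```python
# Sketch of the chunking step: fixed-size token windows with overlap.
# NOTE: tokens are approximated by whitespace-split words here; the real
# pipeline would use the embedding model's own tokenizer.

def chunk_tokens(tokens, size=1500, overlap=100):
    """Split a token list into windows of `size` tokens, each
    overlapping its successor by `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks

# Assumed embedding step with sentence-transformers (not run here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim vectors
# vectors = model.encode([" ".join(c) for c in chunk_tokens(doc_tokens)])
```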
        <p>Visual Data: As shown in Appendix A.3 figure 6, to support multimodal retrieval in a RAG pipeline, we
leverage OpenAI’s CLIP model, a pretrained model which embeds texts and images into the same vector
space and minimizes the distance between semantically similar image and text vectors [12]. CLIP has
proven effective in zero-shot image classification, reducing or eliminating the need for expensive
training on application-specific image datasets. This alignment enables cross-modal similarity comparisons:
both textual passages and visual assets (e.g., images) can be encoded using CLIP’s respective encoders
and stored as embeddings in a vector database. At inference time, a user’s natural language query is
encoded via the text encoder and used to identify embeddings using nearest neighbors. This enables
semantically consistent retrieval across modalities—e.g., matching a query like “building a decision tree”
to visual representations of decision tree materials. Results are then passed to a language model for
generation.</p>
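        <p>Cross-modal ranking of this kind can be sketched as below. The cosine-similarity helper is runnable; the commented CLIP calls show one assumed way to obtain the embeddings (via Hugging Face’s transformers library), not the authors’ exact pipeline.</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed usage with Hugging Face's CLIP implementation (not run here):
# from transformers import CLIPModel, CLIPProcessor
# model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# inputs = processor(text=["building a decision tree"],
#                    return_tensors="pt", padding=True)
# query_vec = model.get_text_features(**inputs)[0].detach().numpy()
# # Image vectors come from model.get_image_features(...) at indexing
# # time; candidates are then ranked by cosine_similarity(query_vec, v).
```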
        <p>Retrieval Method: When querying the RAG system, the prompt is embedded using both models
(SentenceTransformers and CLIP). As illustrated in figure 7 in Appendix A.4, this creates a dual process
that ends with feeding a multimodal LLM supplemental content from both the image and text
databases. Keeping two separate databases allows for the comparison of text-only
versus text-plus-image generation. The raw images are stored in MongoDB for retrieval after the
embedding-search phase. The name of each image is stored as metadata in the image database.
Upon completing the search, the filenames of the most relevant images are retrieved and used to fetch
the corresponding raw images from the database.</p>
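        <p>The post-retrieval step described above (filenames stored as metadata, raw images fetched from a separate document store) can be sketched roughly as follows. The function names, and the dict standing in for MongoDB, are illustrative assumptions rather than the authors’ implementation.</p>

```python
# Sketch of the image lookup after the embedding search. The CLIP index
# returns (filename, score) hits; raw images live in a separate store
# (MongoDB in the paper; a plain dict stands in here).

def fetch_images(image_hits, image_store, top_k=5):
    """Return (filename, raw_bytes) pairs for the top_k hits that
    actually exist in the store. `image_hits` is assumed to be
    sorted by similarity score, highest first."""
    images = []
    for filename, _score in image_hits[:top_k]:
        raw = image_store.get(filename)
        if raw is not None:
            images.append((filename, raw))
    return images

def build_prompt_context(query, text_hits, images):
    """Combine the user query, retrieved text chunks, and image
    references into one payload for the multimodal LLM."""
    return {
        "query": query,
        "text_context": [chunk for chunk, _score in text_hits],
        "images": [filename for filename, _raw in images],
    }
```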
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiment Design</title>
        <sec id="sec-3-3-1">
          <title>We tested multiple RAG configurations:</title>
          <p>• Zero-shot LLMs: Not including the RAG component. This treatment is the baseline for how the</p>
          <p>LLM responds to the query without additional content added to the question passed to the LLM.
• Text Only RAG (10 text vectors): Using only text retrieved from our vector DB. This treatment
includes the user’s query along with the top 10 text vectors retrieved by highest cosine similarity
score to the user’s query.
• Balanced Swap (5 text + 5 image vectors): Text with less relevant vectors replaced by top
images. This treatment includes the user’s query along with the top five text vectors retrieved by
highest cosine similarity score to the user’s query and the top five images retrieved by highest
cosine similarity score to the user’s query.
• Text + Image (10 text + 10 image vectors): Addition of more visual information along with the
base textual information. This treatment includes the user’s query along with the top ten text
vectors retrieved by highest cosine similarity score to the user’s query and the top ten images
retrieved by nearest-neighbor search using cosine similarity to the user’s query.</p>
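           <p>The four treatment configurations above can be sketched as a single selection function. The names below are illustrative, not drawn from the authors’ code; both hit lists are assumed to be pre-ranked by cosine similarity.</p>

```python
# Sketch of the four treatments. `text_hits` and `image_hits` are
# lists already ranked by cosine similarity, highest first.

def select_context(treatment, text_hits, image_hits):
    if treatment == "zero_shot":
        return [], []                          # query only, no retrieval
    if treatment == "text_only":
        return text_hits[:10], []              # top 10 text vectors
    if treatment == "balanced_swap":
        return text_hits[:5], image_hits[:5]   # 5 text + 5 image vectors
    if treatment == "text_plus_image":
        return text_hits[:10], image_hits[:10] # 10 text + 10 image vectors
    raise ValueError(f"unknown treatment: {treatment}")
```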
          <p>The evaluation dataset was constructed to reflect course-aligned content, incorporating both generic
and specific questions derived from educational materials. Generic questions are designed to be
answerable using general knowledge, while specific questions require information from unique sources
such as lecture notes, specific textbooks, and lecture recordings. Consequently, standard LLMs may
under perform on specific questions compared to generic ones, highlighting the importance of RAG in
generating accurate responses (See appendix A.3 and A.4).</p>
          <p>To ensure the questions’ specificity and relevance, we curated a set of 30 questions, 15 specific and 15
generic. This approach was preferred over mass-generation using LLMs, as LLM-generated questions
may lack the desired specificity and could lead to inconsistent answers for evaluation. These questions
and answers were generated using the authors’ expertise as graduate students and then cross-referenced
with multiple School of Data Science faculty. The model-generated answers to the questions serve as
the key metric for evaluation. A bootstrap resampling method allows for a robust measure of the quality
of the responses. Additionally, we imposed word limits on the answers to further reduce variability in
the system’s responses, enabling a more controlled assessment of its performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation</title>
        <p>Context Recall, Faithfulness, and Factual Correctness were chosen to target three critical components
of RAG for evaluation. We wanted to evaluate the key elements of the model’s performance, seen in
figure 2, in terms of the Context retrieved, the relevance to the Query, and the generated Response. The
list below outlines how each metric aligns with one of these components. From the Context, we need
to ensure that the generated response aligns with the retrieved information. From the Query, we need
to ensure that the retrieved context is relevant to the query at hand. From the Response, we need to
ensure that the response actually answers the query we began with.</p>
        <p>
          We used the RAGAS [
          <xref ref-type="bibr" rid="ref7">6</xref>
          ] evaluation package to obtain metrics for our models:
• Context Recall: Measures whether the model successfully retrieved the right pieces of
information. For example, when a user asks a question, the model pulls in background documents to
help it answer. Context recall tells us what fraction of the relevant supporting documents were
actually retrieved. In our project, this metric helped us assess whether models could correctly
surface source material, particularly for domain-specific queries that rely on subtle or technical
context (Context).
• Faithfulness: Evaluates whether the model’s reasoning is grounded in the material it retrieved.
Even if the model finds the right documents, it might still produce responses that are misleading
or overconfident. Faithfulness measures whether the model’s answer can be directly supported
by the retrieved evidence. In our use case, we applied this metric to ensure that the generated
answers didn’t hallucinate facts or stray from the actual contents of the documents (Query).
• Factual Correctness: Assesses whether the final answer itself is accurate in relation to the query,
even if the model uses correct information in the wrong way. This is the most outcome-focused of
the three metrics: it checks if the model ultimately gives a factually valid response. For example,
even if the right context was retrieved and used, the final output still needs to be judged on
whether it answers the user’s question truthfully. This was especially important for us when
evaluating model responses to specific, high-stakes queries (Response).</p>
        <p>In order to evaluate the three metrics, 30 questions were sampled 400 times for each of the four
treatment groups, 200 for the generic and 200 for the specific. This resulted in a total of 1,600 scored
responses: 800 from generic and 800 from specific questions. This number of model runs was chosen
due to financial constraints associated with continuous use of the GPT API. We employed a pooled
analysis strategy that aggregated all scores within each model and question type. This decision was
guided by several factors: (1) all experimental runs were conducted under identical conditions with
consistent question distributions and evaluation methods, (2) the goal of the study was to assess average
model performance, and (3) the inherent variability of LLM outputs makes pooling a robust way to
estimate the model output distribution.</p>
        <p>For each model and metric, we applied non-parametric bootstrapping by drawing 10,000 resamples
with replacement from the scored responses to build an empirical distribution of the mean. From this,
we computed 95% confidence intervals using the 2.5th and 97.5th percentiles of the resampled means.
To compare models, we calculated the difference in their observed means and combined their bootstrap
standard errors to form a confidence interval around the difference. If a model’s bootstrapped confidence
interval for the mean difference lay wholly above or below the baseline, it was judged significantly
better or worse; otherwise, its performance was considered statistically indistinguishable from the
baseline.</p>
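        <p>The percentile-bootstrap procedure described above can be sketched as follows. This is a simplified, pure-Python illustration of resampling the mean, not the project’s evaluation code; for constant scores the interval collapses to a point, as expected.</p>

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for the
    mean of `scores`: resample with replacement, compute the mean of
    each resample, and take the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)  # fixed seed for reproducible scoring
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```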
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Results</title>
      <p>We evaluated the effect of incorporating images into the RAG workflow using a multimodal LLM
(GPT-4.1 Nano). Our goal was to assess whether visual content improves response quality and contextual
grounding, especially across question types of varying specificity.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <sec id="sec-4-1-1">
          <title>We tested two configurations for image inclusion:</title>
          <p>• Text + Image (10T + 10I): Adds 10 image vectors to the 10 retrieved text vectors, preserving all
textual context while layering on visual information.
• Balanced Swap (5T + 5I): Replaces the bottom 5 text vectors with the top 5 image vectors,
maintaining the same number of total context inputs but altering the text-image ratio.</p>
          <p>Both configurations were evaluated on the curated dataset of generic and specific questions derived
from course materials previously described. We compared these against a Text-Only RAG baseline and
a Zero-Shot (no retrieval) setting. Evaluation was based on the three key metrics previously described:
Context Recall, Faithfulness, and Factual Correctness. Context Recall and Faithfulness are RAG specific
measures and thus do not include the Zero Shot model.</p>
          <p>In summary, the bootstrap-derived confidence intervals were narrow, and the statistical power of this
design was more than sufficient to detect small-to-moderate differences in model behavior. This pooled
analysis strategy is especially appropriate in studies like this one, where experimental conditions are
controlled and average-case performance is the primary analytic focus.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generic Questions</title>
        <p>On generic questions, Text-Only RAG performs competitively across all evaluation metrics, with
no statistically significant differences observed when compared to either the Zero Shot or Text + Images
(10T + 10I) models for context recall, factual correctness (F1), or faithfulness. However, two statistically
significant differences emerged. First, Balanced Swap (5T + 5I) performs significantly worse than
Text-Only RAG on factual correctness (F1), with an average drop of approximately 16 percentage points
(p = 0.0008, 95% CI: [–0.25, –0.07]). Second, the same model also underperformed significantly on
faithfulness, showing a decrease of nearly 29 percentage points relative to Text-Only RAG (p &lt; 0.0001,
95% CI: [–0.42, –0.16]). These findings indicate that while image augmentation at the Text + Images (10T
+ 10I) scale preserves parity with the baseline, the Balanced Swap (5T + 5I) configuration introduces
meaningful degradations in factual accuracy and faithfulness.</p>
        <p>For the RAG specific measures, we observed that the Text + Images (10T + 10I) configuration modestly
improved Context Recall compared to the Text-Only RAG baseline. The Balanced Swap (5T + 5I) setup
led to a larger increase in recall, though not statistically significant, suggesting that more testing would
be needed to validate that adding well-ranked images can improve retrieval relevance.</p>
        <p>However, this improvement came at a cost. Faithfulness and Factual Correctness declined in the
Balanced Swap (5T + 5I) setup, likely due to the removal of text content that the LLM relied on for
broader context and coherence. This tradeoff implies that generic questions, often answerable via
general textual knowledge, benefit more from rich text contexts than from visual augmentation.</p>
        <p>Summary: Image inclusion boosts Context Recall, but not significantly, and replacing even marginally
relevant text hurts Faithfulness and Factual Correctness in generic settings. Retaining broader textual
context is crucial for accurate and coherent answers, so it may not be worth including images in all
scenarios. Although we observed no significant differences between the zero-shot and RAG model metrics,
it is worth noting that a RAG approach allows the content to be easily updated, and since
performance was not worsened, we would recommend this approach.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Specific Questions</title>
        <p>Performance across models was generally similar to the Text-Only RAG baseline for both context recall
and faithfulness, with no statistically significant differences observed. All models performed at or near
ceiling for context recall, and faithfulness scores varied slightly but within overlapping confidence
intervals. However, a significant improvement emerged for factual correctness (F1): the Text + Images
(10T + 10I) model outperformed the Zero Shot baseline by approximately 13 percentage points (p =
0.014, 95% CI: [+0.03, +0.23]), indicating a meaningful benefit from richer context integration. As seen
in the graphic above, the confidence intervals for Zero Shot and Text + Images (10T + 10I) do not overlap.
The Text-Only RAG and Balanced Swap (5T + 5I) models also showed improvements in factual correctness
relative to Zero Shot, but these differences did not reach statistical significance. Overall, only the Text +
Images (10T + 10I) configuration showed a robust advantage on specific factual accuracy.</p>
        <p>This suggests that the inclusion of images under certain conditions in more specific questions has
a significant positive effect on the factual correctness of the LLM. Moreover, the RAG system allows
for the tracking of where content is getting pulled from inside the vector database to supplement the
generation of responses, which could allow for a deeper level of understanding of relevant content as it
relates to student questions.</p>
        <p>For the RAG specific measures, adding 10 images on top of 10 text vectors in the Text + Images (10T +
10I) configuration slightly reduced context recall, likely due to visual noise introduced by less relevant
images. However, the Balanced Swap (5T + 5I) configuration achieved perfect recall consistently,
showing that highly ranked visual content can provide strong contextual grounding for specialized
queries, adding further support for combining text and images on specific questions.</p>
        <p>When compared to the generic questions, faithfulness and factual correctness remained stable or
slightly improved in both multimodal settings for specific questions. This suggests that relevant visual
content supports accurate generation without undermining the consistency of the LLM responses.</p>
        <p>Summary: For specific questions, selectively replacing lower-ranked text vectors with relevant
images improves retrieval and enhances response quality. Excessive image inclusion, however, may
distract the model. Overall, including images significantly improves results for specific questions when
compared to zero-shot models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This study explored the impact of multimodal retrieval, specifically the integration of image vectors
within a Retrieval-Augmented Generation (RAG) framework for educational applications. Our
experiments demonstrated that visual content, when selectively incorporated, can enhance certain quality
measures of a multimodal LLM, especially for conceptually dense context. This is important when
utilizing course materials such as lectures or digital textbooks for the creation of intelligent assistants,
as the multimodal approach does seem to have advantages but in limited context. It is also important
to note that while we did not experience any hallucinations in our experiment, this approach is not
designed to prevent hallucinations from occurring, though the Faithfulness measure is designed to quantify
false or misleading content.</p>
      <p>Specifically, our results also show that a fixed or naive strategy for image inclusion is suboptimal,
meaning that tuning the system to include only a limited set of the most relevant images is ideal. In the Text +
Images (10T + 10I) setup, the inclusion of excessive visual information led to performance degradation
in certain metrics, particularly Faithfulness and Factual Correctness. These findings underscore the
importance of context curation and relevance filtering in multimodal systems.</p>
      <p>Future work will focus on developing dynamic, adaptive strategies to optimize retrieval and improve
LLM responses. Key directions include:
• Designing an agentic RAG selector that adjusts the mix of text and image vectors based on
real-time query specificity analysis.
• Exploring semantic clustering and alignment across modalities to better group and rank
context vectors.
• Enhancing evaluation efficiency through smarter sampling, reproducible scoring pipelines, and
reduced compute requirements.
• Knowledge Graph based RAG would work very well on this corpus of data as observed from
the PCA Analysis of Clustered Text Vectors.</p>
      <p>These improvements aim to support the development of intelligent, multimodal RAG systems that
dynamically tailor context inputs—maximizing educational value and improving user engagement in
classroom and self-guided learning environments.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Statement on Use of Generative AI</title>
      <p>During the preparation of this manuscript, the author(s) used GPT4o to edit background context
and summarize relevant research articles. The output was subsequently reviewed, revised, and fully
controlled by the author(s). The authors take full responsibility for the accuracy and integrity of the
content presented.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <sec id="sec-7-1">
        <title>A.1. Vector Store Visualization</title>
        <p>This is a live link to an example of how questions and documents are embedded in our vector store.
The most semantically similar documents used in the response are highlighted in purple and green.
https://msds-capstone-project.github.io/MultiModalRAGViz/</p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Evaluation Metrics</title>
        <p>These are the evaluation metrics calculated via 10 Bootstrapped sampling rounds of 50 queries each.</p>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Storage Pipeline Diagram</title>
      </sec>
      <sec id="sec-7-4">
        <title>A.4. User Pipeline Diagram</title>
        <p>Figure 7: A.4 Pipeline of how the user is going to experience the architecture</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] José Rafael Aguilar-Mejía et al. “Design and Use of a Chatbot for Learning Selected Topics of Physics”. In: Technology-Enabled Innovations in Education. Ed. by Samira Hosseini et al. Singapore: Springer Nature, 2022, pp. 175-188. isbn: 978-981-19-3383-7. doi: 10.1007/978-981-19-3383-7_13.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Vincent Aleven et al. “Help Helps, But Only So Much: Research on Help Seeking with Intelligent Tutoring Systems”. In: International Journal of Artificial Intelligence in Education 26.1 (Mar. 2016), pp. 205-223. issn: 1560-4292, 1560-4306. doi: 10.1007/s40593-015-0089-1. url: http://link.springer.com/10.1007/s40593-015-0089-1 (visited on 05/21/2025).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. Javier</given-names>
            <surname>Calzada-Prado</surname>
          </string-name>
          and
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Marzal</surname>
          </string-name>
          . “
          <article-title>Incorporating Data Literacy into Information Literacy Programs: Core Competencies and Contents</article-title>
          ”. In:
          <source>Libri</source>
          <volume>63</volume>
          (June
          <year>2013</year>
          ). doi: 10.1515/libri-2013-0010.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Maud</given-names>
            <surname>Chassignol</surname>
          </string-name>
          et al. “
          <article-title>Artificial Intelligence trends in education: a narrative overview</article-title>
          ”. In:
          <source>Procedia Computer Science. 7th International Young Scientists Conference on Computational Science, YSC2018, 02-06 July 2018, Heraklion, Greece</source>
          <volume>136</volume>
          (Jan.
          <year>2018</year>
          ), pp.
          <fpage>16</fpage>
          -
          <lpage>24</lpage>
          . issn: 1877-0509. doi: 10.1016/j.procs.2018.08.233. url: https://www.sciencedirect.com/science/article/pii/S1877050918315382 (visited on 03/28/2024).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Lijia</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pingping</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhijian</given-names>
            <surname>Lin</surname>
          </string-name>
          . “
          <article-title>Artificial Intelligence in Education: A Review</article-title>
          ”. In:
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ), pp.
          <fpage>75264</fpage>
          -
          <lpage>75278</lpage>
          . issn: 2169-3536. doi: 10.1109/ACCESS.2020.2988510. url: https://ieeexplore.ieee.org/document/9069875 (visited on 03/29/2024).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6] ExplodingGradients.
          <source>Ragas: Evaluation framework for LLM-generated responses</source>
          . url: https://docs.ragas.io/en/latest/. Accessed: 2025-07-08.
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Lewis</surname>
          </string-name>
          et al.
          <source>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</source>
          . arXiv: 2005.11401 [cs]. Apr.
          <year>2021</year>
          . doi: 10.48550/arXiv.2005.11401. url: http://arxiv.org/abs/2005.11401 (visited on 05/03/2024).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Duri</given-names>
            <surname>Long</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brian</given-names>
            <surname>Magerko</surname>
          </string-name>
          . “
          <article-title>What is AI Literacy? Competencies and Design Considerations</article-title>
          ”. In:
          <source>Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. CHI '20</source>
          . New York, NY, USA: Association for Computing Machinery, Apr.
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . isbn: 978-1-4503-6708-0. doi: 10.1145/3313831.3376727. url: https://doi.org/10.1145/3313831.3376727 (visited on 03/27/2024).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mehrnoush</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          et al. “
          <article-title>Artificial Intelligence in Multimodal Learning Analytics: A Systematic Literature Review</article-title>
          ”. In:
          <source>Computers and Education: Artificial Intelligence</source>
          (May
          <year>2025</year>
          ), p.
          <fpage>100426</fpage>
          . issn: 2666-920X. doi: 10.1016/j.caeai.2025.100426. url: https://www.sciencedirect.com/science/article/pii/S2666920X25000669 (visited on 05/22/2025).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Fan</given-names>
            <surname>Ouyang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pengcheng</given-names>
            <surname>Jiao</surname>
          </string-name>
          . “
          <article-title>Artificial Intelligence in Education: The Three Paradigms</article-title>
          ”. In:
          <source>Computers and Education: Artificial Intelligence</source>
          <volume>2</volume>
          (Apr.
          <year>2021</year>
          ), p.
          <fpage>100020</fpage>
          . doi: 10.1016/j.caeai.2021.100020.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          et al.
          <source>Learning Transferable Visual Models From Natural Language Supervision</source>
          .
          <year>2021</year>
          . arXiv: 2103.00020 [cs.CV]. url: https://arxiv.org/abs/2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          et al. “
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          ”. In:
          <source>Proceedings of the 38th International Conference on Machine Learning</source>
          .
          <year>2021</year>
          . url: https://arxiv.org/abs/2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Elisabeth</given-names>
            <surname>Sulmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Patitsas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeremy R.</given-names>
            <surname>Cooperstock</surname>
          </string-name>
          . “
          <article-title>Can You Teach Me To Machine Learn?</article-title>
          ” In:
          <source>Proceedings of the 50th ACM Technical Symposium on Computer Science Education. SIGCSE '19</source>
          . New York, NY, USA: Association for Computing Machinery, Feb.
          <year>2019</year>
          , pp.
          <fpage>948</fpage>
          -
          <lpage>954</lpage>
          . isbn: 978-1-4503-5890-3. doi: 10.1145/3287324.3287392. url: https://dl.acm.org/doi/10.1145/3287324.3287392 (visited on 03/28/2024).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Brian</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <source>MultiModalRAGbw</source>
          . url: https://github.com/NovaVolunteer/MultiModalRAGbw. Accessed: 2025-07-08.
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>