<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Special Session on Harmonising Generative AI and Semantic Web Technologies, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hybrid Evaluation of Socratic Dialogue for Teaching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleni Ilkou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan Linzbach</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Wallat</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>13</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We present a kick-starter paper that addresses the opportunities and challenges at the intersection of Generative AI (GenAI), Semantic Web Technologies, and Human-Computer Interaction in the Socratic method for educational purposes. Inspired by the example of Large Language Model (LLM) tutors using Socratic dialogue for teaching, we motivate the need for new hybrid benchmarks and metrics that assess a tutor’s performance by combining parameters from LLM, Knowledge Engineering (KE), and Hybrid Human Artificial Intelligence (HHAI) performance. We explore current problems and propose a future direction for the hybrid implementation of Socratic dialogue with hybrid evaluation methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Hybrid Human AI</kwd>
        <kwd>Hybrid Benchmarks</kwd>
        <kwd>Hybrid Metrics</kwd>
        <kwd>Socratic Method</kwd>
        <kwd>Socratic Sub-questions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Socratic Dialogue and its Multidisciplinary Technical Dependencies</title>
      <p>
        Socrates, the ancient Greek philosopher, is known for his teaching style, which encouraged students to
explore the limitations of their knowledge and understanding rather than providing direct answers.
Following this example, the Socratic dialogue technique employs six pedagogical measures, including
encouraging critical thinking, leading individuals to uncover knowledge rather than stating it,
developing mutual understanding, and constantly challenging the opponents’ views [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Because it aims to
lead students to uncover knowledge themselves rather than passively receive information from the
teacher, the Socratic dialogue is widely used in educational settings [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Recently, Bonino et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
         ] proposed a Socratic method with a fine-tuned LLM for promoting students’
critical thinking and self-discovery. The fine-tuning process yielded substantial enhancements in
performance, with one model exhibiting superior efficacy relative to the GPT-4o model in high-quality
Socratic interactions. Furthermore, Khan Academy, a well-known personalised educational service
provider, has integrated a form of Socratic LLM support for students into its e-learning systems:
Khanmigo [
        <xref ref-type="bibr" rid="ref4">4</xref>
         ]. Khanmigo offers a new approach to learning, in which the learner is actively engaged
through inquiry and discovery. By inputting specific questions or problems into the model, learners
can leverage the platform’s knowledge base for a guided, Socratic-style exploration of complex concepts.
      </p>
      <p>
        However, successfully deploying Socratic tutoring systems is not a trivial task as such systems have
technical dependencies across several domains. Firstly, an LLM is deployed for its ability to communicate
in natural language. In parallel, a Knowledge Engineering (KE) component is necessary to ensure factual
correctness and support long-term reasoning through structures, such as ontologies and Knowledge
Graphs, which add a semantic layer of understanding [
        <xref ref-type="bibr" rid="ref5">5</xref>
         ]. Furthermore, a Hybrid Human Artificial
Intelligence (HHAI) component is mandatory to account for the specific needs of human users. With
human feedback in the loop, the system can integrate personalised input alongside AI capabilities,
fostering a collaborative environment where human and machine inputs enhance learning,
discovery, and inquiry. Therefore, the Socratic dialogue’s multidisciplinary nature requires a hybrid
evaluation approach. Beyond metrics like accuracy and Hit@k scores, it is crucial to assess the system’s
reliability as a tutor, its suitability for education, and its ability to adapt to individual learners’ needs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Benchmarking the Socratic Method for Teaching</title>
      <sec id="sec-2-1">
        <title>2.1. Quantity over Quality: Current Limitations of LLM Tutors</title>
        <p>
          LLM personal tutors have been shown to be beneficial, especially for students with no prior domain
knowledge [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Implementing an LLM personal tutor that follows the Socratic dialogue requires the
LLM to act as a surrogate for human teachers. However, optimizing these models to guide student
inquiry is costly and computationally expensive, which poses a barrier to widespread adoption in
education. Furthermore, the use of AI in education involves sensitive issues such as privacy regulations
concerning students’ data. In a Socratic dialogue, an LLM would need access to student responses and
interactions, which raises concerns about data storage and usage by the LLM provider. Moreover,
the LLM’s knowledge is constrained by a fixed cut-off time [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] that can lead to a lack of information
and an increase in unreliable outputs. Even knowledge included in the training data suffers from
hallucinations, which limits user trust [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and inhibits direct application in educational contexts.
In addition, LLM performance is highly dependent on the syntax and semantics of the prompt,
which can result in sub-optimal performance and unreliable behaviour [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ].
        </p>
        <p>[Figure 1: The student’s query passes through the proxy-LLM to the centralized LLM, which is benchmarked for explainability and consistency, and a reply is returned to the student.]</p>
        <p>Additionally, a key feature of the Socratic dialogue is detecting the student’s current knowledge state
and guiding them to their learning goal, which requires verified data and the ability to plan. Currently,
LLMs have limited ability to determine precisely the student’s background and assess the student’s
current knowledge state, which makes it challenging to pose the right questions at the right time to
facilitate personalised learning. Retrieval-augmented Generation (RAG) models have addressed some of
these limitations by a pre-pended retrieval stage where relevant information, such as the educational
curricula, is collected and fed into an LLM to assist the performance. RAG reduces the tendency for
hallucinations by grounding the generation process in (retrieved) factual information [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and assists in
easily updating the LLM with more up-to-date information [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. RAG models could be implemented for
the Socratic method; however, we argue that this will require extending the current benchmarking
techniques.
        </p>
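        <p>The retrieval-augmented set-up described above can be sketched minimally as follows; the word-overlap retriever, the toy curriculum snippets, and the prompt template are illustrative placeholders rather than a specific RAG framework.</p>
        <preformat>
```python
# Minimal RAG sketch: rank curriculum snippets by naive word overlap with
# the query, then prepend the top snippets to the prompt of a downstream LLM.
# (Retriever, corpus, and prompt template are illustrative placeholders.)

def retrieve(query, corpus, k=2):
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q_words.intersection(d.lower().split())))
    return ranked[:k]

def build_prompt(query, corpus):
    """Ground the generation step in the retrieved (factual) snippets."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer as a Socratic tutor."

curriculum = [
    "Fractions: a fraction represents a part of a whole.",
    "Photosynthesis converts light energy into chemical energy.",
    "Knowledge graphs encode entities and their relations.",
]
prompt = build_prompt("What is a fraction?", curriculum)
```
        </preformat>
        <p>In a full system, the word-overlap ranking would be replaced by a dense or sparse retriever over the curated curriculum, but the grounding step remains the same.</p>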
        <p>
          Although RAG models rely on the quality of the retrieved documents for question answering, current
implementations include evaluation metrics neither for the quality of the retrieved documents nor for
human aspects [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Lastly, although human input has been considered in evaluating LLM quality, many
downstream tasks depend on performance metrics like accuracy. These metrics are not always aligned
with human preferences and often fail to capture the plausibility or significance of an error. To rigorously
benchmark the Socratic method for teaching, it is imperative to adopt a hybrid evaluation framework
that combines conventional performance metrics with assessments of the system’s efficacy in
educational settings. Relying solely on traditional metrics that evaluate a single competency from a
single angle may overlook critical aspects of educational interaction, such as engaging students,
adapting to their individual learning needs, and sustaining pedagogical effectiveness.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Team Work makes the Dream Work: Combining GenAI, KGs, and Humans</title>
        <p>
          We propose a hybrid architecture for the Socratic method, which consists of a centralized GenAI-LLM
model, a smaller proxy LLM, and a KE component, as displayed in Figure 1. In the system, the
student issues a query, which is processed and anonymized by the proxy-LLM. LLM-to-LLM
communication then takes place to optimize the retrieval capabilities. The centralized LLM communicates
with the KE component, consisting mainly of a Knowledge Graph, to fact-check and lend credibility
to the acquired knowledge. Finally, the answer is provided to the student. Generally, the Socratic
method describes an asymmetric dialogue set-up with a more capable teacher and a less capable student.
The system, acting as Socrates, would facilitate adaptive learning by detecting the user’s learning path
and gradually increasing the difficulty of assessments [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], ensuring students’ engagement with the
material in a structured way. The algorithmic tutor is grounded in a KE module consisting of
a Knowledge Graph structured around well-defined, ministry-approved, quality curricula, organized
into content levels of engagement and understanding based on the established framework of Bloom’s
taxonomy [18]. The Knowledge Graph is built around the educational resources that the user is tested
on and includes advanced knowledge about the specific educational field [19, 20]. Each learning material
is broken down into smaller sections to align with the different components of the Socratic method. The
KE module enables guidance through increasingly complex topics while ensuring that the questions
and answers posed by the LLM are appropriate for the learner’s cognitive level and aligned with their
learning goals [21].
        </p>
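        <p>The dataflow described above can be sketched as follows; all three components are hypothetical stubs shown only to make the query-to-reply path concrete, and a real system would replace them with actual LLM and Knowledge Graph calls.</p>
        <preformat>
```python
# Sketch of the Figure 1 dataflow: student query, proxy-LLM anonymization,
# centralized LLM answer, Knowledge Graph fact-check, reply to the student.
# All three components are hypothetical stubs, not real model or KG calls.

def proxy_anonymize(query):
    """Stand-in for the proxy LLM: drop personal tokens from the query.
    A real proxy would use learned NER-style filtering, not a word list."""
    personal = {"i", "am", "ana."}  # assumed personal tokens for this toy query
    return " ".join(w for w in query.split() if w.lower() not in personal)

def centralized_llm(query):
    """Stand-in for the centralized GenAI-LLM producing a candidate answer."""
    return {"answer": "A fraction represents a part of a whole.", "topic": "fractions"}

def kg_fact_check(candidate, knowledge_graph):
    """Stand-in for the KE component: mark answers supported by the KG."""
    facts = knowledge_graph.get(candidate["topic"], [])
    candidate["verified"] = candidate["answer"] in facts
    return candidate

kg = {"fractions": ["A fraction represents a part of a whole."]}
safe_query = proxy_anonymize("I am Ana. What is a fraction?")
reply = kg_fact_check(centralized_llm(safe_query), kg)
```
        </preformat>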
        <p>A second, smaller proxy LLM is interposed in the pipeline to shield the student from third-party
surveillance or privacy breaches from the centralized LLM. In particular, smaller language model
architectures such as BERT [22], DistilBERT [23], and GPT-2 [24] could serve as candidates for a locally run
proxy LLM. The proxy LLM takes the raw student query as input, processes it, and discovers the best
way to retrieve information from the centralized LLM, while filtering out all personal data irrelevant
to the current topic to provide a privacy-secure learning environment for the student. Furthermore,
the proxy LLM allows the training and benchmarking of the Socratic dialogue without the necessity
of human involvement. This is critical, as human feedback can be expensive (e.g., user studies) and
sometimes even impossible to attain (e.g., overnight software updates), and the latent variables impacting
humans are notoriously numerous and hard to control (e.g., learning types and prior knowledge). In contrast,
by controlling the model’s behavior via the training data, known vocabulary, multilingual abilities,
and training paradigms, we make it more feasible to test and train the centralized LLM and
determine its educational capabilities for the learners.</p>
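        <p>The idea of benchmarking without human involvement can be sketched as follows; the simulated student, the concept lists, and the scoring rule are illustrative assumptions, not part of an established benchmark.</p>
        <preformat>
```python
# Sketch: benchmarking the tutor against a controlled, simulated student
# instead of human subjects. Because the simulated student's knowledge state
# is fully known, question targeting can be scored exactly.
# (Concepts and the scoring rule are illustrative assumptions.)

def student_knows(known_concepts, concept):
    """The simulated student answers correctly only for known concepts."""
    return concept in known_concepts

def tutor_targeting_score(question_concepts, known_concepts):
    """Fraction of tutor questions probing concepts the student does not yet
    know, i.e. questions that can actually advance the dialogue."""
    new_ground = [c for c in question_concepts if not student_knows(known_concepts, c)]
    return len(new_ground) / len(question_concepts)

plan = ["addition", "multiplication", "fractions", "ratios"]
known = {"addition", "multiplication"}
coverage = tutor_targeting_score(plan, known)
```
        </preformat>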
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Communication begins with Connection: New Hybrid Benchmarks and Metrics</title>
        <p>The evaluation of Socratic dialogue for teaching poses unique challenges, as traditional metrics often
fail to capture the complexity of fostering meaningful learning experiences. Hybrid metrics that
integrate key aspects of KE, LLM performance, and human-AI interaction are essential to accomplish
a sophisticated and well-evaluated Socratic dialogue system. Without a human in the loop,
LLM-generated prompts may not align with real-world learning scenarios or drive meaningful discussion, as
they might lack the nuance of human interaction. This is a major concern in settings where no human
teacher is present to guide the AI-student dialogue. Therefore, there is a demand for hybrid metrics that
will enable a more holistic evaluation of LLMs as Socratic tutors, ensuring they not only deliver factual
accuracy but also facilitate cognitive growth, adaptive learning, and reflective thinking. To introduce
such hybrid metrics, new benchmarks must be created that account for the needs of each stakeholder:
LLMs must be assessed for their ability to generate adaptive, pedagogically sound questions; KE must
focus on the alignment of LLM-driven interactions with structured learning objectives and conceptual
frameworks; and human-AI interaction should ensure that the dialogue supports engagement, curiosity,
and student autonomy.</p>
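        <p>One way such a hybrid metric could combine the three stakeholder assessments is a simple weighted aggregation; the component scores and weights below are illustrative placeholders, not established values.</p>
        <preformat>
```python
# Sketch of a hybrid metric aggregating the three stakeholder assessments
# named above: LLM pedagogical quality, KE alignment, and human-AI
# interaction. Component scores and weights are illustrative placeholders.

def hybrid_score(llm_pedagogy, ke_alignment, hhai_engagement, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the three component scores, each in [0, 1]."""
    components = (llm_pedagogy, ke_alignment, hhai_engagement)
    return sum(w * c for w, c in zip(weights, components))

score = hybrid_score(llm_pedagogy=0.8, ke_alignment=0.9, hhai_engagement=0.6)
```
        </preformat>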
        <p>Current datasets used for the Socratic method [25, 26] are limited to breaking the dialogue and
interactions down into a small number of sub-parts. As these datasets are developed to evaluate the LLMs’ ability
to generate questions similar to the given dataset, they fail to include the multiple parameters related
to education, such as the complexity of human interactions and learning aspects, as we presented earlier
in Table 1. Therefore, the need to extend benchmarks to include more parameters is prominent.
To further motivate the novelty of our proposed approach, we present below an example of Socratic
sub-questions (the interleaved questions) based on a mathematical problem as presented by Cobbe et al. [27]:
A carnival snack booth made $50 selling popcorn each day. It made three times as much
selling cotton candy. For a 5-day activity, the booth has to pay $30 rent and $75 for the cost
of the ingredients. How much did the booth earn for 5 days after paying the rent and the
cost of ingredients?
How much did the booth make selling cotton candy each day?
The booth made $50 x 3 = $«50*3=150»150 selling cotton candy each day.</p>
        <p>How much did the booth make in a day?
In a day, the booth made a total of $150 + $50 = $«150+50=200»200.</p>
        <p>How much did the booth make in 5 days?
In 5 days, they made a total of $200 x 5 = $«200*5=1000»1000.</p>
        <p>How much did the booth have to pay?
The booth has to pay a total of $30 + $75 = $«30+75=105»105.
(The Socratic dataset and this example are available at https://github.com/openai/grade-school-math.)</p>
        <p>How much did the booth earn after paying the rent and the cost of ingredients?
Thus, the booth earned $1000 - $105 = $«1000-105=895»895.</p>
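        <p>The arithmetic chain in the calculator annotations above can be verified directly:</p>
        <preformat>
```python
# Checking the worked example's calculator annotations step by step.
cotton_candy_per_day = 50 * 3              # $150 from cotton candy
total_per_day = cotton_candy_per_day + 50  # $200 per day in total
total_5_days = total_per_day * 5           # $1000 over the 5-day activity
costs = 30 + 75                            # $105 rent plus ingredients
earnings = total_5_days - costs            # $895 earned after costs
```
        </preformat>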
        <p>In Figure 2, we present three hypothetical teacher-student interactions of our system that build upon
the previous example. The three examples highlight various ways in which benchmarking datasets
and metrics can evolve to incorporate hybrid aspects into the parameters they include and assess.
More specifically, in Figure 2a, the system demonstrates the technical competence of adaptive difficulty
adjustment and skips a few steps of the predefined dialogue to match the student’s learning needs.
In Figure 2b, the system shows flexibility in communication and adapts its responses to the student’s
input. In Figure 2c, the system detects the student’s emotional state and communicates empathetically
according to the student’s emotional needs.</p>
        <p>[Figure 2: (a) Difficulty adaptation; (b) Cultural adaptation; (c) Emotional adaptation]</p>
        <p>Furthermore, hybrid benchmarks could measure cognitive depth, evaluating how well the LLM’s
questions promote higher-order thinking, such as analysis and evaluation, rather than merely focusing
on factual recall. Conceptual progression would assess whether the LLM can guide students through
increasingly complex topics, while adaptive questioning would track how the model adjusts its queries
based on the student’s understanding. Human-centric metrics would ensure that the interaction fosters
emotional involvement and independent problem-solving. Additionally, benchmarks should quantify
the serendipity or surprisal of the generated text—ensuring that LLMs provide students with novel
insights that challenge their thinking without overwhelming them.</p>
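        <p>As one candidate signal for the surprisal mentioned above, token-level surprisal from information theory could be computed as follows; the toy probabilities stand in for a language model's next-token distribution, and this is a sketch rather than an established educational metric.</p>
        <preformat>
```python
import math

# Sketch of one candidate novelty signal: mean token surprisal in bits.
# The probabilities below are toy stand-ins for a language model's
# next-token distribution; this is not an established educational metric.

def surprisal_bits(prob):
    """Rare (low-probability) tokens carry more surprisal."""
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average surprisal of a generated sequence, in bits per token."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)

predictable = mean_surprisal([0.5, 0.5, 0.5])     # uniformly likely tokens
novel = mean_surprisal([0.125, 0.25, 0.0625])     # rarer, more surprising tokens
```
        </preformat>
        <p>A benchmark could then check that a tutor's utterances stay within a target surprisal band: novel enough to challenge the student, not so high as to overwhelm them.</p>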
        <p>These benchmarks must also guarantee pedagogical soundness, ensuring that feedback corrects
misconceptions while encouraging further inquiry. By incorporating these elements, the evaluation
framework will offer a comprehensive, cross-dimensional view of LLM performance, ensuring that the
deployment of LLMs in educational settings promotes meaningful, interactive, and cognitively stimulating
learning experiences.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>TL;DR</title>
      <p>In this paper, we explore the connections between LLMs, KE, and HHAI in deploying GenAI-LLM tutors
using the Socratic dialogue for teaching. We present a recommendation for the future development and
evaluation of hybrid models with new benchmarks and metrics.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The authors would like to thank Prof. Dr. Stefan Dietze and Prof. Dr. Wolfgang Nejdl for constructive
feedback. This collaboration was enabled by the L3S/TIB/GESIS Workshop 2024. The paper was inspired
by the discussions in HHAI 2024: Hybrid Human AI Systems for the Social Good.</p>
      <p>[18] J. Conklin, A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives, complete edition, 2005.</p>
      <p>[19] E. Ilkou, H. Abu-Rasheed, D. Chaves-Fraga, E. Engelbrecht, E. Jiménez-Ruiz, J. E. Labra-Gayo, Teaching knowledge graph for knowledge graphs education, Semantic Web Journal (under submission).</p>
      <p>[20] E. Ilkou, E. Jiménez-Ruiz, Towards a knowledge graph for teaching knowledge graphs, in: Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA, CEUR, 2024.</p>
      <p>[21] C. D. Jaldi, E. Ilkou, N. Schroeder, C. Shimizu, Education in the era of neurosymbolic AI, Journal of Web Semantics (2024) 100857.</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[23] V. Sanh, DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).</p>
      <p>[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.</p>
      <p>[25] K. Shridhar, J. Macina, M. El-Assady, T. Sinha, M. Kapur, M. Sachan, Automatic generation of Socratic subquestions for teaching math word problems, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 4136–4149. URL: https://aclanthology.org/2022.emnlp-main.277. doi:10.18653/v1/2022.emnlp-main.277.</p>
      <p>[26] B. H. Ang, S. D. Gollapalli, S. K. Ng, Socratic question generation: A novel dataset, models, and evaluation, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 147–165.</p>
      <p>[27] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, J. Schulman, Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Knezic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wubbels</surname>
          </string-name>
          , E. Elbers,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hajer</surname>
          </string-name>
          ,
          <article-title>The socratic dialogue and teacher education, Teaching and teacher education 26 (</article-title>
          <year>2010</year>
          )
          <fpage>1104</fpage>
          -
          <lpage>1111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mukundan</surname>
          </string-name>
          ,
          <article-title>The use of socratic method as a teaching/learning tool to develop students' critical thinking: A review of literature, Language in India 15 (</article-title>
          <year>2015</year>
          )
          <fpage>256</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonino</surname>
          </string-name>
          , G. Sanmartino,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Pinheiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Michiardi</surname>
          </string-name>
          ,
          <article-title>Fine tuning a large language model for socratic interactions</article-title>
          ,
          <source>in: Proceedings of the Workshop On AI For</source>
          Education (
          <article-title>AI4EDU), in conjunction with the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM</article-title>
          , ACM Press, Barcelona,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shetye</surname>
          </string-name>
          ,
          <article-title>An evaluation of khanmigo, a generative ai tool, as a computer-assisted language learning app</article-title>
          ,
          <source>Studies in Applied Linguistics and TESOL</source>
          <volume>24</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ilkou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abu-Rasheed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tavakoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hakimov</surname>
          </string-name>
          , G. Kismihók,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , W. Nejdl,
          <article-title>EduCOR: An educational and career-oriented recommendation ontology</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2021</year>
          , pp.
          <fpage>546</fpage>
          -
          <lpage>562</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Siren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tzerpos</surname>
          </string-name>
          ,
          <article-title>Automatic learning path creation using oer: a systematic literature mapping</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>493</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Avdic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. A.</given-names>
            <surname>Wissa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hatakka</surname>
          </string-name>
          ,
          <article-title>Socratic flipped classroom: What types of questions and tasks promote learning?</article-title>
          , in: European Conference on e-Learning, Academic Conferences International Limited,
          <year>2016</year>
          , p.
          <fpage>41</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Overholser</surname>
          </string-name>
          ,
          <article-title>Elements of the socratic method: Iii. universal definitions</article-title>
          .,
          <source>Psychotherapy: Theory, Research</source>
          , Practice, Training
          <volume>31</volume>
          (
          <year>1994</year>
          )
          <fpage>286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>AlKhuzaey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <article-title>Text-based question difficulty prediction: A systematic review of automatic approaches</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Cornelius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Sting</surname>
          </string-name>
          ,
          <article-title>Ai meets the classroom: When does chatgpt harm learning?</article-title>
          ,
          <source>arXiv preprint arXiv:2409.09047</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] J. Cheng, M. Marone,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lawrie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Durme</surname>
          </string-name>
          ,
          <article-title>Dated data: Tracing knowledge cutofs in large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.12958. arXiv:
          <volume>2403</volume>
          .
          <fpage>12958</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Waldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Boussard</surname>
          </string-name>
          ,
          <article-title>Gpts and hallucination: Why do large language models hallucinate?</article-title>
          ,
          <source>Queue</source>
          <volume>22</volume>
          (
          <year>2024</year>
          )
          <fpage>19</fpage>
          -
          <lpage>33</lpage>
          . URL: https://doi.org/10.1145/3688007. doi:
          <volume>10</volume>
          .1145/3688007.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A systematic survey of prompt engineering in large language models: Techniques and applications</article-title>
          ,
          <source>CoRR abs/2402.07927</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2402.07927. doi:10.48550/ARXIV.2402.07927. arXiv:2402.07927.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Linzbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kallmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Evang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Dissecting paraphrases: The impact of prompt syntax and supplementary information on knowledge retrieval from pretrained language models</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>3645</fpage>
          -
          <lpage>3655</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.201. doi:10.18653/v1/2024.naacl-long.201.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Béchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Ayala</surname>
          </string-name>
          ,
          <article-title>Reducing hallucination in structured outputs via retrieval-augmented generation</article-title>
          ,
          <source>CoRR abs/2404.08189</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2404.08189. doi:10.48550/ARXIV.2404.08189. arXiv:2404.08189.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>CoRR abs/2312.10997</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2312.10997. doi:10.48550/ARXIV.2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ilkou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Signer</surname>
          </string-name>
          ,
          <article-title>A technology-enhanced smart learning environment based on the combination of knowledge graphs and learning paths</article-title>
          , in:
          <source>CSEDU (2)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>461</fpage>
          -
          <lpage>468</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>