<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating LLMs' Performance At Automatic Short-Answer Grading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rositsa V. Ivanova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siegfried Handschuh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of St. Gallen</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, the use of Large Language Models (LLMs) has become more accessible and widespread. With free-of-charge access, people have begun applying the models to various tasks beyond next-word prediction. In an exploratory study, we take a closer look at the use of LLMs for Automatic Short Answer Grading. We compare the grading of short-answer tasks by two human graders to that of an LLM. We discuss the results and present examples of observed shortcomings in the annotation and grading.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic short-answer grading</kwd>
        <kwd>large language models</kwd>
        <kwd>automated scoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have become our assistants in many everyday activities. Over
the last few years, the speed at which new models are developed has become overwhelming
to daily users, researchers, politicians, and lawmakers struggling to keep up with all options
and opportunities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Yet, their application has been explored and accepted in various domains
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        Automatic Short Answer Grading (ASAG) systems emerged as an educational technology addressing the need for efficient assessment methods in both online and traditional educational environments long before the current wave of LLMs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The primary objective of ASAG systems is to automatically evaluate and score students’ responses to short-answer questions. The difficulty of the task arises from the brevity of the texts, often just a few words, and thus the limited available context [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. One of the approaches to the task of ASAG for closed-ended questions is
the comparison of the student answer to a predefined correct answer [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The developments
in ASAG have been heavily influenced by advancements in Natural Language Processing (NLP)
and Machine Learning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
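One way to make the comparison-based approach concrete is a toy token-overlap grader. The sketch below is our own illustration under assumed function names and 0-5 scaling, not the method of any cited system:

```python
# A minimal sketch (our own illustration, not the approach of any cited
# ASAG system): grade a student answer by its token overlap with a
# predefined correct answer, scaled to a 0-5 grade.
import re

def token_set(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_grade(student, reference, max_grade=5):
    """Fraction of reference tokens covered by the student answer, scaled."""
    ref = token_set(reference)
    if not ref:
        return 0.0
    recall = len(token_set(student).intersection(ref)) / len(ref)
    return round(max_grade * recall, 2)

reference = "A list of size one is already sorted."
full = overlap_grade("A list of size one is already sorted.", reference)  # full credit
partial = overlap_grade("One element is sorted.", reference)              # partial credit
```

Real systems replace the token overlap with semantic similarity measures, which is exactly where the limited context of short answers makes the task hard.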
      <p>
        Accordingly, LLMs have found applications in the creation of datasets and tools. While
they are of great help for generic tasks such as answering questions or writing text [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], they
often fall short when applied to domain-specific tasks [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. One primary concern is the risk that LLMs amplify biases present in their training data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Further, it remains a challenge to ensure the factual accuracy and relevance of the content generated by LLMs [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Previous attempts have used Retrieval-Augmented Generation to incorporate external sources and enrich LLMs’ answers with knowledge, improving the factual grounding and thus the safety of answers [17, 18, 19]. However, such approaches rely on knowledge databases and annotated datasets to learn from, which underlines the critical importance of creating high-quality gold-standard datasets [20, 21].
      </p>
      <p>We explore the use of LLMs for the automated grading of short-answer texts as an example of a complex task that requires understanding a brief answer given nothing more than a sample solution. Our exploratory study addresses the question of whether LLMs have implicitly learned to perform well on specific NLP tasks (e.g. ASAG). We believe that understanding the shortcomings of LLMs is one of many steps towards developing more suitable annotation approaches for LLM-supported automated grading.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiment</title>
      <p>We compare the grading of students’ answers to exam questions done by two people to that of a popular, widely used, free-of-charge LLM (i.e. ChatGPT-3.5). We acknowledge that the chosen model is merely one amongst many, each with individual strengths and weaknesses, and that it is continuously updated. However, due to the widespread use of the model in various domains and the exploratory scope of this study, we build our use case on ChatGPT-3.5, while pointing out the limitations of our choice.</p>
      <p>Human annotation The initial dataset of this experiment was created in two steps. First, Mohler and Mihalcea [22] graded the assignments of undergraduate students in an introductory computer science (CS) course. The 630 short answers given by 30 students were evaluated by two graduate CS students on an interval scale from 0 to 5. The second dataset extended the former, expanding the total number of short answers to 2 273 [23]. The new texts were graded by the same two people, this time on a scale from 0 to 10, with half points given in some cases. Converting this scale to an equivalent from 0 to 5 led to rational numbers with increments of 0.25 for some of the grades. For the purpose of our study, we kept only the answers that received a whole-number grade, since comparing grades of varying initial granularity (i.e. only whole numbers for the first part and a mix for the second) would introduce unnecessary bias; 89% of the answers (2 022) received whole-number grades.</p>
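One plausible reading of the scale conversion (halving the 0-10 grades, so that half points become 0.25 fractions) can be sketched as follows; the exact procedure is not specified here, so both the conversion and the filtering step are assumptions:

```python
# Assumed conversion: halving maps the 0-10 scale onto 0-5; half points
# (e.g. 8.5) become grades with 0.25 fractions (4.25).
def to_five_point(grade_out_of_ten):
    return grade_out_of_ten / 2.0

# Keep only answers whose converted grade is a whole number, mirroring
# the reduced dataset described above.
def whole_number_only(grades):
    return [g for g in grades if float(g).is_integer()]

converted = [to_five_point(g) for g in [10, 8.5, 8, 7, 6]]  # 5.0, 4.25, 4.0, 3.5, 3.0
kept = whole_number_only(converted)                         # 5.0, 4.0, 3.0
```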
      <p>ChatGPT The prompt consisted of an instruction including the grading scale, the initial question, the desired correct answer, and the student answer. To gain better insight into the grading decisions, we requested a text comment for each grade selection.</p>
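A prompt with the components listed above might look like the following; the exact wording used in the study is not given, so this template is purely illustrative:

```python
# Hypothetical reconstruction of the grading prompt: the study's actual
# instruction text is not reproduced here, so every string is assumed.
def build_prompt(question, reference, student):
    return (
        "You are grading a short answer on a scale from 0 to 5, "
        "where 5 is fully correct and 0 is fully incorrect.\n"
        "Question: " + question + "\n"
        "Correct answer: " + reference + "\n"
        "Student answer: " + student + "\n"
        "Reply with the grade and a short comment justifying your decision."
    )

prompt = build_prompt(
    "What is the role of a header-file?",
    "To allow the compiler to recognize the classes when used elsewhere.",
    "Allow compiler to recognize the classes when used elsewhere",
)
```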
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We compared the grading of the human annotators and ChatGPT in multiple steps and using various approaches. First, we compare the grades given to the answers by the first grader (H1) and the second grader (H2). Second, we compare each of them to the score automatically assigned by ChatGPT. For the three pairs, we derive a simple percentage of inter-annotator agreement (IAA), evaluate the agreement beyond chance (Kappa Score), the agreement with a focus on the severity of disagreement (Weighted Kappa Score), and the linear correlation between the scorings (Pearson’s Correlation Coefficient). A detailed discussion of the choice of correlation metric is provided by the dataset creators [22].</p>
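The four measures can be computed with the standard library alone, as in the sketch below (toy grades for illustration, not the study's data):

```python
# Percent agreement, Cohen's kappa (plain and linearly weighted), and
# Pearson's correlation for two lists of integer grades. Toy data only.
from collections import Counter
from math import sqrt

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def kappa(a, b, weighted=False):
    """1 - observed/expected disagreement; linear weights if requested."""
    cats = sorted(set(a).union(b))
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    obs = Counter(zip(a, b))
    w = (lambda i, j: abs(i - j)) if weighted else (lambda i, j: float(i != j))
    disagree_obs = sum(w(i, j) * obs[(i, j)] for i in cats for j in cats) / n
    disagree_exp = sum(w(i, j) * pa[i] * pb[j] for i in cats for j in cats) / n ** 2
    return 1.0 - disagree_obs / disagree_exp

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

h1 = [5, 4, 3, 5, 2, 4, 5, 1]   # illustrative grades, not the dataset
gpt = [4, 4, 2, 5, 2, 3, 4, 1]
```

Writing kappa as one minus the ratio of observed to chance-expected disagreement gives the plain and weighted variants from a single function; with identity weights it reduces to the familiar (po - pe) / (1 - pe).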
      <p>[Table 1: Inter-annotator Score, Kappa Score, Weighted Kappa Score, and Pearson’s Correlation Coefficient for the pairs H1 &amp; H2, H1 &amp; ChatGPT, H2 &amp; ChatGPT, and H* &amp; ChatGPT. The individual values are discussed in the text.]</p>
      <p>Table 1 depicts the results for each pair and score. The agreement between the two human
annotators (i.e. H1 &amp; H2) served as a benchmark for expected IAA. The Inter-annotator Score
was 60.88%, indicating that both human annotators agreed on grades more than half of the time.
The Kappa Score (0.295) indicates agreement below the moderate range (0.41-0.60), underscored by the Weighted Kappa Score of 0.395, showing a slightly better but still modest agreement. However, considering the applied grading scale, the Pearson’s Correlation Coefficient (0.586) reflects a moderate positive correlation between the two sets of grades.</p>
      <p>On the contrary, the comparison between each human annotator and ChatGPT (i.e. H1 &amp; ChatGPT; H2 &amp; ChatGPT) reveals a lower level of agreement. For H1 &amp; ChatGPT, the Inter-annotator Score, the Kappa Score, and the Weighted Kappa Score indicate minimal agreement beyond what would be expected by chance. A surprisingly high value is achieved for the Pearson’s Correlation Coefficient at 0.628, suggesting a stronger correlation. One explanation for this could be the different grading distributions of H1 and H2. The agreement between the second human annotator (H2) and ChatGPT was even lower for all of the measures, yet here too the Pearson’s Correlation Coefficient remained high, indicating a moderate correlation despite the low agreement scores.</p>
      <p>In addition to the evaluation for the three pairs, we created a subset of the initial dataset (with 1 231 answers) in which H1 and H2 agreed on the grade (i.e. H*). We view these instances as examples of answers that were graded more objectively and where the assignment of the grade may be more straightforward. We calculated the IAA measures for this subset against ChatGPT. This yielded an Inter-annotator Score of 33.96%, the highest of the scores achieved by pairs including ChatGPT. However, here too the Kappa and Weighted Kappa Scores remained noticeably lower. This suggests that even when humans were in agreement, ChatGPT’s grading did not significantly align with the human consensus. The Pearson’s Correlation Coefficient was 0.537, indicating a moderate positive correlation but not a strong agreement.</p>
      <p>In summary, while we observe a moderate level of agreement between human annotators,
the agreement between ChatGPT and the humans is significantly lower. However, the Pearson’s
Correlation Coefficients suggest there is still a moderate positive relationship in the grading
patterns between humans and ChatGPT. The results indicate that while ChatGPT can follow a
grading pattern similar to humans to some extent, the consistency of these grades with human
annotators varies and is generally lower than the human-human agreement levels.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Bias. In our reduced dataset, the grading of H1 and H2 overlapped in only 60.88% of the cases. In the remaining cases, H2 demonstrated a bias in their grading, giving the higher grade to 76.61% of the answers. While Mohler et al. [23] describe this as a “real-world [issue] associated with the task of grading”, such subjectivity can also be perceived as a strength of human annotation. Plank [24] criticizes the assumption that a single gold label should be assigned to instances, as it diminishes the variety in opinions and interpretations of human language. Particularly when creating new gold standards, such richness in the annotation may be an essential step towards reducing bias in models trained on them [25]. In this context, we observe that ChatGPT assigned lower grades than H1 and H2 in 79.56% and 94.03% of all cases of disagreement, respectively.</p>
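The direction-of-bias figures above amount to a simple computation over the disagreeing cases; the helper below is a sketch on toy grades (`higher_share` is our own name, and the data is illustrative):

```python
# Among the cases where two graders disagree, return the share in which
# the second grader gave the higher grade.
def higher_share(grades_a, grades_b):
    disagreements = [(a, b) for a, b in zip(grades_a, grades_b) if a != b]
    if not disagreements:
        return 0.0
    return sum(b > a for a, b in disagreements) / len(disagreements)

h1 = [5, 4, 3, 5, 2]   # toy grades, not the study's data
h2 = [5, 5, 4, 5, 1]
share = higher_share(h1, h2)  # two of the three disagreements favour h2
```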
      <p>[Table 2: Pairs of similar student answers graded by H1, H2, and ChatGPT. Q1: What is the base case for a recursive implementation of merge sort? Answers: “Best case is one element. One element is sorted.” / “A list size of 1, where it is already sorted.” Q2: When does C++ create a default constructor? Answers: “whenevery you dont specifiy your own” / “When you dont specify any constructors.” Q3: What is the role of a header-file? Answers: “To allow the compiler to recognize the classes when used elsewhere.” / “Allow compiler to recognize the classes when used elsewhere”. The grades assigned to each answer are discussed in the text.]</p>
      <p>Inconsistency. Next, we took a closer look at exam tasks that were answered very similarly by students yet received different grades. We manually grouped similar answers to the same questions. While we discovered some inconsistencies in the human annotation within these groups, ChatGPT provided varying grades and differing justifications within nearly all of the answer groups. Table 2 provides three such examples. In Q1 and Q2, both graders consistently assigned the highest mark to the pairs of similar answers. In both cases ChatGPT gave different marks.</p>
      <p>Similar observations were made by Duong and Solomon [26], in particular when the authors asked the same questions multiple times. Filighera et al. [27] discuss potential weaknesses of LLMs that can easily be manipulated via minor changes in the syntax of an answer (e.g. adding adjectives and adverbs). Depending on the manipulation, Filighera et al. [28] discovered that students even manage to pass a 50% threshold on an exam “without answering a single question correctly”. This underlines the difficulty of automating tasks such as ASAG. Such variations can be crucial when two answers are assessed as equivalent by a human yet distinguished by an LLM due to differences a human would consider negligible (e.g. an extra whitespace character or a period at the end of an answer).</p>
      <p>The third example (Q3) depicts a case where one of the annotators also graded the answers differently, despite the high similarity of the texts. As mentioned by the authors of the initial dataset, one of the graders (i.e. H2) frequently assigned higher grades. In addition, H2 also tended to grade similar answers differently more frequently than H1, for whom this was a rare exception. These results indicate that there may be a need for finer-grained grading (i.e. annotation) guidelines to reduce the discrepancies between graders.</p>
      <p>The results shed light on some issues associated with human annotation. One noteworthy issue is the low inter-annotator scores achieved by the human annotators. Previous work has suggested the use of finer-grained, precise annotation guidelines to achieve higher annotation accuracy [29, 30]. Additionally, human annotation can be time-consuming and costly [31], which leads dataset creators to look for alternatives such as the use of LLMs.</p>
      <p>
        Large Language Models (LLMs) like ChatGPT present their own set of challenges. One issue is that closed-source models like GPT-3.5 are fundamentally different from their successors (e.g. GPT-4), making it difficult to understand and predict their behavior. While open-source models are accessible, they often become large ’black boxes’ that are challenging to interpret or understand fully [32]. Providing more precise instructions to LLMs could potentially improve their performance. Yet, we need to consider the risk that they may still miss nuances that are easily spotted by human annotators, especially in complex or subtle domains. Lastly, the use of LLMs such as ChatGPT requires a substantial computational infrastructure [
        <xref ref-type="bibr" rid="ref15">33, 15</xref>
        ], posing the question of whether the same (if not better) performance can be achieved without their excessive use.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Generalization of the results to other domains may not be trivial; however, the results of this study already hint at the need for further research into the potential use of LLMs as an aid for domain-specific tasks such as ASAG. At this stage, we believe that the ability of humans to interpret and detect nuances in brief answers remains unmatched. Due to the complexity of the task, its time-intensive nature, and the costs associated with manual annotation, the use of LLMs as support in the annotation process for domain-specific datasets should be explored further.</p>
      <p>[17] F. Hill, R. Reichart, A. Korhonen, Simlex-999: Evaluating semantic models with (genuine) similarity estimation, Computational Linguistics 41 (2015) 665–695.</p>
      <p>[18] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.</p>
      <p>[19] M. Glass, G. Rossiello, M. F. M. Chowdhury, A. Naik, P. Cai, A. Gliozzo, Re2G: Retrieve, rerank, generate, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2701–2715.</p>
      <p>[20] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, W. Redmond, M. B. McDermott, Publicly available clinical bert embeddings, NAACL HLT 2019 (2019) 72.</p>
      <p>[21] D. Song, S. Gao, B. He, F. Schilder, On the effectiveness of pre-trained language models for legal natural language processing: An empirical study, IEEE Access 10 (2022) 75835–75858.</p>
      <p>[22] M. Mohler, R. Mihalcea, Text-to-text semantic similarity for automatic short answer grading, in: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), 2009, pp. 567–575.</p>
      <p>[23] M. Mohler, R. Bunescu, R. Mihalcea, Learning to grade short answer questions using semantic similarity measures and dependency graph alignments, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 752–762.</p>
      <p>[24] B. Plank, The “problem” of human label variation: On ground truth in data, modeling and evaluation, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10671–10682.</p>
      <p>[25] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., Chatgpt for good? On opportunities and challenges of large language models for education, Learning and Individual Differences 103 (2023) 102274.</p>
      <p>[26] D. Duong, B. D. Solomon, Analysis of large-language model versus human performance for genetics questions, European Journal of Human Genetics (2023) 1–3.</p>
      <p>[27] A. Filighera, S. Ochs, T. Steuer, T. Tregel, Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs, International Journal of Artificial Intelligence in Education (2023) 1–31.</p>
      <p>[28] A. Filighera, T. Steuer, C. Rensing, Fooling automatic short answer grading systems, in: International Conference on Artificial Intelligence in Education, Springer, 2020, pp. 177–190.</p>
      <p>[29] A. Rigouts Terryn, V. Hoste, E. Lefever, In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora, Language Resources and Evaluation 54 (2020) 385–418.</p>
      <p>[30] R. Ivanova, M. Van Erp, S. Kirrane, Comparing annotated datasets for named entity recognition in English literature, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3788–3797.</p>
      <p>[31] I. Habernal, I. Gurevych, Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2127–2137.</p>
      <p>[32] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., Opt: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022).</p>
      <p>[33] T. Schick, H. Schütze, It’s not just size that matters: Small language models are also few-shot learners, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2339–2352.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <article-title>The rapid competitive economy of machine learning development: a discussion on the social risks and benefits</article-title>
          ,
          <source>AI and Ethics</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-T.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Health care trainees' and professionals' perceptions of chatgpt in improving medical knowledge training: rapid survey study</article-title>
          ,
          <source>Journal of Medical Internet Research</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <article-title>e49385</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>What is the impact of chatgpt on education? a rapid review of the literature</article-title>
          ,
          <source>Education Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>410</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Weisz</surname>
          </string-name>
          ,
          <article-title>The programmer's assistant: Conversational interaction with a large language model for software development</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>491</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Using lexical semantic techniques to classify free-responses</article-title>
          , Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Galhardi</surname>
          </string-name>
          , R. C. T. de Souza,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brancher</surname>
          </string-name>
          ,
          <article-title>Automatic grading of portuguese short answers using a machine learning approach</article-title>
          , in: Anais Estendidos do XVI Simpósio Brasileiro de Sistemas de Informação, SBC,
          <year>2020</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Willms</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Padó</surname>
          </string-name>
          ,
          <article-title>A transformer for sag: What does it grade?</article-title>
          ,
          <source>in: Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Hasanah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Permanasari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kusumawardani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Pribadi</surname>
          </string-name>
          ,
          <article-title>A review of an information extraction technique approach for automatic short answer grading</article-title>
          ,
          <source>in: 2016 1st International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <article-title>An automatic short-answer grading model for semi-open-ended questions</article-title>
          ,
          <source>Interactive learning environments 30</source>
          (
          <year>2022</year>
          )
          <fpage>177</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joorabchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <article-title>On deep learning approaches to automated assessment: Strategies for short answer grading</article-title>
          .,
          <source>CSEDU (2)</source>
          (
          <year>2022</year>
          )
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Taecharungroj</surname>
          </string-name>
          ,
          <article-title>“What can chatgpt do?” analyzing early reactions to the innovative ai chatbot on twitter</article-title>
          ,
          <source>Big Data and Cognitive Computing</source>
          <volume>7</volume>
          (
          <year>2023</year>
          )
          <fpage>35</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Creswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <article-title>Selection-inference: Exploiting large language models for interpretable logical reasoning</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mekala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wolfe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Zerotop: Zero-shot task-oriented semantic parsing using large language models</article-title>
          ,
          <source>in: Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Blukis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mousavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tremblay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <article-title>Progprompt: Generating situated robot task plans using large language models</article-title>
          ,
          <source>in: ICRA</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11523</fpage>
          -
          <lpage>11530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goodrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <article-title>Assessing the factual accuracy of generated text</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>