<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deduction under Perturbed Evidence: Probing Student Simulation (Knowledge Tracing) Capabilities of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashank Sonkar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard G. Baraniuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Reasoning, GPT, Student Simulation Models, Knowledge Tracing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rice University</institution>
          ,
          <addr-line>Houston, Texas</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and to identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired by student simulation models, a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/lufycodes/gpt-knowledge-tracing.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Reasoning</kwd>
        <kwd>GPT</kwd>
        <kwd>Student Simulation Models</kwd>
        <kwd>Knowledge Tracing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last several years, Transformer models have played a significant role in shaping the
field of Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]. Their exceptional ability to reason
across a broad range of NLP tasks [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] has been a key factor contributing to their success.
The success of LLMs on challenging datasets like HellaSwag [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], AI2 Reasoning Challenge
(ARC) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], WinoGrande [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and GSM-8K [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is a testament to their advanced reasoning
skills and their potential to address challenging NLP tasks.
      </p>
      <p>
        In this paper, we investigate the reasoning abilities of LLMs under a novel paradigm
we dub Deduction under Perturbed Evidence (DUPE for short). By testing LLMs’ capacity to
reason with flawed or perturbed evidence, we aim to determine whether LLMs can generate
logically sound yet erroneous conclusions when presented with misleading information. Strong
DUPE skills are critical in NLP applications like student simulations [
        <xref ref-type="bibr" rid="ref14">14, 15</xref>
        ], where models
simulate student responses to understand how they may respond in certain scenarios. As
student responses often contain inaccuracies and misconceptions, it is important for a model
to analyze and utilize these inaccuracies and misconceptions as evidence to arrive at the same
conclusion as the student. For instance, a student may have the misconception that the heavier
an object is, the faster it falls, leading them to conclude that a bowling ball will fall faster than a
ball bearing. If we provide LLMs with evidence that a heavier object falls faster, would LLMs
also arrive at the conclusion that a bowling ball will fall faster than a ball bearing? We introduce
DUPE as our approach to investigate this question.
      </p>
      <p>Contributions: This paper develops a novel reasoning paradigm – Deduction under
Perturbed Evidence (DUPE) – to examine whether LLMs arrive at different conclusions when
presented with distorted initial facts. To test the DUPE capabilities of LLMs, we create a DUPEd
version of the StrategyQA dataset (Figures 1, 2). StrategyQA [16] is an open-domain QA dataset
that is characterized by its explicit provision of the necessary facts required to answer each
yes-no question. In the DUPEd version of the dataset, we manipulate the facts provided in a
way that results in a different answer to the original question.</p>
      <p>Our findings reveal that state-of-the-art LLMs, including GPT3.5 and GPT4, struggle
significantly on the newly introduced DUPEd-StrategyQA dataset. The accuracy of these models
dropped drastically by approximately 45%, falling from an impressive 91.9% on the original
dataset to only 46.7% on the DUPEd-StrategyQA dataset. In addition, we conduct an ablation
study on the DUPEd-StrategyQA dataset by categorizing it into two distinct parts based on
the type of manipulation used – one involving language perturbations and the other involving
mathematical manipulations. Furthermore, our results demonstrate that the accuracy drop
can be mitigated by using prompt settings inspired by student simulation models. This
approach reduced the accuracy drop to 29%, with the models achieving an accuracy of 62.7% on
the DUPEd-StrategyQA dataset. Our findings carry crucial implications for practical LLM
applications, particularly in the realm of student simulation models that demand reasoning over
erroneous information.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology, Dataset, and Prompting</title>
      <p>In this section, we overview the DUPE reasoning framework, provide details on the DUPEd
version of AllenAI’s StrategyQA dataset, and then explore customized prompt settings designed
to assess the DUPE skills of LLMs.</p>
      <sec id="sec-2-0">
        <title>2.1. DUPE</title>
        <p>Given a true-false question Q, the correct response A ∈ {T, F}, and the facts E that
determine the truth or falsehood of Q, we change E to E′ such that the correct response to Q
flips to ¬A under the altered facts E′:</p>
        <p>DUPE((Q, E, A)) = (Q, E′, A′)
s.t. A′ = ¬A, editdist(E, E′) &lt; d,
(1)
where editdist ensures that the edit distance between the fact strings E and E′ is less than a
threshold d. The threshold d is generally set to two to three words to ensure minimal changes to
the underlying facts (examples in Figure 2). The new DUPEd tuple (Q, E′, A′) can be used to
probe the DUPE capabilities of LLMs, as shown in Figure 1.</p>
      </sec>
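      <p>As a concrete illustration of Eq. (1), the following minimal Python sketch (our own
illustration under assumed symbol names, not code from the released repository) checks whether a
candidate DUPEd tuple satisfies the two constraints: the answer must flip, and the word-level edit
distance between the original and perturbed facts must stay below the threshold d.</p>
      <preformat>
# Minimal sketch of the DUPE constraint in Eq. (1); illustrative only,
# not the paper's released code.

def word_edit_distance(original: str, perturbed: str) -&gt; int:
    """Levenshtein distance computed over words rather than characters."""
    a, b = original.split(), perturbed.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete a word
                           cur[j - 1] + 1,              # insert a word
                           prev[j - 1] + (wa != wb)))   # substitute a word
        prev = cur
    return prev[-1]

def is_valid_dupe(original, duped, d=3):
    """Eq. (1): (Q, E, A) maps to (Q, E', A') with A' = not A and editdist(E, E') &lt; d."""
    (q, e, a), (q_p, e_p, a_p) = original, duped
    return q_p == q and a_p != a and word_edit_distance(e, e_p) &lt; d
      </preformat>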
      <sec id="sec-2-1">
        <title>2.2. DUPEd-StrategyQA</title>
        <p>We use AllenAI’s StrategyQA dataset [16] to assess the DUPE skills of LLMs. The StrategyQA
dataset provides explicit facts for answering open-domain questions. We create a DUPEd
version of the StrategyQA dataset composed of a total of 325 examples, of which 173 introduce
natural language perturbations, while the remainder introduce mathematical errors (refer to
examples in Figure 2).</p>
        <p>While designing the DUPEd version, we were careful to modify the facts in the most minimal
way possible. As a result, we made a conscious effort to only alter one or two words in the
original facts whenever possible, in order to preserve the overall meaning and context of the
original question. Additionally, we refrained from using explicit negation, such as the word
not, to modify the facts, since our intent is not to evaluate the reasoning proficiency of LLMs in
handling negation.</p>
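        <p>To make these curation constraints explicit, the hypothetical filter below (our own sketch
reusing the word_edit_distance helper from Section 2.1; the paper does not describe an automated
filter) would reject candidate perturbations that introduce an explicit negation word or alter more
than a couple of words:</p>
        <preformat>
# Hedged sketch of the curation constraints described above; the paper does not
# describe an automated filter, so this only illustrates the stated rules.

NEGATION_TOKENS = {"not", "no", "never", "none"}

def introduces_negation(original: str, perturbed: str) -&gt; bool:
    """True if the perturbation adds an explicit negation word."""
    new_tokens = set(perturbed.lower().split()) - set(original.lower().split())
    return any(tok in NEGATION_TOKENS or tok.endswith("n't") for tok in new_tokens)

def acceptable_perturbation(original: str, perturbed: str, max_word_edits: int = 2) -&gt; bool:
    """Keep edits that change at most a couple of words and avoid explicit negation."""
    return (not introduces_negation(original, perturbed)
            and word_edit_distance(original, perturbed) &lt;= max_word_edits)
        </preformat>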
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Student Simulation and Prompt Design</title>
        <p>
          DUPE is highly relevant to student simulation models [
          <xref ref-type="bibr" rid="ref14">14, 17, 15</xref>
          ], which are widely used in
education and cognitive psychology research. These models help in predicting and understanding
student responses to various tasks, and thus their ability to reason over false information is
critical to their success. Given this strong connection between simulation models and DUPE,
these models can inspire innovative approaches to prompt design, which can be used to probe
DUPE skills of LLMs [
          <xref ref-type="bibr" rid="ref8 ref18">8, 18</xref>
          ]. An example of such a prompt is illustrated in Figure 1 and Section 3.
        </p>
        <p>DUPE and Counterfactual Reasoning: Counterfactual reasoning and student simulation
models require different types of reasoning. In counterfactual reasoning, the focus is on
exploring hypothetical scenarios that may or may not correspond to actual reality. The fact that
the information being considered is hypothetical or counterfactual is usually known beforehand.</p>
        <p>In contrast, a student simulation model needs to reason about both true and false information,
and may not know beforehand whether the information being considered is true or false. For
example, in figure 2, the model lacks prior knowledge about which facts are true and which ones
are perturbed. The model must identify incorrect answers from the student to make inferences
about future questions, which requires robust and nuanced reasoning capabilities beyond those
needed for counterfactual reasoning.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>We evaluate the DUPE capabilities of the two largest GPT models – GPT3.5 (version
gpt-3.5-turbo-0301) and the latest GPT4 model (version gpt-4-0314) – via experiments under two
different prompt settings, P1) “You are a question answering model. Your task is reason on
provided evidence to answer a YES or NO question”, and P2) “You are a student simulation
model. Your task is reason on student’s responses to accurately measure the student’s current
knowledge state and predict the student’s response to a YES or NO question based on the
student’s current knowledge state” from Section 2.3. An example is illustrated in Figure 1.</p>
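      <p>Concretely, each evaluation can be run with calls of the following form. This is a hedged
sketch: the model version strings and the P1/P2 system prompts are taken from the paper, while the
OpenAI client usage and the user-message format are our own assumptions; the exact prompts are
available in the linked repository.</p>
      <preformat>
# Hedged sketch of querying the two prompt settings; requires an OpenAI API key
# in the environment. The user-message format here is an assumption.
from openai import OpenAI

client = OpenAI()

P1 = ("You are a question answering model. Your task is reason on provided "
      "evidence to answer a YES or NO question")
P2 = ("You are a student simulation model. Your task is reason on student's "
      "responses to accurately measure the student's current knowledge state "
      "and predict the student's response to a YES or NO question based on "
      "the student's current knowledge state")

def ask(system_prompt: str, perturbed_facts: str, question: str,
        model: str = "gpt-4-0314") -&gt; str:
    """Send the perturbed evidence and the yes/no question under one prompt setting."""
    response = client.chat.completions.create(
        model=model,  # "gpt-3.5-turbo-0301" or "gpt-4-0314" in the paper
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Evidence: {perturbed_facts}\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
      </preformat>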
      <sec id="sec-3-1">
        <title>3.1. Main Results</title>
        <p>In the prompt setting P1, both GPT3.5 and GPT4 performed poorly on the DUPEd version of the
dataset, with decreases in accuracy of 46.0% and 45.2%, respectively. As expected, the latest
GPT4 model demonstrates superior performance to GPT3.5 on both the original and the DUPEd
StrategyQA datasets.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Student Simulation Prompt</title>
          <p>Prompt P2, inspired by the student simulation setting, informs/primes the models that the provided
evidence may be incorrect since the evidence reflects the erroneous nature of students’ responses.
We found that prompt setting P2 performs significantly better than P1 by a margin of 16.0%
for the GPT4 model. However, there was still a significant 29.2% drop in accuracy compared to
GPT4’s performance on the original dataset.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Language vs. Math Perturbations</title>
          <p>While curating the DUPEd-StrategyQA dataset, we divided the perturbations introduced into
two distinct categories - one that involved language perturbations, while the other manipulated
mathematical information (see Figure 2). Our findings suggest that both GPT models are more
resilient to math perturbations compared to language perturbations. For example, for GPT3.5 the
accuracy drops were 58.7% and 32.4% for language and math perturbations, respectively, while for
GPT4 the accuracy drops were 50.3% and 39.4%.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Root Cause of Poor DUPE Skills</title>
        <p>To explain the GPT models’ poor performance on the DUPEd dataset, we need to identify the
main factor influencing their reasoning process, i.e., whether it is the encoded information in
parameters or the manipulated evidence in prompts. Recent studies have shed light on this
issue, suggesting that factual information encoded in the parameters of LLMs plays a dominant
role in governing the generated output. For instance, the feed-forward layers in transformer
models function as key-value memories, which implies that they encode factual information, as
noted by Geva et al. [19]. Moreover, Meng et al. [20] demonstrated that localized computations,
such as Rank-One Model Editing (ROME), can modify these factual associations, leading to
alternative conclusions. These findings suggest that the encoded information in parameters has
a significant impact on LLMs’ reasoning process; further investigation is left for future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we have introduced a new reasoning paradigm we call Deduction under Perturbed
Evidence (DUPE for short). Through DUPE, we have assessed the ability of LLMs to
arrive at logically sound yet erroneous conclusions when faced with distorted initial facts. Our
study, which used a carefully curated dataset to evaluate DUPE abilities, has revealed that even
the most advanced GPT models struggle with logical reasoning in the presence of falsified
information. Moving forward, we plan to investigate the performance of different LLMs
with our dataset in varied prompt settings.</p>
    </sec>
    <sec id="sec-5">
      <title>Limitations</title>
      <p>Due to limitations in both financial and computational resources, we had to limit our testing
to only the most advanced LLMs – the GPT models. Consequently, we directed our attention
towards developing a dataset for evaluating the proposed reasoning scenarios. As a result of these
limitations, we chose to focus specifically on the evaluation of the two largest models offered
by OpenAI. While we recognize that other LLMs may produce different outcomes, we believe
that our dataset could serve as a valuable resource for further research into the capabilities and
limitations of LLMs.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work was supported by NSF grant 1842378, ONR grant N0014-20-1-2534, AFOSR grant
FA9550-22-1-0060, and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of Deep Bidirectional Transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/
          <year>1907</year>
          .11692 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/
          <year>1907</year>
          .11692.
          arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances In Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] OpenAI, GPT-4
          <source>technical report</source>
          ,
          <year>2023</year>
          .
          arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Language models are multilingual chain-of-thought reasoners</article-title>
          ,
          <source>arXiv preprint arXiv:2210.03057</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schärli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , E. Chi,
          <article-title>Least-to-most prompting enables complex reasoning in Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2205.10625</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          , E. Kamar,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          , et al.,
          <source>Sparks of Artificial General Intelligence: Early experiments with GPT-4</source>
          , arXiv preprint arXiv:
          <volume>2303</volume>
          .12712 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Hellaswag:
            <article-title>Can a machine really finish your sentence?</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>07830</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cowhey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schoenick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <article-title>Think you have solved Question Answering? Try ARC, the AI2 reasoning challenge</article-title>
          , arXiv preprint arXiv:
          <year>1803</year>
          .
          <volume>05457</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Winogrande:</surname>
          </string-name>
          <article-title>An adversarial winograd schema challenge at scale</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          , et al.,
          <article-title>Training verifiers to solve math word problems</article-title>
          , arXiv preprint arXiv:
          <volume>2110</volume>
          .14168 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <source>Deep Knowledge Tracing, Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] N. Liu, Z. Wang, R. Baraniuk, A. Lan, Open-ended knowledge tracing for computer science education, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3849–3862.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did Aristotle use a laptop? A Question Answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (2021) 346–361.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Sonkar, A. E. Waters, A. S. Lan, P. J. Grimaldi, R. G. Baraniuk, qDKT: Question-centric Deep Knowledge Tracing, arXiv preprint arXiv:2005.12442 (2020).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Bommarito II, D. M. Katz, GPT takes the Bar Exam, arXiv preprint arXiv:2212.14402 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Geva, R. Schuster, J. Berant, O. Levy, Transformer feed-forward layers are key-value memories, arXiv preprint arXiv:2012.14913 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual associations in GPT, Advances in Neural Information Processing Systems 35 (2022) 17359–17372.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>