<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deduction under Perturbed Evidence: Probing Student Simulation (Knowledge Tracing) Capabilities of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashank Sonkar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard G. Baraniuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Reasoning, GPT, Student Simulation Models, Knowledge Tracing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rice University</institution>
          ,
          <addr-line>Houston, Texas</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and to identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired by student simulation models, a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/lufycodes/gpt-knowledge-tracing.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Reasoning</kwd>
        <kwd>GPT</kwd>
        <kwd>Student Simulation Models</kwd>
        <kwd>Knowledge Tracing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last several years, Transformer models have played a significant role in shaping the
field of Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]. Their exceptional ability to reason
across a broad range of NLP tasks [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] has been a key factor contributing to their success.
The success of LLMs on challenging datasets like HellaSwag [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], AI2 Reasoning Challenge
(ARC) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], WinoGrande [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and GSM-8K [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is a testament to their advanced reasoning
skills and their potential to address challenging NLP tasks.
      </p>
      <p>
        In this paper, we investigate the reasoning abilities of LLMs under a novel paradigm
we dub Deduction under Perturbed Evidence (DUPE for short). By testing LLMs’ capacity to
reason with flawed or perturbed evidence, we aim to determine whether LLMs can generate
logically sound yet erroneous conclusions when presented with misleading information. Strong
DUPE skills are critical in NLP applications like student simulations [
        <xref ref-type="bibr" rid="ref14">14, 15</xref>
        ], where models
simulate student responses to understand how they may respond in certain scenarios. As
student responses often contain inaccuracies and misconceptions, it is important for a model
to analyze and utilize these inaccuracies and misconceptions as evidence to arrive at the same
conclusion as the student. For instance, a student may have the misconception that the heavier
an object is, the faster it falls, leading them to conclude that a bowling ball will fall faster than a
ball bearing. If we provide LLMs with evidence that a heavier object falls faster, would LLMs
also arrive at the conclusion that a bowling ball will fall faster than a ball bearing? We introduce
DUPE as our approach to investigate this question.
      </p>
      <p>Contributions: This paper develops a novel reasoning paradigm – Deduction under
Perturbed Evidence (DUPE) – to examine whether LLMs arrive at different conclusions when
presented with distorted initial facts. To test the DUPE capabilities of LLMs, we create a DUPEd
version of the StrategyQA dataset (Figures 1, 2). StrategyQA [16] is an open-domain QA dataset
that is characterized by its explicit provision of the necessary facts required to answer each
yes-no question. In the DUPEd version of the dataset, we manipulate the facts provided in a
way that results in a different answer to the original question.</p>
      <p>Our findings reveal that state-of-the-art LLMs, including GPT3.5 and GPT4, struggle
significantly on the newly introduced DUPEd-StrategyQA dataset. The accuracy of these models
dropped drastically by approximately 45%, falling from an impressive 91.9% on the original
dataset to only 46.7% on the DUPEd-StrategyQA dataset. In addition, we conduct an ablation
study on the DUPEd-StrategyQA dataset by categorizing it into two distinct parts based on
the type of manipulation used – one involving language perturbations and the other involving
mathematical manipulations. Furthermore, our results demonstrate that the accuracy drop
can be mitigated by using prompt settings inspired by student simulation models. This
approach reduced the accuracy drop to 29%, with the models achieving an accuracy of 62.7% on
the DUPEd-StrategyQA dataset. Our findings carry crucial implications for practical LLM
applications, particularly in the realm of student simulation models that demand reasoning over
erroneous information.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology, Dataset, and Prompting</title>
      <p>In this section, we overview the DUPE reasoning framework, provide details on the DUPEd
version of AllenAI’s StrategyQA dataset, and then explore customized prompt settings designed
to assess the DUPE skills of LLMs.</p>
      <sec id="sec-2-0">
        <title>2.1. DUPE</title>
        <p>Given a true-false question Q, the correct response A ∈ {T, F}, and the facts E that
determine the truth or falsehood of Q, we change E to E′ such that the correct response to Q
flips to ¬A under the altered facts E′:</p>
        <p>DUPE((Q, E, A)) = (Q, E′, A′)
s.t. A′ = ¬A, editdist(E, E′) &lt; d,
(1)
where editdist ensures that the edit distance between the fact strings E and E′ is less than a
threshold d. The threshold d is generally set to two to three words to ensure minimal changes to
the underlying facts (examples in Figure 2). The new DUPEd tuple (Q, E′, A′) can be used to
probe the DUPE capabilities of LLMs, as shown in Figure 1.</p>
      </sec>
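      <p>As a concrete illustration of Eq. (1), the following minimal Python sketch (our own
illustration under assumed symbol names, not code from the released repository) checks whether a
candidate DUPEd tuple satisfies the two constraints: the answer must flip, and the word-level edit
distance between the original and perturbed facts must stay below the threshold d.</p>
      <preformat>
# Minimal sketch of the DUPE constraint in Eq. (1); illustrative only,
# not the paper's released code.

def word_edit_distance(original: str, perturbed: str) -&gt; int:
    """Levenshtein distance computed over words rather than characters."""
    a, b = original.split(), perturbed.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete a word
                           cur[j - 1] + 1,              # insert a word
                           prev[j - 1] + (wa != wb)))   # substitute a word
        prev = cur
    return prev[-1]

def is_valid_dupe(original, duped, d=3):
    """Eq. (1): (Q, E, A) maps to (Q, E', A') with A' = not A and editdist(E, E') &lt; d."""
    (q, e, a), (q_p, e_p, a_p) = original, duped
    return q_p == q and a_p != a and word_edit_distance(e, e_p) &lt; d
      </preformat>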
      <sec id="sec-2-1">
        <title>2.2. DUPEd-StrategyQA</title>
        <p>We use AllenAI’s StrategyQA dataset [16] to assess the DUPE skills of LLMs. The StrategyQA
dataset provides explicit facts for answering open-domain questions. We create a DUPEd
version of the StrategyQA dataset composed of a total of 325 examples, of which 173 introduce
natural language perturbations, while the remainder introduce mathematical errors (refer to
examples in Figure 2).</p>
        <p>While designing the DUPEd version, we were careful to modify the facts in the most minimal
way possible. As a result, we made a conscious effort to only alter one or two words in the
original facts whenever possible, in order to preserve the overall meaning and context of the
original question. Additionally, we refrained from using explicit negation, such as the word
not, to modify the facts, since our intent is not to evaluate the reasoning proficiency of LLMs in
handling negation.</p>
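        <p>To make these curation constraints explicit, the hypothetical filter below (our own sketch
reusing the word_edit_distance helper from Section 2.1; the paper does not describe an automated
filter) would reject candidate perturbations that introduce an explicit negation word or alter more
than a couple of words:</p>
        <preformat>
# Hedged sketch of the curation constraints described above; the paper does not
# describe an automated filter, so this only illustrates the stated rules.

NEGATION_TOKENS = {"not", "no", "never", "none"}

def introduces_negation(original: str, perturbed: str) -&gt; bool:
    """True if the perturbation adds an explicit negation word."""
    new_tokens = set(perturbed.lower().split()) - set(original.lower().split())
    return any(tok in NEGATION_TOKENS or tok.endswith("n't") for tok in new_tokens)

def acceptable_perturbation(original: str, perturbed: str, max_word_edits: int = 2) -&gt; bool:
    """Keep edits that change at most a couple of words and avoid explicit negation."""
    return (not introduces_negation(original, perturbed)
            and word_edit_distance(original, perturbed) &lt;= max_word_edits)
        </preformat>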
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Student Simulation and Prompt Design</title>
        <p>
          DUPE is highly relevant to student simulation models [
          <xref ref-type="bibr" rid="ref14">14, 17, 15</xref>
          ], which are widely used in
education and cognitive psychology research. These models help in predicting and understanding
student responses to various tasks, and thus their ability to reason over false information is
critical to their success. Given this strong connection between simulation models and DUPE,
these models can inspire innovative approaches to prompt design, which can be used to probe
DUPE skills of LLMs [
          <xref ref-type="bibr" rid="ref8 ref18">8, 18</xref>
          ]. An example of such a prompt is illustrated in Figure 1 and Section 3.
        </p>
        <p>DUPE and Counterfactual Reasoning: Counterfactual reasoning and student simulation
models require different types of reasoning. In counterfactual reasoning, the focus is on
exploring hypothetical scenarios that may or may not correspond to actual reality. The fact that
the information being considered is hypothetical or counterfactual is usually known beforehand.</p>
        <p>In contrast, a student simulation model needs to reason about both true and false information,
and may not know beforehand whether the information being considered is true or false. For
example, in figure 2, the model lacks prior knowledge about which facts are true and which ones
are perturbed. The model must identify incorrect answers from the student to make inferences
about future questions, which requires robust and nuanced reasoning capabilities beyond those
needed for counterfactual reasoning.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>We evaluate the DUPE capabilities of the two largest GPT models – GPT3.5 (version
gpt-3.5-turbo-0301) and the latest GPT4 model (version gpt-4-0314) – via experiments under two
different prompt settings, P1) “You are a question answering model. Your task is reason on
provided evidence to answer a YES or NO question”, and P2) “You are a student simulation
model. Your task is reason on student’s responses to accurately measure the student’s current
knowledge state and predict the student’s response to a YES or NO question based on the
student’s current knowledge state” from Section 2.3. An example is illustrated in Figure 1.</p>
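      <p>Concretely, each evaluation can be run with calls of the following form. This is a hedged
sketch: the model version strings and the P1/P2 system prompts are taken from the paper, while the
OpenAI client usage and the user-message format are our own assumptions; the exact prompts are
available in the linked repository.</p>
      <preformat>
# Hedged sketch of querying the two prompt settings; requires an OpenAI API key
# in the environment. The user-message format here is an assumption.
from openai import OpenAI

client = OpenAI()

P1 = ("You are a question answering model. Your task is reason on provided "
      "evidence to answer a YES or NO question")
P2 = ("You are a student simulation model. Your task is reason on student's "
      "responses to accurately measure the student's current knowledge state "
      "and predict the student's response to a YES or NO question based on "
      "the student's current knowledge state")

def ask(system_prompt: str, perturbed_facts: str, question: str,
        model: str = "gpt-4-0314") -&gt; str:
    """Send the perturbed evidence and the yes/no question under one prompt setting."""
    response = client.chat.completions.create(
        model=model,  # "gpt-3.5-turbo-0301" or "gpt-4-0314" in the paper
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Evidence: {perturbed_facts}\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
      </preformat>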
      <sec id="sec-3-1">
        <title>3.1. Main Results</title>
        <p>In the prompt setting P1, both GPT3.5 and GPT4 performed poorly on the DUPEd version of the
dataset, with decreases in accuracy of 46.0% and 45.2%, respectively. As expected, the latest
GPT4 model demonstrates superior performance to GPT3.5 on both the original and the DUPEd
StrategyQA datasets.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Student Simulation Prompt</title>
          <p>Prompt P2, inspired by the student simulation setting, informs/primes the models that the provided
evidence may be incorrect since the evidence reflects the erroneous nature of students’ responses.
We found that prompt setting P2 performs significantly better than P1 by a margin of 16.0%
for the GPT4 model. However, there was still a significant 29.2% drop in accuracy compared to
GPT4’s performance on the original dataset.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Language vs. Math Perturbations</title>
          <p>While curating the DUPEd-StrategyQA dataset, we divided the perturbations introduced into
two distinct categories - one that involved language perturbations, while the other manipulated
mathematical information (see Figure 2). Our findings suggest that both GPT models are more
resilient to math perturbations compared to language perturbations. For example, for GPT3.5 the
accuracy drops were 58.7% and 32.4% for language and math perturbations, respectively, while for
GPT4 the accuracy drops were 50.3% and 39.4%.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Root Cause of Poor DUPE Skills</title>
        <p>To explain the GPT models’ poor performance on the DUPEd dataset, we need to identify the
main factor influencing their reasoning process, i.e., whether it is the encoded information in
parameters or the manipulated evidence in prompts. Recent studies have shed light on this
issue, suggesting that factual information encoded in the parameters of LLMs plays a dominant
role in governing the generated output. For instance, the feed-forward layers in transformer
models function as key-value memories, which implies that they encode factual information, as
noted by Geva et al. [19]. Moreover, Meng et al. [20] demonstrated that localized computations,
such as Rank-One Model Editing (ROME), can modify these factual associations, leading to
alternative conclusions. These findings suggest that the encoded information in parameters has
a significant impact on LLMs’ reasoning process; further investigation is left for future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we have introduced a new reasoning paradigm we call Deduction under Perturbed
Evidence (DUPE for short). Through DUPE, we have assessed the ability of LLMs to
arrive at logically sound yet erroneous conclusions when faced with distorted initial facts. Our
study, which used a carefully curated dataset to evaluate DUPE abilities, has revealed that even
the most advanced GPT models struggle with logical reasoning in the presence of falsified
information. Moving forward, we plan to investigate the performance of different LLMs
with our dataset in varied prompt settings.</p>
    </sec>
    <sec id="sec-5">
      <title>Limitations</title>
      <p>Due to limitations in both financial and computational resources, we had to limit our testing
to only the most advanced LLMs – the GPT models. Consequently, we directed our attention
towards developing a dataset for evaluating the proposed reasoning scenarios. As a result of these
limitations, we chose to focus specifically on the evaluation of the two largest models offered
by OpenAI. While we recognize that other LLMs may produce different outcomes, we believe
that our dataset could serve as a valuable resource for further research into the capabilities and
limitations of LLMs.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work was supported by NSF grant 1842378, ONR grant N0014-20-1-2534, AFOSR grant
FA9550-22-1-0060, and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of Deep Bidirectional Transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/
          <year>1907</year>
          .11692 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/
          <year>1907</year>
          .11692.
          arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances In Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] OpenAI, GPT-4
          <source>technical report</source>
          ,
          <year>2023</year>
          .
          arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Language models are multilingual chain-of-thought reasoners</article-title>
          ,
          <source>arXiv preprint arXiv:2210.03057</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schärli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , E. Chi,
          <article-title>Least-to-most prompting enables complex reasoning in Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2205.10625</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          , E. Kamar,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          , et al.,
          <source>Sparks of Artificial General Intelligence: Early experiments with GPT-4</source>
          , arXiv preprint arXiv:
          <volume>2303</volume>
          .12712 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Hellaswag:
            <article-title>Can a machine really finish your sentence?</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>07830</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cowhey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schoenick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <article-title>Think you have solved Question Answering? Try ARC, the AI2 reasoning challenge</article-title>
          , arXiv preprint arXiv:
          <year>1803</year>
          .
          <volume>05457</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Winogrande:</surname>
          </string-name>
          <article-title>An adversarial winograd schema challenge at scale</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          , et al.,
          <article-title>Training verifiers to solve math word problems</article-title>
          , arXiv preprint arXiv:
          <volume>2110</volume>
          .14168 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <source>Deep Knowledge Tracing, Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] N. Liu, Z. Wang, R. Baraniuk, A. Lan, Open-ended knowledge tracing for computer science education, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3849–3862.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did Aristotle use a laptop? A Question Answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (2021) 346–361.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Sonkar, A. E. Waters, A. S. Lan, P. J. Grimaldi, R. G. Baraniuk, qDKT: Question-centric Deep Knowledge Tracing, arXiv preprint arXiv:2005.12442 (2020).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Bommarito II, D. M. Katz, GPT takes the Bar Exam, arXiv preprint arXiv:2212.14402 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Geva, R. Schuster, J. Berant, O. Levy, Transformer feed-forward layers are key-value memories, arXiv preprint arXiv:2012.14913 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual associations in GPT, Advances in Neural Information Processing Systems 35 (2022) 17359–17372.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>