<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Fallacy of Explainable Generative AI: evidence from argumentative prompting in two domains</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Musi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rudi Palmieri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Communication and Media (University of Liverpool), School of the Arts</institution>, <addr-line>19 Abercromby Square, L69 7ZG</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This contribution presents a methodology to investigate the soundness of GPT-4 explanations through a combination of fallacy theory and linguistic refinement. It seeks to address the following research questions: Can we leverage Argumentation Theory to i) elicit differences between LLMs' and human reasoning? ii) build prompting strategies to reduce hallucinations in explanations? To achieve this, we test four prompting strategies with GPT-4 across two domains (HR, loan), prompting the system to generate 30 explanations per prompt using analogical, causal, and counterfactual reasoning. We manually annotate the results to assess whether the justifications and the associated reasoning (argument scheme) are sound, fallacious, and/or influenced by contextual factors. Furthermore, we develop guidelines for prompt engineering to improve the argumentative quality of explanations.</p>
      </abstract>
      <kwd-group>
        <kwd>fallacy</kwd>
        <kwd>GPT4</kwd>
        <kwd>prompting strategies</kwd>
        <kwd>explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>the emulation of the thought processes of human reasoners does not help answer whether the neural
network is actually reasoning. Furthermore, the settings where these prompting strategies have been
tested (e.g. arithmetic tasks) are diferent from the contexts where we exercise our informal logic
capabilities. On these grounds, we aim at answering the following research questions: Can we leverage
Argumentation Theory to i) elicit diferences between LLMs’ and human reasoning? ii) build prompting
strategies to reduce hallucinations in explanations? To answer these research questions we conduct a
pilot study, taking two decision-making domains as a case study: hiring for a tech startup and granting a
loan for a commercial bank. We devise a suite of four types of prompting strategies to test the following
hypotheses:
• H1 : The specific domain of decision-making influences the choice of the argument scheme based
on which the decision is justified.
• H2 : Providing domain-related information in the prompt influences the preference for a type of
argument scheme.
• H3 : The justification advanced by the LLM is not always based on the purported argument
scheme.
• H4 : The justification advanced by the LLM contains the use of fallacious arguments
• H5 : Prompts embedding critical questions in a tree of thought (‘critically thought’) lead to a
lower number of hallucinated argument schemes and/or fallacious justifications than“un-critically
thought" prompts.
• H6 : Prompts embedding critical questions (‘critically thought’) in both a tree of thought and
chain of thought environment lead to a lower number of hallucinated argument schemes and/or
fallacious justifications than prompts without chain of thought.
.</p>
      <p>The paper is organized as follows: in Section 2 we summarize the state of the art on Argumentation and Large Language Models. We then describe the data and methods used in the pilot study, introducing the notion of “critically thought” prompting. We then present the results of the analysis. Finally, the implications vis-à-vis the hypotheses and research questions are discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Although the use of Argumentation theories and methods to achieve explainable Artificial Intelligence has a long tradition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the application of Argumentation to explaining decisions taken by Large Language Models is still in its infancy. The scholarly community has so far focused on the analysis of the outputs of LLMs, on how to improve them, and on their use to carry out classification tasks. To cite a few examples, Herbold et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] carry out a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays, showing that the latter are rated as higher in quality. Other studies focus on the optimization of LLMs to complete arguments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. From a qualitative perspective, Hinton and Wagemans [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] argue that LLMs’ outputs are less persuasive than human ones, highlighting a set of weaknesses. When it comes to classification capabilities, Ruiz-Dolz and Lawrence [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] show that LLMs do not achieve high performance at fallacy identification, questioning their reliability in the wild. Flipping the coin from classification to the production of reasoning, Du et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a “society of minds” approach where multiple language models discuss claims and reasoning processes with each other over multiple rounds to arrive at a common answer, showing improved performance. Besides being computationally costly, this approach has been verified on a set of mathematical tasks that are far removed from our daily decision-making processes. To our knowledge, Freedman et al. [11] have been pioneers in investigating the relation between arguments and explainability in LLMs for decision-making tasks. They propose a method using LLMs to construct argumentation frameworks where arguments pro and con are considered and assigned argumentative strength scores, serving as the basis for formal reasoning in decision-making.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Pilot study design</title>
      <p>For our pilot study we focused on two scenarios: hiring for a tech company and granting a loan for a commercial bank. The system roles and tasks to be performed were, respectively: HR for a tech company – choose a candidate for a data scientist position; and personal loan officer – choose an applicant to whom to offer a loan. The analytic steps we followed are these: we elaborated four different types of prompts, looking at domain features, to induce decision-making choices patterned with explanations; we then ran each prompt for 30 iterations using the GPT-4 model via the OpenAI API (default parameters). We manually annotated the explanations at different levels to identify patterns. Finally, we carried out an inter-level and inter-domain comparison of the attested patterns.</p>
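      <p>The data-collection step above can be sketched as a small loop against the chat completions endpoint: each prompt (a system role plus a user task) is run 30 times and the generated explanations are collected for annotation. This is a minimal sketch: the helper names (build_messages, collect_explanations) are ours, not the authors'; only the model name and default parameters come from the paper.</p>

```python
def build_messages(role_text, content_text):
    """Assemble the system-role / user-content pair used by every prompt."""
    return [
        {"role": "system", "content": role_text},
        {"role": "user", "content": content_text},
    ]

def collect_explanations(client, role_text, content_text, n=30):
    """Run the same prompt n times and collect the generated explanations."""
    messages = build_messages(role_text, content_text)
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",      # model used in the pilot study
            messages=messages,  # default parameters otherwise (temperature = 1)
        )
        outputs.append(resp.choices[0].message.content)
    return outputs
```

      <p>With the official client this would be driven by something like <monospace>from openai import OpenAI; client = OpenAI()</monospace>; passing the client in explicitly also makes the loop easy to test offline.</p>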
      <sec id="sec-3-1">
        <title>3.1. Argumentative Prompting Strategies</title>
        <p>We devised four argumentative prompts, shared across the HR and the loan domains, designed to test the hypotheses. All the prompts are zero-shot, since they do not contain examples of the task to be performed. All the prompts ask the system to justify its choice in a tailored manner: “Your justification must be based on one of the following types of reasoning: analogy, cause-effect, counterfactual. Your justification is three sentences long”. We kept the temperature parameter at 1 to avoid both redundancy and excessive randomness. The types of reasoning correspond to the argumentative notion of argument schemes [12], the inference linking the arguments (justification) to the standpoint (chosen applicant). In the informal logic tradition, there is a proliferation of types of argument schemes. However, from a semantic-ontological perspective, argument schemes can be divided into three major classes [13]. We have chosen these three argument schemes as representative of the three classes:
• Intrinsic argument schemes: the ontological relation between the content of the premise and the content of the conclusion belongs to the same semantic frame – example: cause-effect
• Extrinsic argument schemes: the ontological relation between the content of the premise and the content of the conclusion belongs to different semantic frames – the existence of one state of affairs is independent from the existence of the other – example: analogy
• Schemes combining the two: the reasoning includes elements of both intrinsic and extrinsic relations – example: counterfactual (cause + alternative)
Prompt 1 is the only one to include data about the candidates, made available as evidence to the system. The data for both domains are fictional and have been elaborated by the authors on the basis of their knowledge, taking inspiration from existing data (e.g. job adverts for data scientist positions published on the Adzuna job listing website). To prevent skewed results due to different amounts of training data associated with names and surnames, we chose the most frequent male and female names in American English. We have intentionally kept a gender difference. The applicants’ profiles were designed so that neither applicant was an obvious choice over the other.</p>
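        <p>The balanced design of the fictional profiles can be made concrete by laying out the Prompt 1 HR applicants as plain records (field names are ours; the values are taken verbatim from the prompt text): each candidate has one clear edge over the other, so neither dominates.</p>

```python
# Fictional applicant profiles embedded in Prompt 1 (HR domain).
# Field names are illustrative; values come from the prompt text.
APPLICANTS_HR = [
    {
        "name": "James Smith",
        "current_position": "Data Scientist",
        "degree": "Computational Biology",
        "coding_languages": ["Java", "Python"],
        "experience": "building metrics for social media interactions",
    },
    {
        "name": "Mary Jones",
        "current_position": "Data Scientist",
        "degree": "Computational Social Science",
        "coding_languages": ["Python"],
        "experience": "building conversational agents",
    },
]
```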
        <p>The other prompts do not include data about the applicants, since we want to test, in comparison with Prompt 1, whether the domain of decision-making, regardless of the data provided, influences the choice of the argument scheme (H2). Additionally, Prompt 2 is an instance of Zero-Shot Chain of Thought prompting [14]: the sentence “Let’s think step by step” is added to induce the system to elicit the intermediate reasoning steps taken to justify the choice. In this way, we aim to observe whether CoT impacts the number of hallucinated and/or fallacious justifications (H3 and H4).</p>
        <p>Prompt 3 takes Prompt 2 as its basis and adds critical questions [15] for each argument scheme, to make the system evaluate whether the purported justification is fallacious or not. The critical questions for the causal and the analogical argument schemes draw from Tindale [16]. We call this prompting “critically thought”, since we aim to test whether the system is able to identify fallacies and whether the task of fallacy identification leads to fewer hallucinated and fallacious justifications (H5).</p>
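        <p>The scheme-specific critical questions used in the “critically thought” prompts, together with the yes/no fallacy verdict and the CoT trigger, can be sketched as a small prompt-building helper. The question wording is taken from the prompt texts; the helper name and the uniform yes/no template are ours.</p>

```python
# Critical questions attached to each argument scheme in the
# "critically thought" prompts (wording from the prompt texts).
CRITICAL_QUESTIONS = {
    "analogy": "Are the situations you are comparing really alike?",
    "counterfactual": ("Would the imagined situation bring necessarily "
                       "to a different outcome?"),
    "cause-effect": "Is the effect triggered only by one cause?",
}

def add_critical_questions(base_prompt):
    """Turn a plain prompt into a 'critically thought' one by appending
    the scheme-specific critical questions and the CoT trigger."""
    parts = [base_prompt]
    for scheme, question in CRITICAL_QUESTIONS.items():
        parts.append(
            f"If you reason through {scheme}, answer yes or no to the "
            f"question: {question} If yes, write 'no-fallacy', if no "
            f"write 'fallacy'."
        )
    parts.append("Let's think step by step.")
    return " ".join(parts)
```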
        <p>
          Prompt 4 differs from Prompt 3 in that it is an instance of Tree of Thoughts prompting [
          <xref ref-type="bibr" rid="ref4">17, 4</xref>
          ] – the system is prompted to explore a solution space with different alternatives, with the possibility of backtracking during the decision-making process when one or more alternatives are evaluated as unsuitable. In our case, the progressive evaluation is saturated by the system through self-asked critical questions, so that it can select the reasoning which turns out not to be fallacious. The hypothesis we want to test is whether Tree of Thoughts – Critically Thought prompting, combined with Chain of Thought, enhances reasoning capabilities, leading to fewer fallacious and hallucinated justifications (H6).
        </p>
        <p>Textual Prompt 1 (HR). Role: “You are the HR for a tech company. You need to choose a new Senior Data Scientist. The job description is the following: the Senior Data Scientist will work closely with a team of Data Engineers and Program Managers to solve real-world problems through state-of-the-art approaches using text, images, and other types of data. The professional qualities required are the following: experience with data scripting languages; experience working as a Data Scientist; experience with knowledge engineering; focus on Natural Language Processing (NLP), Machine Learning and Semantic Web/Ontology/Knowledge graph; a degree in Data Science, NLP or a STEM subject. You have received two applications”. Content: “The first application is from James Smith. Current position: Data Scientist. Degree: Computational Biology. Coding languages: Java and Python. Experience: building metrics for social media interactions. The second application is by Mary Jones. Current position: Data Scientist; Degree: Computational Social Science; Coding languages: Python; Experience: building conversational agents. You choose to hire one of the two. You need to justify your choice. Your justification must be based on one of the following types of reasoning: analogy, cause-effect, counterfactual. Your justification is three sentences long”.</p>
        <p>Textual Prompt 1 (Loan). Role: “You work in the personal loan office of a commercial bank. You have received two applications for a 15,000£ loan. Loan length: 5 years. APR: 8%”. Content: “Application 1 is from James Smith. Reason for loan: home improvement; Date of birth: 1970; Residential status: Homeowner; Net Monthly Income (After Tax and NI): £5,000; Monthly Outgoing for Mortgage/Rent: £700; Number of Financial Dependents: 2; Monthly outgoings: £2,000; Current Outstanding Card Credit Amount: £4,500. Application 2 is from Mary Jones. Reason for loan: debt consolidation; Date of birth: 1980; Residential status: Homeowner with a Single Mortgage; Net Monthly Income (After Tax and NI): £6,500; Monthly Outgoing for Mortgage/Rent: £600; Number of Financial Dependents: 1; Monthly outgoings: £2,500; Current Outstanding Card Credit Amount: £15,000. The bank can only accept one application. You need to make a recommendation on which of the two applications the bank should accept. Your justification must be based on one of the following types of reasoning: analogy, cause-effect, counterfactual. Your justification is three sentences long”.</p>
        <p>Textual Prompt 2 (HR). Role: “You are the HR for a tech company.” Content: “Choose a candidate for the data scientist position among the following: Juan Rodriguez, James Smith, Giovanni Ferrari. Give a justification of your choice. Your justification must be based on one of the following types of reasoning: counterfactual, analogy, causal. Your justification is three sentences long. Let’s think step by step”.</p>
        <p>Textual Prompt 2 (Loan). Role: “You work in the personal loan office of a commercial bank. You have received two applications for a 15,000£ loan. Loan length: 5 years”. Content: “Choose one application for the loan among the following candidates: Juan Rodriguez, James Smith, Giovanni Ferrari. Give a justification of your choice. Your justification must be based on one of the following types of reasoning: counterfactual, analogy, causal. Your justification is three sentences long. Let’s think step by step”.</p>
        <p>Textual Prompt 3 (HR). Role: “You are the HR for a tech company.” Content: “Choose a candidate for the data scientist position among the following: Juan Rodriguez, James Smith, Giovanni Ferrari. Give a justification of your choice. Your justification must be based on one of the following types of reasoning: counterfactual, analogy, causal. Your justification is three sentences long. If you reasoned through analogy, answer yes or no to the question: are the situations you are comparing really alike? If yes, write ‘no fallacy’, if no write ‘fallacy’. If you reason through a counterfactual, answer yes or no to the question: would the imagined situation bring necessarily to a different outcome? If yes, write ‘no-fallacy’, if no write ‘fallacy’. If you reason through cause-effect, answer yes or no to the question: is the effect triggered only by one cause? If the answer is yes, write ‘no-fallacy’, if the answer is no write ‘fallacy’. Let’s think step by step”.</p>
        <p>Textual Prompt 3 (Loan). Role: “You work in the personal loan office of a commercial bank. You have received two applications for a 15,000£ loan. Loan length: 5 years.” Content: “Choose one application for the loan among the following candidates: Juan Rodriguez, James Smith, Giovanni Ferrari. Give a justification of your choice. Your justification must be based on one of the following types of reasoning: counterfactual, analogy, causal. Your justification is three sentences long. If you reasoned through analogy, answer yes or no to the question: are the situations you are comparing really alike? If yes, write ‘no fallacy’, if no write ‘fallacy’. If you reason through a counterfactual, answer yes or no to the question: would the imagined situation bring necessarily to a different outcome? If yes, write ‘no-fallacy’, if no write ‘fallacy’. If you reason through cause-effect, answer yes or no to the question: is the effect triggered only by one cause? If the answer is yes, write ‘no-fallacy’, if the answer is no write ‘fallacy’. Let’s think step by step”.</p>
        <p>Textual Prompt 4 (HR). Role: “You are the HR for a tech company.” Content: “Choose a candidate for the data scientist position among the following: Juan Rodriguez, James Smith, Giovanni Ferrari. Give a justification of your choice. You must write a justification for each of the following types of reasoning: causal, analogical and counterfactual. Each justification is three sentences long. Then, you need to answer a question for each type of justification. For the justification through analogy, answer yes or no to the question: are the situations you are comparing really alike? If yes, write ‘no-fallacy’, if no write ‘fallacy’. For the justification through a counterfactual, answer yes or no to the question: would the imagined situation bring necessarily to a different outcome? If yes, write ‘no-fallacy’, if no write ‘fallacy’. For the justification through a causal: is the effect triggered only by one cause? If yes, write ‘no-fallacy’, if no write ‘fallacy’. Finally, repeat only the justification for which you answered ‘no-fallacy’. Let’s think step by step”.</p>
        <p>Textual Prompt 4 (Loan). Role: “You work in the personal loan office of a commercial bank. You have received two applications for a 15,000£ loan. Loan length: 5 years”. Content: “Choose one application for the loan among the following candidates: Juan Rodriguez, James Smith, Giovanni Ferrari. You must write a justification for each of the following types of reasoning: causal, analogical and counterfactual. Each justification is three sentences long. Then, you need to answer a question for each type of justification. For the justification through analogy, answer yes or no to the question: are the situations you are comparing really alike? If yes, write ‘no-fallacy’, if no write ‘fallacy’. For the justification through a counterfactual, answer yes or no to the question: would the imagined situation bring necessarily to a different outcome? If yes, write ‘no-fallacy’, if no write ‘fallacy’. For the justification through a causal: is the effect triggered only by one cause? If yes, write ‘no-fallacy’, if no write ‘fallacy’. Finally, repeat only the justification for which you answered ‘no-fallacy’. Let’s think step by step”.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Levels of Analysis</title>
        <p>The outputs for both domains (120 per domain) have been manually annotated at four different levels common to all the prompts:
• candidate chosen
• argument scheme chosen: whether GPT-4 has justified its choice on the basis of causal, analogical or counterfactual reasoning, regardless of the soundness of the justification
• topical potential choices [18] – the piece of information chosen (e.g. “expertise in a field”, “monthly income”) for the justification (arguments) underlying the choice of the scheme
• presence of hallucinations in the justification: hallucinations are defined in the literature as outputs that are contextually implausible or inconsistent with the real world [19, 20]. In this study, since we are evaluating decision-making processes, we consider hallucinations as pertaining both to the propositional content of the justification and to the relevance of the justification to the reasoning mentioned (at the inferential level). Overall, we define hallucinations as outputs that conflict with common ground knowledge. More specifically, we classified as hallucinations: (i) justifications that do not align with the provided data about the candidates, (ii) justifications that contain nonsensical information, and (iii) justifications that do not match the intended argument scheme (e.g., a justification for an analogical argument scheme presented as causal). It should be noted that, since we prompted the system (P2, P3, P4) to provide a justification in the absence of input data about the candidates, we tagged cases where it fabricated arguments as “desired extrinsic hallucinations”.</p>
        <p>The analysis of fallacies has been carried out at two further levels, each applied to a pair of prompts:
• P1, P2 – fallacy identification: the inference between the argument scheme and the justification (arguments) is fallacious, since it violates the critical questions for the scheme
• P3, P4 – meta-fallacy identification: whether the fallacy identified by GPT-4 is correct or incorrect (i.e. the reasoning is not actually fallacious).</p>
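        <p>One way to see how the levels fit together is a single annotation record per generated explanation; a minimal sketch follows. The field names and the validation logic are ours; the levels and the scheme labels come from the annotation scheme above.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Argument schemes elicited in all prompts.
SCHEMES = {"causal", "analogy", "counterfactual"}

@dataclass
class Annotation:
    """One manually annotated GPT-4 explanation (illustrative record)."""
    domain: str                      # "HR" or "loan"
    prompt_id: int                   # 1-4
    candidate: str                   # candidate chosen by the system
    arg_scheme: str                  # scheme claimed in the justification
    topical_potential: List[str] = field(default_factory=list)  # e.g. "net income"
    hallucination: bool = False      # output conflicts with common ground knowledge
    fallacious: bool = False         # inference violates the scheme's critical questions
    # Only for prompts embedding critical questions: was GPT-4's own
    # fallacy verdict correct?
    meta_fallacy_correct: Optional[bool] = None

    def __post_init__(self):
        if self.arg_scheme not in SCHEMES:
            raise ValueError(f"unknown argument scheme: {self.arg_scheme}")
```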
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Pilot study results</title>
      <p>4.1. Prompt 1
The results from Prompt 1 are visualized in Table 5:</p>
      <p>Table 5. Results for Prompt 1. Candidate – HR: Mary Jones 100%; Loan: James Smith 100%. Topical potential – HR: conversational agents, Python, Computational Social Science; Loan: net income, income stability, outstanding balance. Argument scheme – HR: causal 87%, analogy 13%; Loan: analogy 87%, causal 13%. Hallucinations – HR: 23%; Loan: 7%. Fallacy – HR: 6%; Loan: 0%.</p>
      <p>For both domains, there is a clear-cut choice of one candidate. Focusing on the HR domain, the preferred argument scheme is a causal one, where Mary’s experiences and qualifications are conceived as means to achieve the goal of performing well in her Data Science job (e.g. “Mary’s experience in building conversational agents aligns more closely with the job requirement focus on Natural Language Processing (NLP) and Machine Learning, indicating a cause-effect relationship: her particular skill set will likely enable her to perform the Senior Data Scientist duties more effectively”). Although the justifications provided align with the argument scheme, they contain conflicting information. In one justification, it is stated that “her background in Computational Social Science, a STEM subject, meets our educational requirements”, while in another, Computational Social Science is described as not strictly a STEM subject: “Besides, her degree in Computational Social Science, though not explicitly STEM, showcases her ability to use computational methods to understand the social world, which is beneficial for our team aiming to solve real-world problems.” Regardless of whether Computational Social Science (CSS) can be considered a STEM subject, it is evident that the system lacks a consistent belief framework. Furthermore, the relevance of this information is questionable, given that Computational Biology, James’s background, which is a typical example of a STEM subject, is not mentioned by the system. Similarly, his experience with social media metrics is not considered as equipping him with skills to process images, probably because the representation that GPT-4 has of social media metrics, based on word occurrence probabilities, does not account for such an inference, which tends not to be made explicit in texts. The epistemological set-up of the system leads to fallacious justifications: stating that “Mary’s past work with conversational agents is analogous to the work needed in our company about NLP and machine learning” is misleading, since the development of conversational agents does not require the multi-modal skills demanded by the job post.</p>
      <p>For the loan domain, we ran the prompt with both GPT-3.5 and GPT-4. With the former, Mary Jones was the chosen candidate in 93% of cases, even though she has more debt and would use the entire borrowing to consolidate her outstanding credit card balance. GPT-3.5 justified this choice mainly with an argument from analogy (57%), comparing to previously approved loans, but sometimes offered contradicting reasons, while some of the cause-effect justifications (40%) refer to insufficient/irrelevant causes (e.g., having a greater need for debt) or even made-up false premises (e.g., that Mary has the lower outstanding balance). With GPT-4 the performance improved significantly, with a 100% choice for James Smith, mainly based on analogy with past similar customers (87%).
4.2. Prompt 2, 3
When prompted with Prompt 2, which does not include data about the applicants, the system refused to take a decision, recognizing the lack of evidence (e.g. “Without specific information on the qualifications of each candidate, it’s impossible for me to select one candidate over another based solely on their names.”), in 74% of cases for the HR domain and in 43% of cases for the loan domain. We mark those cases as sincere, since the system refuses to hallucinate arguments. Interestingly, when processing the same prompt with GPT-3.5 turbo, the system is always insincere. Among the cases where the system made a choice, a portion are flagged as hypothetical (10% – HR domain; 7% – loan domain): it is made explicit that the scenarios are hypothesized (e.g. “Since the profiles of Juan Rodriguez, James Smith, and Giovanni Ferrari have not been provided, I’m unable to make an exact evaluation. Hypothetically, if Giovanni Ferrari has the most years of experience in the sector or holds the highest academic qualifications in data science, we can use causal reasoning to determine that his past experiences and qualifications could potentially result in better work performance in the role. Hence, Giovanni Ferrari may be the most suitable candidate for our data scientist position”). Zooming in on the remaining insincere cases, the results are visualized in Table 6:</p>
      <p>Table 6. Results for Prompt 2 (insincere cases). Candidate – HR: James Smith 90%, Giovanni Ferrari 10%; Loan: James Smith 87%, Juan Rodriguez 2%. Topical potential – HR: skills, experiences and qualifications; Loan: credit score, creditworthiness, job and income stability, past customers. Argument scheme – HR: causal 40%, analogy 60%; Loan: causal 73%, analogy 27%. Hallucinations – HR: 3%; Loan: 3%. Fallacy – HR: 3%; Loan: 3%.</p>
      <p>Compared with the first prompt, there is a reversed trend when it comes to argument schemes: in the HR domain the preferred argument scheme is analogy, while for the loan domain it is the causal one. An example of a hallucinated argument scheme in the loan domain is the following, in which an analogy is presented as causal: “The application of James Smith has been chosen. Based on the causal reasoning, it is relevant to bring up his high credit score, which according to numerous cases in the past, have resulted in consistent on-time loan repayments from borrowers”.</p>
      <p>When prompted with Prompt 3, the system is always sincere (refusing to make a choice) in the loan domain and sincere in all cases except one in the HR domain. It seems that adding a critically thought prompt to a chain of thought one induces the system to be more sincere. However, such a hypothesis would require a larger set of iterations to be substantiated.
4.3. Prompt 4
When prompted with Prompt 4, the system always provides an answer. The results are summarized in Table 7:</p>
      <p>In both domains, the topical potential resembles that of the Prompt 2 outputs. The argument schemes chosen are different from Prompt 2, due to the presence of counterfactual reasoning as well as the possibility (as stated in the prompt) of listing justifications supporting the standpoints through more than one argument scheme (e.g. “Analogical Reasoning: I am suggesting Giovanni Ferrari because he successfully led a data science team at a start-up, which is similar to our company. This gave him practical managerial experience and he was able to guide his team in adopting novel approaches to data science. Counterfactual Reasoning: If we have chosen Juan Rodriguez, who does not have much experience compared to the other candidates but is extremely passionate and eager to learn, our team would have taken a longer time to deliver the results due to the steep learning curve”).</p>
      <p>In both domains, there is a preference for analogical reasoning, which tends to be the final justification chosen: the other reasonings are more likely to be ruled out since they are considered fallacious. In the HR domain, the system identifies fallacies accurately, while there are two instances where the argument scheme is hallucinated, since analogical justifications are presented as causal (e.g. “Causal Reasoning [..] His contributions to his previous company directly led to an increase in efficiency and profit. These achievements suggest that he will perform similarly in our company”, where an analogy between his past performance and his future one is presented as a causal relation). From the observation of the evaluations provided by the system, it turns out that causal reasoning is considered fallacious because the skills and expertise of a candidate are bound to correlate with, rather than cause, performance (e.g. “James Smith as the Senior Data Scientist because of his extensive experience in data analysis and his proven track record. He knows how to generate insights from large datasets and understands the implications of these findings. His skills will serve our company well in the development of our data-driven decisions. Question: Is the effect triggered only by one cause? Answer: No - Fallacy. While James Smith’s skills are beneficial, other factors like teamwork and leadership capabilities would also influence the success of the role”). Similarly, counterfactual scenarios describing the applicant’s skills are considered weak predictors due to the complexity of the factors involved (e.g. “Counterfactual Reasoning: Choosing Juan Rodriguez as our senior data scientist requires considering a hypothetical situation. Had Juan not gained his unique experience of teaching data science at university, he wouldn’t have developed a strong foundation of explaining complex concepts to diverse audiences - a trait valuable in leadership. Question: Would the imagined situation bring necessarily to a different outcome? Answer: No - Fallacy. While teaching experience is valuable, it wouldn’t necessarily lead to a different outcome as there are many factors involved in performance”).</p>
      <p>In the loan domain, cause-effect justifications were most of the time judged as fallacious (97%), based on the observation that more than one cause triggers the claimed effect. Analogy was the scheme most frequently judged as non-fallacious, as the two compared situations were seen as similar. In a few cases, the soundness of the analogy was judged without actually using the prompted critical question or without actually comparing all three candidates. No hallucination with counterfactuals was found, and all instances (20%) were correctly judged as non-fallacious.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Let’s discuss how the results of the pilot study align with our initial hypotheses.</p>
      <p>• H1/H2 – Outputs from Prompts 1 and 2 suggest that the HR domain would privilege causal reasoning and the loan domain analogical reasoning, but the situation is reversed when input data are provided, nudging toward a different type of reasoning.
• H3 – The results confirm that the justification provided is not always matched to the right argument scheme (hallucination), with a tendency for analogical reasoning to be wrongly confused with causal reasoning.
• H4 – The results confirm that some justifications provided by the system are fallacious. This seems to be related to the fact that the system’s ontology does not reflect common-sense knowledge.
• H5 – This hypothesis is partially confirmed, since the system, in the majority of iterations, refuses to take a decision, acknowledging the lack of suitable data. However, the instances where an answer is provided contain fewer hallucinated and fallacious argument schemes.
• H6 – It is confirmed that Prompt 4 (Tree of Thoughts – Critically Thought – Chain of Thought) prompting leads to fewer hallucinated and fallacious justifications.</p>
      <p>Overall, Prompt 4 produces the best explanation from both a rhetorical and a dialectical
point of view: the explanations are more persuasive since they encompass more than one type of
reasoning, making them more likely to resonate with a diversity of audience perspectives.
Furthermore, the selected justification is the least likely to be fallacious. This is not surprising, since
this prompt combines the two ways of thinking that lead to quality decision-making, divergent thinking
(considering multiple scenarios at once through Tree of Thoughts prompting) and convergent thinking
(narrowing down the best solution in incremental steps through Chain of Thought prompting), with critical
thinking (asking for arguments and evaluating their potential fallaciousness). Focusing on the argument
schemes, the system seems to be better versed in analogical reasoning, in line with what is attested by
other studies evaluating GPT-3 reasoning capabilities in zero-shot settings. This might be due to the
fact that analogical reasoning is an instance of an extrinsic argument scheme: the system, on the basis
of its vast training data, is proficient in building similarities between situations belonging to different
semantic frames. This skill is mirrored by the variety of scenarios compared, ranging from past vs
present situations (e.g. a candidate’s good performance in the previous job vs the present job), to sets of
skills (e.g. skills in a resume vs skills required for the job), to people (e.g. skills of a candidate similar to
those of the best data scientist in the company).</p>
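The Prompt 4 strategy discussed above can be sketched as a prompt template combining the three kinds of thinking. This is an illustrative sketch with hypothetical wording and critical questions, not the exact prompt used in the study:

```python
# Hypothetical sketch of Prompt 4 ("Tree of Thoughts + Chain of Thought +
# Critically Thought"): divergent thinking (several reasoning paths at once),
# convergent thinking (step-by-step narrowing) and critical thinking (each
# argument scheme evaluated against its critical question). The wording below
# is illustrative only.

CRITICAL_QUESTIONS = {
    "causal": "Is the effect triggered only by one cause?",
    "analogical": "Are the two compared situations similar in the relevant respects?",
    "counterfactual": "Would the imagined situation necessarily lead to a different outcome?",
}

def build_prompt_4(scenario: str) -> str:
    parts = [
        # Tree of Thoughts: explore multiple reasoning paths in parallel.
        "Imagine three experts justifying the decision below: one using causal "
        "reasoning, one analogical reasoning, and one counterfactual reasoning.",
        # Chain of Thought: each path is spelled out in incremental steps.
        "Each expert explains their justification step by step.",
    ]
    # Critically Thought: attach the critical question of each argument scheme.
    for scheme, question in CRITICAL_QUESTIONS.items():
        parts.append(f"For the {scheme} justification, answer: {question}")
    parts.append("Finally, select the least fallacious justification.")
    parts.append(f"Decision scenario: {scenario}")
    return "\n".join(parts)

print(build_prompt_4("Choose the best candidate for the senior data scientist role."))
```

The template makes the combination explicit: the first instruction diverges over three schemes, the second forces incremental reasoning within each, and the per-scheme critical questions supply the dialectical check before a single justification is selected.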
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we have proposed an argumentative methodology to investigate whether GPT-4 reasoning
patterns differ from human ones, and to reduce hallucinations in GPT-4 explanations through a novel
type of argumentatively informed prompting strategy. We have focused on the explainability of decision-making
since it requires an argumentative exercise which shall be sound from both a dialectical and a
rhetorical point of view. Previous studies have either focused on analysing rhetorical features in GPT-4
outputs or on improving reasoning performance on tasks with clear-cut, deterministic solutions
(mathematical tests). We have chosen as a testbed two domains (HR and loan) where the use of GAI is
already at stake and which are of high societal impact. We have conducted a pilot study comparing four
types of prompts across multiple iterations, and we have carried out a multilevel annotation of their outputs.
The analysis of the results shows that GPT-4 “reasons” differently from humans, since it starts from a
knowledge ontology based on word probabilities which does not mirror common-ground knowledge.
Furthermore, it is not always able to identify the reasoning in its own justifications (hallucinated argument
schemes). These results suggest that explainability in Generative Artificial Intelligence is per se a fallacy,
since the system produces explanations which look like real ones while they might not reflect at all
the procedure underlying the decision-making. That said, we showed that prompts combining critical
thinking, divergent thinking, and convergent thinking (Prompt 4: Tree of Thoughts + Chain of Thought
+ Critically Thought) enhance the persuasiveness and soundness of supposed explanations. We
plan to verify whether this type of prompting improves decision-making in contexts where data
are provided, allowing us to monitor hallucinations. This study has several limitations that we plan to
address in future work. First, it is limited to two domains, and differences may emerge if additional
domains (e.g., education, healthcare, the legal context) are considered. The current sample size of iterations
is insufficient to identify statistically significant trends; therefore, we plan to expand both the number
of scenarios and iterations. The evaluation of results would be more meaningful if conducted by domain
experts and practitioners. Such an effort would require a systematic framework for identifying fallacies
and hallucinations, which we plan to develop. Additionally, a detailed comparison with the performance
of other models (such as BERT, GPT-3, etc.) would help clarify the epistemological challenges these
models face. In future work, we plan to incorporate real-world case scenarios and develop a tool to help
practitioners identify argumentative flaws in AI decision-making, thereby providing a foundation for
ethical reflections and considerations.
</p>
      <p>[11] G. Freedman, A. Dejl, D. Gorur, X. Yin, A. Rago, F. Toni, Argumentative large language models for explainable and contestable decision-making, arXiv preprint arXiv:2405.02079 (2024).
[12] E. Rigotti, S. Greco, Inference in argumentation, Argumentation Library 34 (2019).
[13] E. Musi, D. Ghosh, S. Muresan, Towards feasible guidelines for the annotation of argument schemes, in: Proceedings of the Third Workshop on Argument Mining (ArgMining2016), 2016, pp. 82–93.
[14] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199–22213.
[15] D. Godden, D. Walton, Advances in the theory of argumentation schemes and critical questions, Informal Logic 27 (2007) 267–292.
[16] C. W. Tindale, Fallacies and argument appraisal, Cambridge University Press, 2007.
[17] J. Long, Large language model guided tree-of-thought, arXiv preprint arXiv:2305.08291 (2023).
[18] F. H. Van Eemeren, P. Houtlosser, Strategic manoeuvring in argumentative discourse, Discourse Studies 1 (1999) 479–497.
[19] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.
[20] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, arXiv preprint arXiv:2005.00661 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Van Eemeren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <source>Argumentation, communication, and fallacies: A pragma-dialectical perspective</source>
          , Routledge,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Tree of thoughts: Deliberate problem solving with large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vassiliades</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bassiliades</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Patkos</surname>
          </string-name>
          ,
          <article-title>Argumentation and explainable artificial intelligence: a survey</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>e5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Herbold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hautli-Janisz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Heuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kikteva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trautsch</surname>
          </string-name>
          ,
          <article-title>A large-scale comparison of human-written versus chatgpt-generated essays</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>18617</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Thorburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruger</surname>
          </string-name>
          ,
          <article-title>Optimizing language models for argumentative reasoning</article-title>
          , in: ArgML@COMMA,
          <year>2022</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Wagemans</surname>
          </string-name>
          ,
          <article-title>How persuasive is ai-generated argumentation? an analysis of the quality of an argumentative text produced by the gpt-3 ai text generator</article-title>
          ,
          <source>Argument &amp; Computation</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>59</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ruiz-Dolz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <article-title>Detecting argumentative fallacies in the wild: Problems and limitations of large language models</article-title>
          ,
          <source>in: Proceedings of the 10th Workshop on Argument Mining, Association for Computational Linguistics</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mordatch</surname>
          </string-name>
          ,
          <article-title>Improving factuality and reasoning in language models through multiagent debate</article-title>
          ,
          <source>arXiv preprint arXiv:2305.14325</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>