<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discourse Based Argumentation Analysis for LLM Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris Galitsky</string-name>
        </contrib>
        <aff>Knowledge Trail Inc., San Jose, CA, USA</aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs) frequently produce fluent but unverifiable reasoning, resulting in potential hallucinations and faulty inferences. This study proposes an argumentation-based verification framework, ValidArgLLM, in which the reasoning expressed by an LLM is transformed into a defeasible logic program (DLP) representing world knowledge and a given problem description, such as a patient health complaint. The DLP is executed within a symbolic reasoning engine, and the resulting inferences are compared to the LLM's natural-language conclusions. The strength of arguments is computed from the discourse structure of the text expressing them. Divergence between symbolic and neural reasoning outcomes indicates possible hallucination or inconsistency in the model's internal logic.</p>
      </abstract>
      <kwd-group>
        <kwd>Hallucinations</kwd>
        <kwd>faulty inferences</kwd>
        <kwd>argumentation-based verification</kwd>
        <kwd>Defeasible Logic Program (DeLP)</kwd>
        <kwd>discourse analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have achieved impressive results across diverse natural language
processing tasks, inspiring interest in their use for domains that require structured reasoning.
However, integrating LLMs into settings that demand context-sensitive, multi-step decision-making
remains challenging. These models often excel at generating fluent and informed text but struggle
to reason systematically, weigh competing possibilities, or revise conclusions when new information
appears [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Their reasoning is associative rather than strategic, limiting their ability to handle
complex, evolving problems.
      </p>
      <p>Another major limitation is interpretability. Unlike human experts who reason through explicit,
traceable arguments, LLMs reach conclusions through opaque statistical processes [22]. This opacity
makes it difficult to understand or justify their outputs, reducing trust and accountability in
applications that require verifiable logic [23]. Furthermore, LLMs are prone to reasoning
hallucinations—producing statements that sound coherent but conflict with facts or internal logic.
Because they lack explicit mechanisms for defeasibility or conflict resolution [18, 19, 20, 21], such
inconsistencies can undermine reliability. To address these gaps, LLMs must be complemented by
external reasoning and verification layers capable of enforcing logical consistency and explaining
why conclusions hold.</p>
      <p>
        One promising approach is to pair an LLM with a symbolic or logical reasoning engine—for
example, a Prolog-style rule base, a constraint solver, or a formal ontology of medical conditions [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref9">9,
10, 11, 12, 19</xref>
        ]. The LLM produces a candidate answer, while the reasoning system independently
checks whether that answer follows from known facts and rules. This creates a dual-track pipeline:
the generative model proposes, the logical module disposes. Such a framework can flag
contradictions (e.g., a diagnosis incompatible with the patient’s lab values), highlight unsupported
steps in a rationale, or even suggest corrected outputs when classical reasoning yields a different
conclusion. Over time, it can also feed back into training, teaching the LLM to prefer responses that
survive external verification.
      </p>
      <p>To mitigate these risks, we introduce a neuro-symbolic verification framework ValidArgLLM that
externalizes and tests an LLM’s reasoning through argumentation analysis. The key idea is to
translate the model’s implicit causal and conditional reasoning into a formal system—defeasible logic
programming (DeLP)—and verify its conclusions via a logical solver.</p>
      <p>
        Despite significant advancements in task performance, current approaches to argumentative
reasoning continue to face critical limitations in terms of justifiability and interpretability. While
models can often produce seemingly coherent explanations, it remains ambiguous how final
decisions are actually reached or which intermediate steps genuinely contributed to the outcome.
Recent studies have revealed that Chain-of-Thought (CoT) reasoning, though designed to enhance
transparency, is itself prone to hallucinations, inconsistencies, and spurious reasoning paths that do
not accurately represent the underlying inference mechanisms of large language models [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. These
findings call into question the validity of CoT traces as reliable post hoc justifications of model
behavior rather than mere rhetorical artifacts. Moreover, in multi-agent or multi-LLM debate
settings, the reasoning process becomes even more opaque: the discussions between models are
typically unstructured, non-formalized, and difficult to audit, making it nearly impossible to
systematically reconstruct the logic that led to a collective conclusion. This lack of traceability
undermines confidence in the epistemic soundness of model-driven argumentation and highlights
the need for structured frameworks capable of capturing, verifying, and explaining the reasoning
trajectories that lead to final decisions.
      </p>
      <p>
        Figure 1 compares four levels of reasoning used by language models, showing how they evolve
from simple answers to structured, self-correcting discourse-based reasoning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At the top, the
direct answer model provides an immediate response without explanation. It may be correct or
incorrect, but there is no visibility into why the model chose that answer. Because no reasoning steps
are revealed, errors cannot be traced or corrected.
      </p>
      <p>The next level, chain-of-thought reasoning, adds a sequence of intermediate steps that make the
process more interpretable. However, these steps remain unverified. The reasoning might sound
plausible while still being factually wrong, since the model does not test or challenge its own
statements.</p>
      <p>The argumentative model introduces a more structured approach. Instead of producing one line
of reasoning, it generates multiple arguments, distinguishing between supporting and attacking ones
(Gutiérrez et al. 2024). This enables a form of contestability: each conclusion is backed by explicit
evidence and can be challenged by counterarguments. Still, this stage only formalizes reasoning—it
does not validate or improve it. The model outputs argument structures but lacks a mechanism to
revise its own conclusions.</p>
      <sec id="sec-1-1">
        <title>Examples of nucleus and satellite pairs</title>
        <p>Nucleus: The patient reports severe knee pain. Satellite: Whereas the ankle pain has largely subsided.</p>
        <p>Nucleus: The patient experienced a heart attack. Satellite: Specifically, a non-ST elevation infarction affecting the inferior myocardial wall.</p>
        <p>Nucleus: The patient’s cholesterol level improved. Satellite: Although his diet compliance was inconsistent.</p>
        <p>Nucleus: The surgeon performed an emergency appendectomy. Satellite: The patient had arrived at the hospital only two hours earlier.</p>
        <p>Nucleus: The nurse administered a sedative. Satellite: To facilitate the insertion of a central venous catheter.</p>
        <p>Nucleus: The patient is diagnosed with pneumonia. Satellite: Supported by chest X-ray showing bilateral infiltrates.</p>
        <p>Nucleus: The treatment outcome is considered successful. Satellite: According to the hospital’s quality benchmarks.</p>
        <p>The discourse-based argument validation model ValidArgLLM at the bottom represents the most
advanced form. It integrates discourse structure to evaluate how strongly each argument contributes
to the overall reasoning. By analyzing rhetorical relations such as elaboration, justification, or
background, it can weigh the importance of each argument and decide which should dominate the
conclusion. This allows the system to detect when the original model’s answer is inconsistent with
the discourse-level balance of evidence and automatically correct it. Over time, this validation loop
enables the model to refine its reasoning and become more consistent and interpretable. This
architecture adds discourse awareness, verifiable structure, and self-correction. It moves beyond
merely producing or scoring arguments—it understands how arguments interact and uses that
understanding to ensure that reasoning outcomes are both coherent and defensible.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Discourse and Defeasibility</title>
      <p>Each rhetorical relation has a nucleus (the “main” proposition) and a satellite (supporting or
contextual material). The satellite always carries less essential information than the nucleus (Table
1): the nucleus contains the main diagnostic or treatment fact (higher base probability), while the
satellite carries contextual or supporting information of lower significance. These values are obtained
in the course of improving validation performance, as described in the Evaluation section.</p>
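      <p>For illustration, such per-relation nucleus/satellite coefficients can be stored as a lookup table. The relation names below follow common RST practice, and the numeric values are hypothetical placeholders rather than the tuned values reported in the Evaluation section:</p>
      <p>```python
# Hypothetical nucleus/satellite base weights per rhetorical relation.
# Actual values are tuned during validation; these are placeholders only.
RELATION_WEIGHTS = {
    # relation:    (nucleus_weight, satellite_weight)
    "Evidence":    (0.9, 0.6),
    "Cause":       (0.9, 0.5),
    "Concession":  (0.8, 0.4),
    "Elaboration": (0.8, 0.3),
    "Background":  (0.7, 0.2),
}

def base_weight(relation: str, is_nucleus: bool) -> float:
    """Base weight of an EDU given its rhetorical role in the relation."""
    nucleus_w, satellite_w = RELATION_WEIGHTS.get(relation, (0.8, 0.4))
    return nucleus_w if is_nucleus else satellite_w
```</p>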
      <sec id="sec-2-1">
        <title>Hence the rules are:</title>
        <p>1) strict rules (must hold, nucleus) Head &lt;- Body.
2) defeasible rules (should hold, satellite) Head &lt;~ Body.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Facts are either strict or defeasible:</title>
        <p>fact. % strict fact vs
fact &lt;~ . % defeasible fact
We now show how a defeasible logic program in ValidArgLLM would be built from a real nucleus
vs. satellite pair.</p>
        <p>Nucleus (main claim): “The patient must immediately start a course of antibiotics for bacterial
pneumonia.”
Satellite (supporting context): “Because the chest X-ray shows an infiltrate consistent with
pneumonia.”</p>
        <p>Here the nucleus expresses a mandatory action (antibiotics). The satellite is evidence/justification
(X-ray finding) — useful, but not itself the main prescription.</p>
        <p>In a defeasible logic program:
% Strict rule from the nucleus: MUST do this if bacterial pneumonia diagnosed
start_antibiotics(Patient) &lt;- diagnosis(Patient, bacterial_pneumonia).
% Defeasible rule from the satellite: SHOULD suspect pneumonia if X-ray shows infiltrate
diagnosis(Patient, bacterial_pneumonia) &lt;~ chest_xray_infiltrate(Patient).
% Facts:
chest_xray_infiltrate(john) &lt;~ . % defeasible evidence
The nucleus states the obligatory/primary outcome, while the satellite is only supportive, so its rule
is defeasible/overridable.</p>
        <p>The nucleus carries the main point of the discourse. In DLP you can treat it as a strict rule or a
strict fact, because it is asserted to hold independently of the supporting material.
diagnosis(Patient, bacterial_pneumonia).</p>
        <p>Even if we later remove the satellite (the X-ray), the nucleus stands on its own.</p>
        <p>The satellite carries supporting or contextual information. In DLP you model it as a defeasible rule
whose conclusion only fires under the context of the nucleus (or some nucleus-derived condition).</p>
      </sec>
      <sec id="sec-2-3">
        <title>This makes it conditional:</title>
        <p>% satellite only applies if nucleus (diagnosis) is already true
chest_xray_infiltrate(Patient) &lt;~ diagnosis(Patient, bacterial_pneumonia).</p>
        <p>More generally: satellite_fact(X) &lt;~ nucleus_fact(X).</p>
        <p>So, in ValidArgLLM the satellite is not automatically accepted; it’s accepted defeasibly and only
when its nucleus context holds. In a defeasible logic program derived from discourse, nucleus
statements are converted into strict rules or strict facts that hold independently. Satellite statements
are converted into defeasible rules whose applicability is conditional on the corresponding nucleus
being true; they express “should” or “likely” information that can be overridden or becomes vacuous
if the nucleus is absent.</p>
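        <p>The asymmetry can be sketched in a few lines of Python (used here in place of a DeLP engine); the predicate names mirror the pneumonia example above, and the point is that the nucleus-derived conclusion survives when the satellite rule is dropped:</p>
        <p>```python
# Propositional sketch: strict rules (from nuclei) vs. defeasible rules
# (from satellites). Rules are (head, body) pairs over atom names.

def derive(strict_facts, strict_rules, defeasible_rules):
    """Naive forward chaining over strict and defeasible rules."""
    known = set(strict_facts)
    changed = True
    while changed:
        changed = False
        for head, body in strict_rules + defeasible_rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

strict_facts = {"diagnosis_bacterial_pneumonia"}  # nucleus: strict fact
strict_rules = [("start_antibiotics", ["diagnosis_bacterial_pneumonia"])]
defeasible_rules = [("chest_xray_infiltrate", ["diagnosis_bacterial_pneumonia"])]

with_satellite = derive(strict_facts, strict_rules, defeasible_rules)
without_satellite = derive(strict_facts, strict_rules, [])

# The nucleus conclusion stands even without the satellite rule.
assert "start_antibiotics" in with_satellite
assert "start_antibiotics" in without_satellite
assert "chest_xray_infiltrate" not in without_satellite
```</p>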
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Enabling DeLP with Argument Strength Computation</title>
      <p>Given a natural-language case description, the LLM within ValidArgLLM constructs a set of
defeasible rules encoding its inferred world knowledge:
r1: gout :- asymmetric_joint_inflammation, uric_acid_high.
r2: immune_arthritis :- symmetric_joint_inflammation, fever.
r3: asymmetric_joint_inflammation :- not symmetric_joint_inflammation.
r4: prefer immune_arthritis over gout if fever.</p>
      <p>Here, each rule expresses a conditional belief that may be overridden by stronger evidence.
A defeasible logic program is a set of facts, strict rules Π of the form A :- B, and a set of defeasible
rules Δ of the form A -&lt; B, whose intended meaning is “if B is the case, then usually A is also the
case.” A DeLP for knowledge sources includes facts, which are extracted from search results, and strict
and defeasible clauses whose head and body form commonsense reasoning rules (Garcia and
Simari, 2004).</p>
      <p>Let DT = (N, R) be a discourse tree, where N is the set of elementary discourse units (EDUs), and R ⊆ N × N
is the set of rhetorical relations between nucleus and satellite spans. Each relation
ρ = (nucleus, satellite, relation) ∈ R has a rhetorical type (e.g., Cause, Evidence, Elaboration, Concession)
and an associated relative argument strength coefficient α_nucleus ∈ [0, 1], representing how much more
influential the nucleus is compared to its satellite.</p>
      <p>For every defeasible rule A -&lt; B1, B2, …, Bk ∈ Δ we associate a discourse strength weight w(A) ∈ [0, 1],
computed as w(A) = (1 / |Sat(A)|) · Σ_{(nucleus, satellite) ∈ R_A} α_nucleus,
where R_A is the set of discourse relations in which A participates, and Sat(A) is the set of its supporting
satellites. Thus, rules derived from nucleus EDUs obtain higher w(A) values (closer to 1), while rules
from satellite EDUs obtain lower w(A) values, proportionally to their rhetorical importance.
Let P=(Π, Δ) be a DeLP program and L a ground literal. A defeasible derivation of L from P consists
of a finite sequence L1, L2, …, Ln of ground literals, such that each literal Li is in the sequence because:</p>
      <sec id="sec-3-1">
        <title>Defeasible derivation and arguments</title>
        <p>1. Li is a fact in Π, or
2. there exists a rule Ri in P (strict or defeasible) with head Li and body B1, B2, …, Bk, and every
literal Bj of the body is an element of the sequence appearing before Li.</p>
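        <p>This derivation procedure can be sketched in Python, using rules r1 and r2 from the example above (r3’s negation-as-failure and the r4 preference, which is resolved dialectically, are omitted for brevity):</p>
        <p>```python
# Sketch of a defeasible derivation check: a literal enters the sequence
# if it is a fact, or the head of a rule whose body literals all appear
# earlier in the sequence.

def defeasible_derivation(goal, facts, rules):
    """Return a derivation sequence for `goal`, or None if none exists."""
    sequence = list(facts)
    changed = True
    while changed and goal not in sequence:
        changed = False
        for head, body in rules:
            if head not in sequence and all(b in sequence for b in body):
                sequence.append(head)  # body elements precede the head
                changed = True
    return sequence if goal in sequence else None

rules = [
    ("gout", ["asymmetric_joint_inflammation", "uric_acid_high"]),
    ("immune_arthritis", ["symmetric_joint_inflammation", "fever"]),
]
facts = ["symmetric_joint_inflammation", "fever"]

print(defeasible_derivation("immune_arthritis", facts, rules))
# prints ['symmetric_joint_inflammation', 'fever', 'immune_arthritis']
print(defeasible_derivation("gout", facts, rules))  # prints None
```</p>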
        <p>Let h be a literal, and P=(Π, Δ) a DeLP program. We say that &lt;A, h&gt; is an argument for h
if A is a set of defeasible rules of Δ such that: (1) there exists a defeasible derivation for h from Π ∪ A;
(2) Π ∪ A is non-contradictory; and (3) A is minimal, i.e., no proper subset of A satisfies (1) and (2).
Hence an argument &lt;A, h&gt; is a minimal noncontradictory set of defeasible rules, obtained from a
defeasible derivation for a given literal h associated with a program P.</p>
        <p>The generated DLP is executed within a defeasible reasoning environment implemented as an
SWI-Prolog extension. Each rule is treated as an argument, and conflicts among arguments give rise to a
dialectical structure that determines which conclusions are warranted under a chosen semantics (e.g.,
grounded or preferred extension).</p>
        <p>The logical conclusions derived by the solver are compared with the LLM’s original verbal output.
If both converge (e.g., both conclude immune arthritis), the LLM’s reasoning is considered grounded.
If they diverge (e.g., solver: immune arthritis, LLM: gout), the discrepancy signals a reasoning
hallucination—a claim unsupported by the formal reconstruction of its own reasoning.</p>
        <p>
          We represent a quantitative bipolar argumentation framework (QBAF [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) denoted as a quadruple
⟨A,R−,R+,τ⟩ via DeLP with discourse features indicating an argument strength. This framework
includes a finite set of arguments A, disjoint binary relations of attack R− ⊆ A × A and support R+
⊆ A × A, and an argument strength function τ : A → [0, 1].
        </p>
        <p>
          Gradual semantics recursively compute an argument’s dialectical strength by combining its base
score with the aggregated strengths of its attackers and supporters. Given a gradual semantics, such
as DF-QuAD [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], denoted σ, each argument α ∈ A obtains a strength σ(α) ∈ [0, 1]. ValidArgLLM combines
discourse-based argument strength with gradual argument semantics in the following way:
1. The LLM produces an argumentation tree B whose root is x. Every other node is an argument
generated by G. Each node has a single attacker and a single supporter argument pointing to it.
2. Intrinsic argument strength attribution E(B) → Q assigns a base score Q to every node
via some evaluator model E.
3. Argumentative strength calculation Σ(Q) → Q(x) applies a gradual semantics σ to add the
argument strength to the discourse strength, starting from the root claim: τ = σ + w(x).
4. Claim verification prediction g(Q(x)) → {true, false} predicts the final result: the claim is true
when Q(x) ≥ 0.5 and false otherwise.
        </p>
        <p>The strength aggregation function is defined as s : [0, 1]* → [0, 1], where for a sequence of arguments
S = (v1, . . . , vn) ∈ [0, 1]*: if n = 0: s(S) = 0; if n = 1: s(S) = v1; if n = 2: s(S) = f(v1, v2);
if n &gt; 2: s(S) = f(s(v1, . . . , vn−1), vn), with the base function f : [0, 1] × [0, 1] → [0, 1] defined,
for v1, v2 ∈ [0, 1], as:
f(v1, v2) = v1 + (1 − v1)⋅v2 = v1 + v2 − v1⋅v2.</p>
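        <p>The base function f and the aggregation s can be implemented directly. The df_quad helper below additionally shows the standard DF-QuAD influence step that combines a base score with the aggregated attacker and supporter strengths; how ValidArgLLM folds in the discourse weight w(x) is described in the text and is not reproduced here:</p>
        <p>```python
# Probabilistic-sum aggregation used by DF-QuAD-style gradual semantics.

def f(v1: float, v2: float) -> float:
    """Base function: f(v1, v2) = v1 + (1 - v1) * v2 = v1 + v2 - v1*v2."""
    return v1 + (1 - v1) * v2

def s(values):
    """Aggregate a sequence of attacker or supporter strengths toward 1."""
    if not values:
        return 0.0
    acc = values[0]
    for v in values[1:]:
        acc = f(acc, v)
    return acc

def df_quad(base, attackers, supporters):
    """Move the base score toward 0 or 1 by the net attack/support."""
    va, vs = s(attackers), s(supporters)
    if va >= vs:
        return base - base * (va - vs)      # net attack weakens
    return base + (1 - base) * (vs - va)    # net support strengthens

print(s([0.5, 0.5]))            # prints 0.75
print(df_quad(0.5, [0.6], []))  # prints 0.2
```</p>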
        <p>
          Thus, the base function expresses sequences of strengths of attackers or supporters by s,
proportionally increasing the attacking or supporting arguments’ strength towards 1 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The ValidArgLLM Verification Pipeline</title>
      <p>Fig. 2 illustrates a hybrid neuro-symbolic diagnostic reasoning pipeline where an LLM and a
Defeasible Logic Programming (DeLP) reasoning engine work together to verify or refute an
LLM-generated medical diagnosis. The goal of the architecture is to ensure that an LLM’s diagnostic
answer (for example, “The patient has gout”) is not only linguistically plausible but also logically
justified and consistent with a structured medical ontology. If the logical reasoning pipeline cannot
defeat the diagnosis claim, it is confirmed; otherwise, the LLM answer is marked unconfirmed.
The ValidArgLLM‘s workflow is as follows:
1. User Input. The process begins with a user request, such as asking the model to provide a
diagnosis.
2. LLM Generates Initial Answer. The LLM processes the request and outputs an initial diagnosis
or conclusion (e.g., “The disease is gout”).
3. Ontology and Discourse Setup. A textual ontology of medical knowledge (rules, relationships,
symptoms, conditions) and a discourse parser are available. The discourse parser identifies
rhetorical relations in the text — for instance, nucleus (main facts) and satellite (contextual or
defeasible facts). Nucleus → “Must” clauses (non-defeasible rules) and Satellite → “Should” clauses
(defeasible rules).
4. LLM forms ontology representation, transforming textual information into a rule-based ontology
in the DeLP format — essentially translating natural-language reasoning into structured logical
rules.
5. Conversion to defeasible logic. The LLM converts these rules into regular (strict) or defeasible
(soft) clauses depending on their role in the discourse (main vs. secondary information).
6. Building logical representation of the user request. The system formalizes the user’s question
and the LLM’s proposed answer as a set of logical facts that can either be defeated or supported
by the ontology.
7. Discourse representation integration. The LLM builds discourse representations of both the user
request and the ontology, capturing which arguments are more or less important
(nucleus/satellite weighting) and how they relate contextually.
8. Argumentation Pipeline. The argumentation module computes attack relations among rules
(contradictions or counter-arguments), dialectical trees, representing possible argumentative
dialogues between supporting and opposing claims, and defeasibility outcomes, determining
whether a claim survives all counter-arguments.
9. Comparison and validation. The LLM compares its original diagnosis with the verified diagnosis
obtained through the logical argumentation process.
10. Decision. If the logical reasoning shows that the diagnosis claim is not defeated, it is confirmed
as valid. If the claim is defeated by stronger counter-arguments from the ontology, it is marked
unconfirmed.</p>
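      <p>The ten steps can be condensed into an orchestration sketch. Every function and method name below (generate, extract_ontology, formalize, query, and so on) is a hypothetical placeholder standing in for an LLM call or a DeLP engine invocation, not an actual ValidArgLLM API:</p>
      <p>```python
# Hypothetical orchestration of the ValidArgLLM workflow (steps 1-10).

def validate_answer(user_request, llm, delp_engine, discourse_parser):
    answer = llm.generate(user_request)              # steps 1-2: initial diagnosis
    edus = discourse_parser.parse(user_request)      # step 3: nucleus/satellite EDUs
    rules = llm.extract_ontology(user_request)       # step 4: textual ontology -> rules
    program = []                                     # step 5: strict vs. defeasible
    for rule, edu in zip(rules, edus):
        kind = "strict" if edu.is_nucleus else "defeasible"
        program.append((kind, rule))
    claim = llm.formalize(answer)                    # step 6: logical form of the answer
    weights = {e.id: e.discourse_weight for e in edus}    # step 7: discourse weighting
    verdict = delp_engine.query(program, claim, weights)  # steps 8-9: dialectical check
    return answer if verdict == "warranted" else "unconfirmed"  # step 10
```</p>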
      <sec id="sec-4-1">
        <title>This architecture combines:</title>
        <p>• Neural generation (LLM) → producing hypotheses and natural-language reasoning.</p>
        <p>• Symbolic verification (DeLP) → testing those hypotheses for logical soundness.</p>
        <p>• Discourse analysis → weighting arguments by rhetorical importance.</p>
        <p>Together, these components produce a contestable and interpretable diagnostic system, where
the model’s answer can be justified or overturned through structured reasoning.</p>
        <p>The demo of argument-based LLM verifier is available at Tool Series for LLM Verification (Figure
3 and Figure 4).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        We evaluate on three claim-verification datasets that we derive from existing QA/NLI resources:
TruthfulHalluc (from TruthfulQA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), MedHalluc (from MedQA; [15] and PubMedQA; [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), and
eSNLI_Halluc (from eSNLI; [17]). For each source, we convert items into question–answer (QA) style
pairs and then inject controlled inconsistencies by appending randomly sampled, semantically
incompatible attributes (facts, circumstances, symptoms). These perturbations create positive
“hallucination” cases; unmodified items serve as negatives.
      </p>
      <p>Our focus is hallucination detection for model answers using argumentation analysis as a
validator. The validator assesses whether an answer’s central claim is defeated by the
argument-validation system. We define a hallucination as a claim whose defeat probability exceeds 0.5. This
cautious threshold is motivated by safety-critical domains (health, legal, finance), where we prefer
to reject answers that are defeated with substantial probability.</p>
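      <p>Operationally, this decision rule reduces to a single threshold check (a sketch; the defeat probability itself is produced by the argument-validation module):</p>
      <p>```python
# Cautious cutoff motivated by safety-critical domains (health, legal, finance).
HALLUCINATION_THRESHOLD = 0.5

def is_hallucination(defeat_probability: float) -> bool:
    """Flag a claim whose defeat probability exceeds the threshold."""
    return defeat_probability > HALLUCINATION_THRESHOLD
```</p>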
      <p>Dataset size and prevalence are as follows. Each hallucination dataset contains 1,000 QA pairs
with a 50% hallucination rate (balanced positives/negatives). In the original source datasets the
natural hallucination rate is &lt;1%; our perturbation procedure raises prevalence to enable meaningful
detection metrics and comparability with prior LLM-argumentation studies.</p>
      <p>We report F1 for hallucination prediction in Table 2. It lists: (i) a GPT-5 baseline; (ii) our
claim-verification tool that uses argument validation; and (iii) a discourse-aware variant where argument
strength additionally incorporates discourse cues beyond the default computation. This final column
shows the incremental gains from discourse-informed argument strength.</p>
      <p>
        Our MedHalluc results are broadly comparable to prior work: ArgMed-Agents with GPT-4 reports
0.91 predictive accuracy [16]; ArgLLM with GPT-4o reports 0.80 (Friedman et al., 2015); and an
ensemble of ArgLLMs achieves 0.73 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. That said, these systems estimate claim truthfulness,
whereas our study predicts hallucination via whether a claim is defeated by the argument-validation
module, so the targets differ and the numbers are not strictly comparable.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>
        The approach of Bezou-Vrakatseli (2023) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] leverages argument schemes—structured templates
that capture common patterns of reasoning—and their associated critical questions, which probe the
assumptions, exceptions, and contextual factors underlying those schemes. By using these as a
framework for classifying and analyzing arguments, the method provides a semantically richer
alternative to surface-level textual analysis. In the context of LLM verification, this enables
evaluators to assess not just whether an LLM produces grammatically or factually correct responses,
but whether it constructs logically sound, ethically nuanced arguments that align with established
norms of rational discourse. The critical questions act as a diagnostic tool, revealing whether the
model truly understands the reasoning behind ethical positions or is merely mimicking
plausible-sounding rhetoric.
      </p>
      <p>This verification strategy directly supports the project’s broader goal of fostering ethical debate
between humans and AI systems. By evaluating LLMs through the lens of argumentation theory,
researchers can determine how well these models engage in principled reasoning about moral
dilemmas, identify potential biases or logical fallacies, and measure their capacity to both construct
and critique ethical arguments. Ultimately, this contributes to “ethics for AI”—ensuring AI systems
behave responsibly—and “AI for ethics”—using AI as a tool to help humans reflect on and refine their
own ethical reasoning. Such a dual focus positions LLMs not just as information providers, but as
collaborative partners in navigating complex moral questions.</p>
      <p>
        Earlier approaches to argumentative reasoning, such as ArgLLMs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], determined argument
quality scores using the confidence of the argument-generating model itself. These systems treated
the model’s internal probability estimates as proxies for argument plausibility, thereby grounding
evaluation in model-intrinsic uncertainty rather than discourse-level coherence. Subsequent
research adopted reward-model-inspired scoring [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], introducing an external evaluator LLM to
assign quality scores. Two main setups were explored: (i) Estimated Arguments, where supporting
arguments were scored while the root claim remained fixed at 0.5; and (ii) Estimated All, where both
root claims and subordinate arguments were assessed via discrete truth and certainty labels mapped
to continuous scores. Also, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] showed that MArgE can significantly outperform single LLMs,
including three open source models (4B to 8B parameters), and existing ArgLLMs, as well as prior
methods for unstructured multi-LLM debates.
      </p>
      <p>These techniques diversified the evaluation signal and reduced model-specific bias but remained
primarily semantic—focused on the content of individual arguments rather than their rhetorical role
in a discourse structure. In contrast, our work introduces a discourse-based scoring framework that
evaluates arguments in the context of their rhetorical relations and structural significance within the
discourse tree. Instead of relying solely on model-elicited or reward-style judgments, scores are
inferred from the hierarchical organization of argumentative elements—linking claims, evidence, and
counterarguments through nucleus–satellite dependencies and coherence relations. This approach
enables the system to weight contributions based on discourse salience rather than raw textual
confidence, thereby capturing how strongly each component supports or undermines the overall
claim.</p>
      <p>While reward-model scoring focuses on factual adequacy and semantic quality, our
discourse-based approach emphasizes relational justifiability, integrating structural reasoning to yield a more
explainable and linguistically grounded measure of argument strength.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The argument analysis tool belongs to a series of LLM verification tools including logic
programming, answer set programming, and rule attenuation (Fig. 3). All these tools rely on discourse
analysis to determine rule structure and weights in order to build a logic program representation. We
observed that LLMs can be reliably verified by discourse-based argumentation analysis.</p>
      <p>Our evaluation demonstrates that integrating argument-validation into LLM pipelines
substantially improves hallucination detection across three newly constructed claim-verification
datasets: TruthfulHalluc, MedHalluc, and eSNLI_Halluc. By transforming QA/NLI sources into
structured QA pairs and injecting semantically incompatible attributes, we create balanced datasets
where hallucinations correspond to claims defeated by an argumentation engine. Using a
defeat-probability threshold of 0.5—chosen for safety-critical settings—the argument-validation module
yields consistent gains over a GPT-5 baseline. Across datasets, ValidArgLLM increases F1 by +0.15–
0.20, with the largest improvements observed in medically grounded reasoning tasks where
unsupported causal links and symptom inferences are more common. These results highlight that
logical defeat, rather than surface-level confidence, provides a more robust criterion for identifying
model errors in multi-step explanatory contexts.</p>
      <p>Adding discourse-aware argument strength further strengthens performance for domains where
rhetorical centrality matters. Incorporating nucleus–satellite weighting and discourse-relation cues
yields additional gains of +0.03–0.06 F1, reaching 0.78 on TruthfulHalluc and 0.80 on MedHalluc.
While superficially comparable to prior systems such as ArgMed-Agents (0.91), these approaches
estimate truthfulness, whereas our model predicts hallucination by determining whether the
answer’s core claim is logically defeated. The distinction is crucial: truth-prediction assumes access
to ground-truth facts, whereas defeat-based hallucination detection evaluates internal argumentative
consistency. Our results thus establish discourse-augmented argumentation analysis as an effective,
model-agnostic verification layer for improving LLM reliability in high-stakes reasoning
environments.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used GPT-5 for grammar and spelling checking and for paraphrasing and rewording. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the publication’s content.</p>
      <p>[15] Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P (2020) What disease does this patient have? A large-scale open domain question answering dataset from medical exams. CoRR, abs/2009.13081.</p>
      <p>[16] Hong S, Xiao L, Zhang X, Chen J (2024) ArgMed-Agents: Explainable Clinical Decision Reasoning with LLM Discussion via Argumentation Schemes.</p>
      <p>[17] Camburu O-M, Rocktäschel T, Lukasiewicz T, Blunsom P (2018) e-SNLI: Natural Language Inference with Natural Language Explanations. Advances in Neural Information Processing Systems 31 (NeurIPS 2018).</p>
      <p>[18] Ruiz-Dolz R, Lawrence J (2023) Detecting Argumentative Fallacies in the Wild: Problems and Limitations of Large Language Models. In Proceedings of the 10th Workshop on Argument Mining, pages 1–10, Singapore. Association for Computational Linguistics.</p>
      <p>[19] Xu Z, Jain S, Kankanhalli M (2024) Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.</p>
      <p>[20] Banerjee S, Agarwal A, Singla S (2024) LLMs Will Always Hallucinate, and We Need to Live With This. arXiv:2409.05746.</p>
      <p>[21] Gutiérrez A, Heras S, Palanca J (2024) Detecting disinformation through computational argumentation techniques and large language models. CMNA 2024.</p>
      <p>[22] Musi E, Palmieri R (2024) The Fallacy of Explainable Generative AI: evidence from argumentative prompting in two domains. CMNA 2024.</p>
      <p>[23] Prakken H (2024) On Evaluating Legal-Reasoning Capabilities of Generative AI. CMNA 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bezou-Vrakatseli</surname>
            <given-names>E</given-names>
          </string-name>
          (
          <year>2023</year>
          )
          <article-title>Evaluation of LLM Reasoning via Argument Schemes</article-title>
          .
          <source>Online Handbook of Argumentation for AI</source>
          , Vol.
          <volume>4</volume>
          p 1
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Louis</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nenkova</surname>
            <given-names>A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>A Coherence Model Based on Syntactic Patterns</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          , pages
          <fpage>1157</fpage>
          -
          <lpage>1168</lpage>
          ,
          Jeju Island, Korea. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dejl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gorur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rago</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2025</year>
          .
          <article-title>Argumentative Large Language Models for Explainable and Contestable Claim Verification</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <volume>39</volume>
          (
          <issue>14</issue>
          ):
          <fpage>14930</fpage>
          -
          <lpage>14939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Rago</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Toni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Aurisicchio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Discontinuity-Free Decision Support with Quantitative Argumentation Debates</article-title>
          .
          <source>In Principles of Knowledge Representation and Reasoning: Proceedings of the Fifteenth International Conference, KR</source>
          <year>2016</year>
          ,
          <volume>63</volume>
          -
          <fpage>73</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ng</surname>
            <given-names>MP</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Junqi</given-names>
            <surname>Jiang</surname>
          </string-name>
          and Gabriel Freedman and Antonio Rago and Francesca Toni.
          <article-title>MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification</article-title>
          , arXiv:2508.02584
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Arcuschin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Janiak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Krzyzanowski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rajamanoharan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nanda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Conmy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2025</year>
          .
          <article-title>Chain-of-thought reasoning in the wild is not always faithful</article-title>
          .
          <source>ICLR 2025 Reasoning and Planning for LLMs Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Barez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>T.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Arcuschin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Siegel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Collignon</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Neo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Paren</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; et al.
          <year>2025</year>
          .
          <article-title>Chain-of-thought is not explainability</article-title>
          . Preprint, arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lambert</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pyatkin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Morrison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Miranda</surname>
            ,
            <given-names>L. J. V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>B. Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chandu</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dziri</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zick</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hajishirzi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2025</year>
          .
          <article-title>RewardBench: Evaluating Reward Models for Language Modeling</article-title>
          .
          <source>In Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2025</year>
          ,
          <volume>1755</volume>
          -
          <fpage>1797</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Kaminski</surname>
            ,
            <given-names>Roland</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Wanko</surname>
            ,
            <given-names>Philipp</given-names>
          </string-name>
          .
          (
          <year>2017</year>
          ).
          <article-title>A Tutorial on Hybrid Answer Set Solving with clingo</article-title>
          .
          <source>In: Reasoning Web. Semantic Interoperability on the Web</source>
          (pp.
          <fpage>167</fpage>
          -
          <lpage>203</lpage>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <year>2004</year>
          .
          <article-title>Defeasible logic programming: an argumentative approach</article-title>
          .
          <source>Theory Pract. Log. Program. 4</source>
          ,
          <fpage>95</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ferrag</surname>
            <given-names>MA</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Norbert</given-names>
            <surname>Tihanyi</surname>
          </string-name>
          , Merouane Debbah,
          <article-title>Reasoning beyond limits: Advances and open problems for LLMs</article-title>
          ,
          <source>ICT Express</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Galitsky</surname>
            <given-names>B</given-names>
          </string-name>
          (
          <year>2025</year>
          )
          <article-title>Enabling large language model with plug-and-play symbolic reasoning components</article-title>
          .
          <source>In Health Apps of Neuro-symbolic AI</source>
          , Elsevier pp
          <fpage>59</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Lin</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Hilton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>TruthfulQA: Measuring how models mimic human falsehoods</article-title>
          .
          <source>CoRR, abs/2109.07958</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhingra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W. W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , “
          <article-title>PubMedQA: A dataset for biomedical research question answering</article-title>
          ,”
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>