<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Assessing Logical Inference Capabilities of Large Language Models through RDF Schema Entailment Rules: A Multi-Level Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taichi Hosokawa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sudesna Chakraborty</string-name>
          <email>sudesna@it.aoyama.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takeshi Morita</string-name>
          <email>morita@it.aoyama.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aoyama Gakuin University</institution>
          ,
          <addr-line>Kanagawa</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Advanced Industrial Science and Technology</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large language models (LLMs) achieve strong performance on various language tasks, yet their logical inference abilities remain limited. LLMs often rely on pre-trained knowledge rather than explicit inference, and their inference capabilities in ontology languages such as RDFS remain underexplored. This study evaluates LLMs' inference abilities using RDFS entailment rules with two knowledge datasets: real-world data from Linked Open Data and counterfactual data created by systematically altering real-world facts. We propose a novel evaluation methodology for assessing LLM outputs. To analyze inference behavior under different conditions, we design a three-level task framework that varies how rules are presented for identical inference tasks. Results show high accuracy on real-world datasets. LLMs sometimes infer missing premises using pre-trained knowledge, suggesting potential in incompletely structured environments. However, accuracy declines on counterfactual datasets and when shifting from pre-combined rules to multiple separate rules. Performance drops further when models must select appropriate rules from predefined subsets. These findings highlight both the strengths and limitations of LLMs in structured, rule-based inference within ontology-driven systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>RDF Schema entailment rules</kwd>
        <kwd>logical inference capability</kwd>
        <kwd>counterfactual knowledge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In recent years, large language models (LLMs) have demonstrated substantial capabilities across natural
language processing tasks, yet their capacity for logical inference remains under scrutiny [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLMs
tend to leverage pretrained knowledge rather than explicit logical inference, leading to evaluation
studies under counterfactual conditions [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ]. While existing assessments using RDFS inference rules
have relied on natural language-based methods, such approaches present challenges for rigorous
verification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To address this, we propose a novel evaluation framework that rigorously assesses
LLM logical inference abilities using RDFS entailment rules. Our framework assesses LLM inference
capabilities through real-world and counterfactual knowledge datasets, different rule presentation
methods, and multi-rule combinations, using RDFS triples for precise, quantitative evaluation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Wu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigated whether pre-trained language models can store, understand, and utilize
ontological knowledge for inference. They evaluated inference tasks based on six RDFS entailment
rules using DBpedia [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Wikidata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] data. However, their evaluation revealed significant limitations in accuracy for
complex inferences. Ozeki et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] assessed the syllogistic inference abilities of LLMs
and showed performance declines on tasks involving belief-incongruent (counterfactual) premises and
conclusions. Morishita et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] assessed inference accuracy when constants and predicates were
replaced with randomly assigned vocabulary, testing whether LLMs could apply logical rules without
relying on pre-trained knowledge. The present study is similar to Wu et al. in using RDFS entailment
rules to evaluate LLMs, but it employs RDFS triples as the output format and evaluates inference tasks
involving multiple rules, enabling a more rigorous and quantitatively precise assessment.</p>
      <p>Table 1 lists the six evaluated RDFS entailment rules: rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, and rdfs11.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Methodology</title>
      <sec id="sec-5-1">
        <title>3.1. RDFS Entailment Rules</title>
        <p>We select six RDFS entailment rules that are frequently used in practical Semantic Web settings (Table 1).
These rules are well-suited for assessing LLM inference from a practical perspective. We further define
13 additional composite rules by combining two rules (rdfs2+3, rdfs2+7, rdfs2+9, rdfs3+7, rdfs3+9,
rdfs5+7, rdfs9+11) and three rules (rdfs2+3+7, rdfs2+3+9, rdfs2+5+7, rdfs2+9+11, rdfs3+5+7, rdfs3+9+11),
yielding 19 rules in total to evaluate inference patterns of varying complexity.</p>
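The six rules can be illustrated as pattern matching over (subject, predicate, object) triples, with composite rules formed by chaining single-rule applications. The following is a minimal pure-Python sketch for exposition only, not the paper's evaluation code; the names `apply_rule` and `apply_composite` are illustrative.

```python
# Sketch of the six RDFS entailment rules over triples represented as
# (subject, predicate, object) tuples. Namespaces are abbreviated.

TYPE, DOMAIN, RANGE = "rdf:type", "rdfs:domain", "rdfs:range"
SUBPROP, SUBCLASS = "rdfs:subPropertyOf", "rdfs:subClassOf"

def apply_rule(rule, triples):
    """Return the new triples entailed by one application of `rule`."""
    out = set()
    for (s, p, o) in triples:
        for (s2, p2, o2) in triples:
            if rule == "rdfs2" and p2 == DOMAIN and p == s2:
                out.add((s, TYPE, o2))        # domain of p types the subject
            elif rule == "rdfs3" and p2 == RANGE and p == s2:
                out.add((o, TYPE, o2))        # range of p types the object
            elif rule == "rdfs5" and p == SUBPROP and p2 == SUBPROP and o == s2:
                out.add((s, SUBPROP, o2))     # subPropertyOf is transitive
            elif rule == "rdfs7" and p2 == SUBPROP and p == s2:
                out.add((s, o2, o))           # lift triple to the super-property
            elif rule == "rdfs9" and p == TYPE and p2 == SUBCLASS and o == s2:
                out.add((s, TYPE, o2))        # subclass instance is superclass instance
            elif rule == "rdfs11" and p == SUBCLASS and p2 == SUBCLASS and o == s2:
                out.add((s, SUBCLASS, o2))    # subClassOf is transitive
    return out - set(triples)

def apply_composite(rules, triples):
    """One pass per rule, in order; ordering matters for chained compositions
    such as rdfs2+3+7 (apply rdfs7 before the typing rules)."""
    known = set(triples)
    for r in rules:
        known |= apply_rule(r, known)
    return known - set(triples)
```

For example, from the premises `(ex:author rdfs:domain ex:Person)` and `(ex:alice ex:author ex:book1)`, a single application of rdfs2 yields `(ex:alice rdf:type ex:Person)`.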
      </sec>
      <sec id="sec-5-2">
        <title>3.2. Evaluation Datasets</title>
        <p>To systematically evaluate LLM inference capabilities, we conduct evaluations under diverse dataset
conditions.</p>
        <p>
          Real-world knowledge dataset (RK): Constructed from Linked Open Data sources including
DBpedia [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], Wikidata [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and Linked Open Vocabularies [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Counterfactual datasets: Eight variants created primarily by systematically modifying real-world
knowledge data:
• S: Swaps subjects/objects, property domains/ranges, and reorders class/property hierarchies
• NA/NR: Adds “not” prefix to all/random classes and properties
• SNA/SNR: Combines S with NA/NR modifications
• GS: Shuffles all resources, ensuring none remain in their original positions
• GSC: Applies GS then renames resources using DBpedia’s type-appropriate naming conventions
(PascalCase for classes, Upper_Snake_Case for instances, camelCase for properties)
• RND: Creates new triples using randomly assigned DBpedia vocabularies by resource type [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
Random symbolic dataset (NS): Triples formed from 8-character alphanumeric strings with no
semantic meaning.
        </p>
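The GS and GSC transformations can be sketched as follows. The paper does not give implementation details, so this is a hypothetical sketch: `gs_shuffle` realizes the "no resource in its original position" constraint as a derangement, and `rename` applies the stated DBpedia-style conventions (PascalCase classes, Upper_Snake_Case instances, camelCase properties); both names are illustrative.

```python
import random

def gs_shuffle(resources, rng=random.Random(0)):
    """GS-style shuffle: resample until a derangement is found, so that no
    resource keeps its original position."""
    items = list(resources)
    while True:
        perm = items[:]
        rng.shuffle(perm)
        if all(a != b for a, b in zip(items, perm)):
            return dict(zip(items, perm))  # original -> shuffled position

def rename(label, kind):
    """GSC-style renaming into DBpedia-like conventions by resource type."""
    words = label.replace("_", " ").split()
    if kind == "class":                       # PascalCase, e.g. HigherEducation
        return "".join(w.capitalize() for w in words)
    if kind == "instance":                    # Upper_Snake_Case, e.g. Higher_Education
        return "_".join(w.capitalize() for w in words)
    # property: camelCase, e.g. higherEducation
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])
```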
      </sec>
      <sec id="sec-5-3">
        <title>3.3. Task Levels and Prompt Design</title>
        <p>We define three task levels for a common inference task by varying how the inference rules are presented.
The prompts provide inference rules and premise knowledge, instructing LLMs to perform inference
based solely on the given premises. Output formats are strictly specified to ensure consistent evaluation.
Each task level is illustrated below with an example that combines rdfs2, rdfs3, and rdfs7 as the
target inference rules:</p>
        <p>Lv1: Single composite rule provided; model applies it to premise knowledge. For example, rdfs2_3_7
(Table 2).</p>
        <p>Lv2: Only the necessary rules from Table 1 are presented separately (e.g., rdfs2, rdfs3, and rdfs7);
model applies all given rules.</p>
        <p>Lv3: Complete set of six rules (Table 1) provided; model must select relevant rules (e.g., rdfs2, rdfs3,
and rdfs7 from the six available), apply them, and report which rules were used.</p>
        <p>Strict prompt templates define output formats to prevent the incorporation of background knowledge
and to enable efficient result extraction.</p>
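The prompt construction described above might look like the following. The exact wording used in the study is not reproduced here, so this template and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical prompt template in the spirit of Section 3.3: rules and premises
# are injected, and the output format is strictly constrained.
PROMPT = """You are given RDFS inference rules and premise triples.
Infer new triples using ONLY the given premises and rules.

Rules:
{rules}

Premises:
{premises}

Output format: one triple per line, exactly as "<subject> <predicate> <object>".
Output only the inferred triples, nothing else."""

def build_prompt(rules, premises):
    """Fill the template with rule descriptions and (s, p, o) premise triples."""
    return PROMPT.format(
        rules="\n".join(rules),
        premises="\n".join(f"{s} {p} {o}" for s, p, o in premises),
    )
```

At Lv1 `rules` would hold one composite rule, at Lv2 only the necessary single rules, and at Lv3 all six rules plus an instruction to report which rules were used.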
      </sec>
      <sec id="sec-5-4">
        <title>3.4. Evaluation Metrics</title>
        <p>
          We extract inferred triples from model outputs and compare them with conclusions from an RDFS
reasoner using Apache Jena Inference API [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. To mitigate superficial variations, each triple component
is matched using a 0.95 string-similarity threshold. Precision, recall, and F1 score are then computed.
For Lv3, rule-name selection is further evaluated by exact string matching.
        </p>
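The matching and scoring step can be sketched as follows. The paper specifies a 0.95 string-similarity threshold but not the similarity measure, so this sketch assumes a `SequenceMatcher`-style ratio; the function names are illustrative.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.95):
    """Fuzzy string match used to absorb superficial variations (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def triple_match(t1, t2, threshold=0.95):
    """Two triples match if all three components match pairwise."""
    return all(similar(a, b, threshold) for a, b in zip(t1, t2))

def prf1(predicted, gold, threshold=0.95):
    """Precision/recall/F1: a predicted triple counts if it fuzzily matches
    some reasoner-derived gold triple, and vice versa."""
    tp_pred = sum(any(triple_match(p, g, threshold) for g in gold) for p in predicted)
    tp_gold = sum(any(triple_match(g, p, threshold) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```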
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Experimental Results</title>
      <p>We evaluated GPT-4o mini (gpt-4o-mini-2024-07-18) and GPT-4o (gpt-4o-2024-08-06) on 19 entailment
rules across all datasets and task levels. For RK and counterfactual datasets (excluding RND), 400
samples per rule were used; 100 samples per rule were used for RND and NS.</p>
      <sec id="sec-6-1">
        <title>4.1. Inference Performance</title>
        <p>Table 3 shows average F1 scores by rule type (1-rule: single RDFS entailment rule, 2-rule: combinations
of two rules, 3-rule: combinations of three rules) and task level. GPT-4o consistently outperformed
GPT-4o mini across all conditions, maintaining high inference accuracy even under complex conditions
(3-rule and Lv3). Both models showed decreased F1 scores as task complexity increased, with GS
consistently recording the lowest scores, while GSC consistently outperformed GS. For GPT-4o mini,
performance dropped significantly under the 3-rule Lv3 condition (F1=0.30 for GS, 0.38 for GSC), whereas
GPT-4o maintained relatively high performance (F1=0.71 for GS, 0.82 for GSC under the same condition).</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Rule Selection Performance</title>
        <p>Table 4 shows average F1 scores for rule selection accuracy in Lv3 tasks. GPT-4o demonstrated high
rule selection accuracy, with F1 scores above 0.89 for all datasets. In contrast, GPT-4o mini showed
an unusual trend: lower rule selection accuracy in 1-rule tasks compared to 2-rule and 3-rule tasks.
An analysis of error cases suggests that this counterintuitive trend stems from the model frequently
inferring knowledge not present in the prompt, leading to the incorrect application of additional rules.
To illustrate this behavior, Table 5 shows a case where GPT-4o mini incorrectly applied rdfs3 in addition
to the required rdfs2. Although no rdfs:range information was provided, the model supplemented it
from context and produced an extra inference. Such overgeneralization lowers rule selection accuracy,
especially under the 1-rule condition.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Discussion and Conclusion</title>
      <p>Both models achieved high inference accuracy on real-world knowledge tasks. GPT-4o maintained stable
F1 scores even in complex settings (3-rule, Lv3), demonstrating effective rule-based logical inference.
GPT-4o consistently outperformed GPT-4o mini, suggesting that a larger parameter count and broader
training data enable more complex reasoning.</p>
      <p>However, GPT-4o mini tended to supplement missing knowledge even in tasks requiring strict
rule-based inference. While logically inconsistent, this behavior may prove advantageous in real-world
applications where knowledge incompleteness is unavoidable.</p>
      <p>Inference accuracy was consistently lowest on the GS dataset across all conditions, with GSC showing
relatively low but better performance than GS. This suggests LLMs rely on lexical cues and naming
conventions in resource names as heuristics for inference. GPT-4o’s results indicate that increased
model scale can reduce this reliance and enhance structurally grounded inference.</p>
      <p>These findings highlight both strengths and limitations of LLMs in ontological inference. While LLMs
demonstrate strong potential for RDFS reasoning in realistic settings, their weaker performance under
counterfactual conditions and reliance on linguistic patterns highlight critical areas for improvement,
particularly in generalizing to unfamiliar data. Future work should focus on analyzing LLMs’ inference
processes in greater detail and extend evaluations with more expressive ontology languages such as
OWL.</p>
      <p>Supplemental Material: Evaluation code, datasets, and detailed results are available at https://
anonymous.4open.science/r/LLM-InferRDFS-MultiLevelEval-EBFC/.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by JSPS KAKENHI Grant Numbers 23K11221 and 25K03232, and was partially
supported by NEDO under Grant Numbers JPNP20006 and JPNP25006.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI ChatGPT and Anthropic Claude to
check grammar and wording and to support translation. After using these tools, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.C.C.</given-names>
          </string-name>
          :
          <article-title>Towards reasoning in large language models: A survey</article-title>
          .
          <source>In: Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          , pp.
          <fpage>1049</fpage>
          -
          <lpage>1065</lpage>
          (
          <year>2023</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Morishita</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morio</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sogawa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>JFLD: A Japanese benchmark for deductive reasoning based on formal logic</article-title>
          .
          <source>In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , pp.
          <fpage>3034</fpage>
          -
          <lpage>3049</lpage>
          (
          <year>2024</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Do PLMs Know and Understand Ontological Knowledge?</article-title>
          <source>In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pp.
          <fpage>3080</fpage>
          -
          <lpage>3101</lpage>
          . Association for Computational Linguistics (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>R.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macbeth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Schema.org: evolution of structured data on the web</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>59</volume>
          (
          <issue>2</issue>
          ),
          <fpage>44</fpage>
          -
          <lpage>51</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ozeki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hisamoto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshinaga</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>NeuBAROCO: A Japanese dataset for evaluation of syllogistic reasoning ability of language models</article-title>
          .
          <source>In: Proceedings of the 30th Annual Meeting of the Association for Natural Language Processing</source>
          , pp.
          <fpage>1776</fpage>
          -
          <lpage>1781</lpage>
          (
          <year>2024</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          .
          <source>In: The Semantic Web. ISWC</source>
          <year>2007</year>
          ,
          <article-title>ASWC 2007</article-title>
          .
          <article-title>LNCS</article-title>
          , vol.
          <volume>4825</volume>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          . Springer, Heidelberg (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Vandenbussche</surname>
            ,
            <given-names>P.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atemezing</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poveda-Villalón</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vatant</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Linked Open Vocabularies (LOV): A gateway to reusable semantic vocabularies on the Web</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>437</fpage>
          -
          <lpage>452</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>Apache Software Foundation</string-name>
          :
          <article-title>Reasoners and rule engines: Jena inference support</article-title>
          . https://jena.apache.org/documentation/inference/ (accessed July
          <year>2025</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>