<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Self-Retrieval-Augmented Generation with Divide-and-Conquer for Language Model-based Knowledge Base Construction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingbo He</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ScaDS.AI Dresden/Leipzig</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Dresden</institution>
          ,
          <addr-line>Helmholtzstr. 10, 01069 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Knowledge base construction from language models (LMs) without external retrieval presents unique challenges. Therefore, we present a hybrid, LM-only system for the LM-KBC 2025 challenge [1], which requires constructing knowledge bases using a fixed model (Qwen3-8B) without fine-tuning or external retrieval. Our method combines Self-RAG for general relations with a divide-and-conquer module specialized for awardWonBy. Self-RAG follows a description-first, extraction-second design with strict output specifications (names-only or one-number-only) to reduce reliance on brittle post-hoc cleaning; numeric answers are normalized to a canonical digit form. The divide-and-conquer module aggregates candidates from constrained, names-only subqueries and filters them with a strict name validator. Evaluation uses the organizers' official string-matching metric. On the hidden test leaderboard, our system achieves 2nd place out of 5 participants, and improves macro-F1 from 0.212 (baseline) to 0.405 (+0.194; ∼+91.5% relative improvement), with large gains on companyTradesAtStockExchange (+0.339), personHasCityOfDeath (+0.330), and countryLandBordersCountry (+0.162).</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge base construction</kwd>
        <kwd>Language models</kwd>
        <kwd>Self-RAG</kwd>
        <kwd>Divide-and-Conquer</kwd>
        <kwd>LM-KBC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Prior probing setups simplify the task in several ways; a central simplification is ranking rather than
materialization: systems are rewarded for ranking a gold string highly, not for producing a curated,
disambiguated list of entities that can be directly materialized into a KB.</p>
      <p>
        In contrast, LM-KBC 2025 explicitly removes these simplifications: a subject may stand in relation to
zero, one, or multiple objects, and systems must output disambiguated entities accordingly [<xref ref-type="bibr" rid="ref1">1</xref>]. This
makes the task closer to realistic KB construction, where deciding whether to output anything and how
many objects to output is integral to performance. Prior methods for knowledge extraction from LLMs
face several challenges. First, direct prompting approaches [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] often produce inconsistent output
formats, requiring brittle post-processing pipelines to extract structured answers from free-form text.
Second, single-prompt extraction methods in the LM-KBC line [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] struggle with relations of varying
cardinalities, particularly when distinguishing between zero, one, or many valid objects for a given
subject–relation pair. Third, chain-of-thought and reasoning-based approaches [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
        ] frequently
entangle explanatory text with factual answers, complicating the extraction of clean knowledge base
entries.
      </p>
      <p>Motivated by these challenges, we propose a hybrid system that combines Self-Retrieval-Augmented
Generation (Self-RAG) with a Divide-and-Conquer strategy. For general relations (e.g.,
companyTradesAtStockExchange, countryLandBordersCountry, hasArea, hasCapacity, personHasCityOfDeath), we employ
Self-RAG to elicit and calibrate model-internal knowledge via targeted entity descriptions before answer
generation. For the challenging relation awardWonBy, we adopt a Divide-and-Conquer design that
decomposes the task into smaller, model-friendly subproblems (e.g., award canonicalization, candidate
winner identification, and consolidation), improving both accuracy and robustness. Our implementation
and experimental setup are publicly available.1</p>
      <p>Our contributions are threefold:
1. We introduce a unified hybrid strategy that couples Self-RAG with Divide-and-Conquer to address
diverse relation types under LM-KBC 2025’s realistic, non-simplified cardinality setting.
2. We demonstrate consistent gains over the organizer-provided baseline across multiple relations,
showing that targeted description generation and task decomposition synergize to improve
precision while maintaining recall.
3. We provide relation-wise analyses that illuminate when Self-RAG suffices and when
decomposition is beneficial, offering practical guidance for LM-only KB construction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Retrieval-Augmented Generation (RAG)</title>
        <p>
          Retrieval-Augmented Generation (RAG) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] augments an LLM with a non-parametric memory,
retrieving passages that are fed back into the generator to increase factuality and reduce hallucinations.
Recent surveys [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] systematize the rapidly growing literature, covering naive, advanced, and modular
variants. Asai et al. propose Self-RAG [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], letting the model decide when and what to
retrieve and to critique its own outputs. We draw inspiration from this adaptive retrieval idea but, in
contrast to classical RAG, generate internal entity descriptions rather than relying on an external
corpus, consistent with the LM-KBC 2025 rule that forbids external retrieval: the self-generation
principle of Self-RAG directly inspires our use of internally generated descriptions as context for
extraction.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Divide-and-Conquer Prompting</title>
        <p>
          Decompositional prompting dates back to Chain-of-Thought (CoT) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Least-to-Most strategies. A
simple yet effective variant is Divide-and-Conquer (DaC) prompting. Zhang et al. analyze when DaC is
theoretically beneficial and empirically validate it on arithmetic and fact verification tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Hu et al.
extend the idea to long-horizon decision making, coupling hierarchical RL with an LLM controller [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
Our work adapts DaC to entity-centric knowledge extraction: for the notoriously hard awardWonBy
relation we decompose the query into award canonicalization, candidate enumeration, and consolidation.
We extend these decomposition insights specifically for knowledge extraction, showing that systematic
query decomposition can overcome single-prompt limitations for high-cardinality relations.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Knowledge Extraction from LLMs</title>
        <p>
          Early studies such as LAMA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] viewed knowledge extraction as single-answer probing, focusing on
surface-form matching with single-token objects. While this simplified evaluation, it avoided critical
challenges: determining whether zero, one, or multiple objects exist for a given subject-relation pair, and
handling entity disambiguation [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ]. These simplifications, while useful for initial benchmarking,
do not reflect the complexity of real knowledge base construction.
        </p>
        <p>
          The LM-KBC challenge series has progressively addressed these limitations [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The 2022 edition [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
moved beyond single-answer assumptions, requiring systems to produce actual disambiguated entities.
The 2024 challenge [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] further emphasized handling varying cardinalities and null values—challenges
that directly motivate our hybrid approach. Unlike earlier probing benchmarks, LM-KBC requires
systems to make explicit decisions about whether to output anything and how many objects to return,
closely mirroring real KB construction scenarios.
        </p>
        <p>
          Recent approaches have tackled these challenges through different strategies. Hu et al. introduce
GPTKB [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], constructing large-scale KBs directly from LLMs through extensive materialization. While
GPTKB demonstrates the feasibility of LM-only KB construction, it does not produce canonicalized
relations, nor does it provide a clear evaluation setting. Other work has explored constrained decoding
and structured output generation to ensure consistent formatting, though these often require model
modifications unavailable in our setting.
        </p>
        <p>
          Our system adopts the "LM-only" philosophy while targeting the stricter LM-KBC 2025 setting.
We specifically address three key challenges observed in prior work: (1) output format inconsistency
that necessitates brittle post-processing pipelines, (2) difficulty handling relations with varying
cardinalities—from null values to hundreds of valid objects, and (3) entanglement of explanatory text
with factual answers, particularly problematic when using reasoning-enhanced prompting strategies
[
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 12, 11, 13</xref>
          ]. Our hybrid approach combines targeted description generation (Self-RAG) for standard
relations with systematic decomposition (Divide-and-Conquer) for high-cardinality relations, achieving
robust extraction without external resources or model modifications.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>We use the official LM-KBC 2025 dataset, which provides subject–relation pairs across six relations.
For each relation, the train/validation/test splits contain fixed sets of unique subjects (Table 1). Some
relations allow null values (i.e., a subject may have no valid object), while others are multi-object (e.g.,
awardWonBy). Two relations are numeric (hasArea, hasCapacity), where objects are scalar values rather
than entities.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>We propose a hybrid system that handles different relation types through specialized processing
pipelines. Our approach recognizes that the six relations in LM-KBC 2025 exhibit different extraction
challenges, requiring customized strategies for optimal performance.</p>
      <sec id="sec-4-1">
        <title>4.1. System Architecture Overview</title>
        <p>Our system routes each subject–relation pair to one of two pipelines: Self-RAG for the five general
relations and Divide-and-Conquer for awardWonBy. This design choice addresses the core limitation of
single-shot extraction: awardWonBy requires comprehensive enumeration of large recipient sets that
exceed the effective output capacity of single prompts, while simpler relations with smaller answer sets
benefit from direct extraction without decomposition overhead. Our hybrid approach strategically
allocates extraction complexity based on relation cardinality and the model’s single-shot limitations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Self-RAG Pipeline for General Relations</title>
        <p>Our Self-RAG implementation adapts the retrieve-generate-critique paradigm by generating internal
entity descriptions as retrieval substitutes, consistent with the LM-KBC 2025 constraint prohibiting
external retrieval. Figure 2 details the three-phase process with concrete examples.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Phase 1: Context Generation</title>
          <p>We generate relation-specific entity descriptions using carefully designed prompt templates.</p>
          <p>[Figure 2 walkthrough for (Apple Inc., companyTradesAtStockExchange): description generation
(prompt “Describe Apple Inc. focusing on stock exchange listings...” yields “Apple Inc. is a multinational
technology company...traded on NASDAQ...”); targeted extraction (prompt “Given: [description]. On
which stock exchanges does Apple Inc. trade? List only names, comma-separated.” yields “NASDAQ”);
response processing and validation (remove &lt;think&gt; tags, validate format), producing the final
entity list ["NASDAQ"].]</p>
          <p>Each relation employs a targeted description strategy:</p>
          <p>hasArea: Describe {entity_name} with emphasis on its total area, size
measurements, and spatial dimensions in square kilometers.</p>
          <p>hasCapacity: Describe {entity_name} focusing on its maximum capacity, volume, or
the number of people/items it can hold or accommodate.</p>
          <p>companyTradesAtStockExchange: Describe {entity_name} focusing on which stock
exchanges it is listed on and where its shares are traded.</p>
          <p>countryLandBordersCountry: Describe {entity_name} focusing on which specific
countries it shares land borders with and its neighboring nations.</p>
          <p>personHasCityOfDeath: Describe {entity_name} focusing on where they died, their
place of death, and the city where they passed away.</p>
          <p>These prompts activate relevant parametric knowledge by directing the model’s attention to the
specific factual dimensions required for subsequent extraction.</p>
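          <p>These templates can be held in a relation-indexed mapping. Below is a minimal sketch, assuming a
generic, hypothetical llm callable that sends a prompt to the fixed model and returns its text:</p>
          <preformat>
from typing import Callable

# Relation-specific description templates (wording from Section 4.2.1).
DESCRIPTION_TEMPLATES = {
    "hasArea": (
        "Describe {entity_name} with emphasis on its total area, size "
        "measurements, and spatial dimensions in square kilometers."
    ),
    "hasCapacity": (
        "Describe {entity_name} focusing on its maximum capacity, volume, or "
        "the number of people/items it can hold or accommodate."
    ),
    "companyTradesAtStockExchange": (
        "Describe {entity_name} focusing on which stock exchanges it is "
        "listed on and where its shares are traded."
    ),
    "countryLandBordersCountry": (
        "Describe {entity_name} focusing on which specific countries it "
        "shares land borders with and its neighboring nations."
    ),
    "personHasCityOfDeath": (
        "Describe {entity_name} focusing on where they died, their place of "
        "death, and the city where they passed away."
    ),
}

def generate_description(entity_name: str, relation: str,
                         llm: Callable[[str], str]) -> str:
    """Phase 1: elicit a targeted internal description as a retrieval substitute."""
    prompt = DESCRIPTION_TEMPLATES[relation].format(entity_name=entity_name)
    return llm(prompt)
          </preformat>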
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Phase 2: Targeted Extraction</title>
          <p>We condition extraction queries on generated descriptions using strict format specifications that enforce
direct, unambiguous outputs:</p>
          <p>System Message (All Relations):
“You are a factual assistant. Provide only the requested information without explanations,
uncertainty statements, or additional context. For name lists, provide only names separated by
commas.”</p>
          <p>Extraction Prompt Templates:</p>
          <p>companyTradesAtStockExchange: Given this information about {subject_entity}:
{description} On which stock exchanges does {subject_entity} trade? If
you don’t know or are uncertain, answer ’none’. Otherwise, list all
exchange names without abbreviations, separated by commas.</p>
          <p>countryLandBordersCountry: Given this information about {subject_entity}:
{description} Which countries border {subject_entity}? If you don’t know
or are uncertain about the bordering countries, answer ’none’. Otherwise,
list all country names only, separated by commas.</p>
          <p>personHasCityOfDeath: Given this information about {subject_entity}:
{description} In which city did {subject_entity} die? If you don’t know
or are uncertain about the city, answer ’none’. Otherwise, answer with
only one city name.</p>
          <p>These strict formatting requirements eliminate ambiguity and enforce direct extraction from model
responses, reducing dependency on post-processing for data cleaning.</p>
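          <p>Assembling a Phase-2 query then amounts to filling the Phase-1 description into the relation’s
extraction template. A minimal sketch for companyTradesAtStockExchange, assuming a hypothetical
llm callable that accepts a system message and a user prompt:</p>
          <preformat>
from typing import Callable

SYSTEM_MESSAGE = (
    "You are a factual assistant. Provide only the requested information "
    "without explanations, uncertainty statements, or additional context. "
    "For name lists, provide only names separated by commas."
)

STOCK_EXCHANGE_TEMPLATE = (
    "Given this information about {subject_entity}: {description} "
    "On which stock exchanges does {subject_entity} trade? If you don't "
    "know or are uncertain, answer 'none'. Otherwise, list all exchange "
    "names without abbreviations, separated by commas."
)

def extract_stock_exchanges(subject: str, description: str,
                            llm: Callable[[str, str], str]) -> str:
    """Phase 2: condition the extraction query on the Phase-1 description."""
    prompt = STOCK_EXCHANGE_TEMPLATE.format(subject_entity=subject,
                                            description=description)
    return llm(SYSTEM_MESSAGE, prompt)
          </preformat>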
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Phase 3: Response Processing and Validation</title>
          <p>Our processing pipeline applies minimal cleaning operations: (1) removal of reasoning artifacts
(&lt;think&gt; tags), (2) elimination of uncertainty expressions (“I’m not sure”, “I don’t know”), and (3)
format standardization for consistent output structure. Crucially, the strict prompt design minimizes
the need for extensive post-processing.</p>
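          <p>A minimal sketch of this cleaning step follows; the uncertainty-marker list is illustrative rather than
exhaustive:</p>
          <preformat>
import re

# Illustrative uncertainty markers; the full list is longer in practice.
UNCERTAINTY_MARKERS = ("i'm not sure", "i don't know")

def clean_response(raw: str) -> list:
    """Phase 3: strip reasoning artifacts, map uncertainty to the empty
    answer, and standardize to a list of names."""
    # (1) Remove reasoning artifacts such as &lt;think&gt;...&lt;/think&gt; blocks.
    text = re.sub(r"&lt;think&gt;.*?&lt;/think&gt;", "", raw, flags=re.DOTALL).strip()
    # (2) Treat 'none' and explicit uncertainty as an empty answer.
    lowered = text.lower()
    if lowered == "none" or any(m in lowered for m in UNCERTAINTY_MARKERS):
        return []
    # (3) Standardize: split on commas, trim whitespace, drop empties.
    return [part.strip() for part in text.split(",") if part.strip()]
          </preformat>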
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Divide-and-Conquer Pipeline for awardWonBy</title>
        <p>The awardWonBy relation presents unique challenges: extremely high cardinality (200+ recipients for
major awards), systematic explanation entanglement in single-shot outputs, and temporal complexity
spanning decades. Our Divide-and-Conquer approach decomposes the enumeration task into
manageable, constraint-focused subqueries. Figure 3 illustrates the complete pipeline with actual query
examples.</p>
        <p>[Figure 3 pipeline for an input award name (e.g., Nobel Prize in Physics): query decomposition via
manually predefined categories into temporal slicing (1950s, 1960s, 1970s, ...; e.g., “Who won Nobel Prize
in Physics in 1980s? Names only, no years.” → Georg Bednorz, ...), geographic slicing (American,
German, ...; e.g., “Who are American recipients of Nobel Prize in Physics? Names only.” → Jack
Steinberger, ...), and direct enumeration (“Complete list of Nobel Prize in Physics winners. Names
only.” → Georg Bednorz, Jack Steinberger, ...), followed by candidate aggregation, a name validation
filter (length 2–50 chars; capitalized words; exclude years and meta-words), and a deduplicated final
list.]</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Query Decomposition Strategy</title>
          <p>We employ manually predefined categories to ensure systematic coverage and reproducible results,
avoiding the variability introduced by LLM-generated category schemes.</p>
          <p>Temporal Slicing: We partition queries into eight decade-based categories: 1950s, 1960s, 1970s,
1980s, 1990s, 2000s, 2010s, 2020s. Example prompt:
“List all recipients of the {award_name} in the {decade}. Names only, no years, no explanations.
Format: Name1, Name2, Name3”</p>
          <p>Geographic Slicing: We use nine predefined nationality categories: American, British, German,
French, Italian, Japanese, Canadian, Chinese, plus “other” for comprehensive coverage. These categories
serve as an initial implementation for decomposing queries by geographic dimension. Future work could
explore data-driven or dynamic category selection based on each award’s specific recipient distribution.
Example prompt:
“List all {nationality} recipients of the {award_name}. Names only, no explanations. Format:
Name1, Name2, Name3”
Direct Enumeration: We employ five query formulations as backup strategies:
• “List the names of all {award_name} recipients. Format: Name1, Name2, Name3”
• “{award_name} winners list. Only names separated by commas.”
• “Complete roster of {award_name} laureates. Names only.”
• “All {award_name} recipients in chronological order. Just the names.”
• “Who won {award_name}? List all names without years or descriptions.”</p>
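          <p>Generating the full set of subqueries is mechanical once the categories are fixed. The sketch below
reproduces the 8 temporal, 9 geographic, and 5 direct formulations described above; the wiring is ours,
while the wording follows the templates in this section:</p>
          <preformat>
DECADES = ["1950s", "1960s", "1970s", "1980s",
           "1990s", "2000s", "2010s", "2020s"]
# Eight named nationalities plus "other" as a catch-all category.
NATIONALITIES = ["American", "British", "German", "French",
                 "Italian", "Japanese", "Canadian", "Chinese", "other"]
DIRECT_TEMPLATES = [
    "List the names of all {award_name} recipients. Format: Name1, Name2, Name3",
    "{award_name} winners list. Only names separated by commas.",
    "Complete roster of {award_name} laureates. Names only.",
    "All {award_name} recipients in chronological order. Just the names.",
    "Who won {award_name}? List all names without years or descriptions.",
]

def build_subqueries(award_name: str) -> list:
    """Generate the 8 temporal + 9 geographic + 5 direct subqueries."""
    temporal = [
        f"List all recipients of the {award_name} in the {d}. "
        "Names only, no years, no explanations. Format: Name1, Name2, Name3"
        for d in DECADES
    ]
    geographic = [
        f"List all {n} recipients of the {award_name}. "
        "Names only, no explanations. Format: Name1, Name2, Name3"
        for n in NATIONALITIES
    ]
    direct = [t.format(award_name=award_name) for t in DIRECT_TEMPLATES]
    return temporal + geographic + direct
          </preformat>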
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Name Validation and Aggregation</title>
          <p>We implement a strict, multi-stage validation filter with the following criteria:
Validation Rules: (1) length between 2 and 50 characters; (2) capitalized-word format; (3) exclusion
of years and meta-words.
This filter effectively removes false candidates while preserving valid recipient names.</p>
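          <p>One way to instantiate the validator and aggregation, under a strict reading of the rules above (the
meta-word list here is illustrative, not our exact list):</p>
          <preformat>
import re

# Illustrative meta-words; the deployed list is more extensive.
META_WORDS = {"recipients", "winners", "list", "names", "unknown", "none"}

def is_valid_name(candidate: str) -> bool:
    """Apply the validation rules: length 2-50 chars, capitalized
    words, no embedded years, no meta-words."""
    name = candidate.strip()
    if len(name) not in range(2, 51):       # length rule
        return False
    if re.search(r"\d{4}", name):           # reject embedded years
        return False
    if name.lower() in META_WORDS:          # reject query artifacts
        return False
    # Require every alphabetic word to start with an uppercase letter.
    words = [w for w in re.split(r"[\s\-]+", name) if w and w[0].isalpha()]
    return bool(words) and all(w[0].isupper() for w in words)

def aggregate(candidates: list) -> list:
    """Filter candidates and deduplicate, preserving first-seen order."""
    return list(dict.fromkeys(c.strip() for c in candidates if is_valid_name(c)))
          </preformat>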
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Computational Analysis</title>
        <p>Our hybrid system employs different computational strategies based on relation complexity. All
experiments were conducted on NVIDIA A100-SXM4 Tensor Core GPUs (40 GB HBM2) with AMD
EPYC 7352 CPUs (24 cores @ 2.3 GHz), utilizing 1 GPU and 6 CPU cores per experiment.</p>
        <p>Table 9 presents the empirical timing analysis comparing our hybrid approach against the baseline
system across all relations in the LM-KBC 2025 dataset.</p>
        <p>Self-RAG Efficiency: For the five general relations (companyTradesAtStockExchange,
countryLandBordersCountry, hasArea, hasCapacity, personHasCityOfDeath), Self-RAG incurs only a 1.11×
computational overhead despite requiring two LLM calls per subject–relation pair. This efficiency stems
from the targeted nature of our prompts, which reduce the need for extensive post-processing and retry
mechanisms.</p>
        <p>Divide-and-Conquer Investment: The awardWonBy relation requires a substantial 3.85×
computational investment, reflecting the complexity of comprehensive recipient enumeration through multiple
query dimensions (8 temporal + 9 geographic + 5 direct variants). However, this targeted computational
expenditure yields significant accuracy improvements for the most challenging relation in the dataset.</p>
        <p>Strategic Resource Allocation: Our hybrid approach demonstrates strategic computational
efficiency: while the overall system overhead is 2.04×, the investment is concentrated where it provides
maximum benefit. The modest overhead for general relations (1.11×) combined with targeted
investment for complex enumeration tasks represents an optimal trade-off between computational cost and
accuracy gains.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Prompt Engineering for Direct Extraction</title>
        <p>Our prompt design philosophy prioritizes specification-driven generation over post-hoc cleaning,
addressing a key challenge in existing LM-based knowledge extraction systems: the brittleness of
complex post-processing pipelines. We enforce output structure through explicit formatting instructions
rather than relying on error-prone cleaning mechanisms.</p>
        <sec id="sec-4-5-0">
          <title>4.5.1. Key Design Principles</title>
          <p>1. Explicit Format Specifications: Every extraction prompt includes precise output format
requirements tailored to the expected answer type. For numeric relations, we specify “Answer with one number
only”; for entity lists, “List only names, comma-separated”; for potential null cases, “If not applicable,
answer ‘None’ ”.</p>
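          <p>For instance, the digit-only specification pairs with a light normalization to a canonical digit form
(cf. the abstract). A minimal sketch, with illustrative handling of separators and trailing units:</p>
          <preformat>
import re
from typing import Optional

def normalize_numeric(answer: str) -> Optional[str]:
    """Normalize a numeric answer to canonical digits,
    e.g. '1,600 km²' -> '1600' and '47.0' -> '47'."""
    match = re.search(r"\d[\d,.]*", answer)
    if match is None:
        return None
    digits = match.group(0).replace(",", "").rstrip(".")
    value = float(digits)
    # Drop an integral '.0'-style fraction; keep genuine decimals.
    return str(int(value)) if value.is_integer() else digits
          </preformat>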
        <p>2. Proactive Uncertainty Handling: Rather than allowing the model to generate uncertain or
hedged responses, we provide explicit instructions for knowledge gaps: “If you don’t know or are
uncertain, answer ‘none’ ”. This directly addresses the model’s tendency to provide inferential answers
when facing knowledge limitations.</p>
        <p>3. Minimalist System Messages: We employ consistent, concise system instructions across all
relations: “You are a factual assistant. Provide only the requested information without explanations,
uncertainty statements, or additional context.” This uniform approach eliminates variability in model
behavior across diferent relation types.</p>
          <p>4. Reasoning Suppression: Our prompts explicitly discourage verbose explanations, uncertainty
expressions, and step-by-step reasoning in final outputs. This design choice stems from our observation
that models often mix factual answers with explanatory text, complicating extraction.</p>
        </sec>
        <sec id="sec-4-5-1">
          <title>4.5.2. Cross-relation Generalizability</title>
          <p>The effectiveness of our prompt engineering principles generalizes across diverse relation types and
answer formats. Whether extracting single numeric values (hasArea), entity lists
(countryLandBordersCountry), or handling null cases (personHasCityOfDeath), the specification-driven approach consistently
produces directly usable outputs without relation-specific post-processing adaptations.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>4.5.3. Empirical Validation</title>
          <p>Our systematic error analysis (Section 5.3) provides empirical evidence for the effectiveness of this
approach: we achieve 0% formatting failures across all sampled cases, demonstrating that
specification-driven prompting successfully eliminates technical processing errors. This validates our design
philosophy that prevention through careful prompt design is more reliable than correction through
post-processing.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We follow the official LM-KBC 2025 evaluation protocol: scores are computed using precision, recall,
and F1 metrics with exact string matching, and results are verified on the hidden test leaderboard. Our
evaluation encompasses both quantitative performance analysis and qualitative error investigation to
provide comprehensive insights into system behavior.</p>
      <p>Leaderboard Status: On the hidden test leaderboard as of 2025-08-01, our system achieves
2nd place (out of 5 participants), demonstrating the effectiveness of our hybrid approach on unseen
test data where test labels remain private. This ranking validates our design choices across the diverse
set of relations in the LM-KBC 2025 challenge.</p>
      <sec id="sec-5-1">
        <title>5.1. Quantitative Results</title>
        <p>Performance Analysis: Our hybrid system achieves substantial improvements across all relations on
the hidden test set, with macro F1 increasing from 0.2116 to 0.4052 (∼ +91.5% relative improvement).
The results demonstrate distinct patterns across relation types:
• Exceptional gains on challenging relations: companyTradesAtStockExchange (+0.3387) and
personHasCityOfDeath (+0.3300) show the largest improvements, attributable to Self-RAG’s
description-first prompting strategy which provides crucial context for these domain-specific
queries.
• Strong performance on structured relations: countryLandBordersCountry (+0.1624)
demonstrates consistent improvements over an already strong baseline (0.7025), indicating that Self-RAG
enhances even well-performing baseline approaches.
• Meaningful progress on complex enumeration: awardWonBy (+0.0589) benefits from our
Divide-and-Conquer strategy, though the modest gain reflects the inherent difficulty of
comprehensive recipient enumeration for major awards.
• Consistent improvements on numeric relations: Both hasArea (+0.0700) and hasCapacity
(+0.0700) show identical improvements, likely due to our consistent digit-only normalization
approach, though string-matching evaluation remains sensitive to precision and rounding
differences.</p>
        <p>Precision-Recall Trade-offs: Our system achieves a substantial precision increase (+0.2962) with
minimal recall reduction (0.0242), indicating that our approach successfully reduces false positives
while maintaining coverage. This pattern suggests that our strict output specifications and validation
mechanisms effectively filter unreliable predictions without sacrificing comprehensive knowledge
extraction.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Strategy Selection Analysis</title>
        <p>To validate our hybrid approach, we conducted controlled experiments comparing Self-RAG and
Divide-and-Conquer strategies across different relation types. For token counting, since exact tokenization
requires the use of model-specific tokenizers (e.g., OpenAI’s tiktoken), we estimate the number of tokens
in English text by assuming that one token corresponds to approximately four characters (including
spaces). This heuristic follows OpenAI’s official guideline, which reports that “1 token ≈ 4 characters
of English text” [21].</p>
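        <p>As a worked instance of this heuristic:</p>
        <preformat>
def estimate_tokens(text: str) -> int:
    """Approximate token count via the '1 token ≈ 4 characters' rule of thumb."""
    return round(len(text) / 4)

# e.g. a 140-character prompt is estimated at about 35 tokens.
        </preformat>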
        <sec id="sec-5-2-1">
          <title>5.2.1. Divide-and-Conquer Effectiveness on awardWonBy</title>
          <p>Table 7 presents a comparative analysis between Self-RAG and Divide-and-Conquer (DaC) strategies on
the awardWonBy relation. The results demonstrate a compelling case for using DaC on high-cardinality
enumeration tasks. While Self-RAG achieves only a 0.0369 F1 score, DaC reaches 0.1759, representing a
4.8× improvement. This substantial gain justifies the increased computational cost (25.9× more tokens,
6.1× longer execution time). The low Self-RAG performance confirms that single-query approaches
fundamentally cannot enumerate comprehensive recipient lists, as the model’s single-response capacity
limits it to returning only the most prominent recipients.</p>
        </sec>
        <sec id="sec-5-2-1b">
          <title>5.2.2. Limitations of Divide-and-Conquer on Other Relations</title>
          <p>Table 8 reveals that DaC’s effectiveness is highly relation-specific. For countryLandBordersCountry,
Self-RAG achieves 0.8649 F1 while DaC drops to 0.6201 (-28.3%). Similarly, for companyTradesAtStockExchange,
Self-RAG’s 0.5057 significantly outperforms DaC’s 0.1287 (-74.5%). These results indicate that
decomposition strategies can actually harm performance on medium-to-low cardinality relations.</p>
          <p>As shown in Table 9, when processing both relations, DaC requires 5.7× more tokens (787,667 vs.
137,833) and 7.2× more computation time (27,306s vs. 3,802s) than Self-RAG. This substantial increase
in computational resources, combined with the degraded performance, makes DaC economically
unjustifiable for these relations.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.3. Temporal Granularity Trade-offs</title>
          <p>Table 10 compares decade-based versus year-based temporal decomposition for awardWonBy. While
year-based queries achieve marginally higher F1 (0.1811 vs. 0.1759, +2.9% relative), they require 3.3×
more tokens and 3.1× more time. The minimal F1 improvement of 0.0052 does not justify the
substantial increase in computational resources.</p>
          <p>This analysis supports our decade-based approach as optimal for practical deployment, balancing
effectiveness with efficiency. The diminishing returns from finer granularity suggest that further
decomposition would yield negligible benefits while dramatically increasing costs.</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.4. Implications for System Design</title>
          <p>These findings validate our hybrid architecture that applies strategies based on relation characteristics:
• High-cardinality relations (awardWonBy): Divide-and-Conquer despite computational
overhead
• Medium/low-cardinality relations: Self-RAG for superior eficiency and accuracy
• Temporal granularity: Decade-based decomposition provides the best balance between coverage
and cost</p>
          <p>The results emphasize that no single strategy dominates across all relation types, reinforcing the
need for adaptive, relation-aware approaches in LM-based knowledge extraction.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Error Analysis</title>
        <sec id="sec-5-3-1">
          <title>5.3.1. Error Analysis of Self-RAG Strategy</title>
          <p>Since test set answers are not publicly available, we conduct error analysis exclusively on validation
dataset samples. We manually sample 5 entities for each relation type, focusing solely on incorrect
cases to understand failure patterns. Following a systematic approach, we examine each error through
four potential failure modes: (1) Self-RAG context generation issues, (2) extraction step failures, (3)
formatting problems, and (4) evaluation method limitations. Detailed error cases with model outputs
and gold standards are provided in Appendix B.</p>
          <p>Systematic Analysis Findings: Our layer-by-layer error analysis reveals a clear hierarchy of failure
modes:
1. Technical implementation robustness (0% failures): We can definitively rule out extraction
step failures and formatting issues. Our response processing pipeline successfully extracts
information from model outputs without introducing errors, validating the robustness of our
hybrid architecture’s technical components.
2. Evaluation method appropriateness (4% limitations): Only one case represents a pure
evaluation limitation (“Ivory Coast” vs. “Côte d’Ivoire”), confirming that string-matching evaluation
aligns well with semantic correctness for our task domain.
3. Context generation as primary bottleneck (96%): The overwhelming majority of errors stem
from inadequate context generation, indicating that system improvements should focus on the
initial knowledge activation phase rather than downstream processing.</p>
          <p>Evidence from Model Reasoning Traces: Our system logs reveal the model’s internal reasoning
process, providing direct evidence of inferential behavior, as Appendix A shows. Despite explicit
instructions to “answer ’none’ if uncertain,” the model rarely admits complete ignorance. For example,
when queried about Hopen’s area, the model’s reasoning trace shows: “I need to gather accurate data...
I remember that Hopen is one of the larger islands in Svalbard... From what I can find, Hopen’s area is
approximately 1,600 square kilometers... the consensus is 1,600.” However, the actual area is 47 km²,
demonstrating a 34× overestimation.</p>
          <p>Inferential Reasoning Pattern: This trace reveals critical behavioral patterns: (1) acknowledging
uncertainty while (2) constructing plausible reasoning chains, (3) simulating source consultation,
and (4) expressing false confidence in estimated answers. The model chooses to provide inferential
responses rather than appropriate abstention, indicating underlying knowledge gaps compensated
through sophisticated reasoning.</p>
          <p>Implications for System Design: The error analysis confirms that our hybrid approach
successfully addresses technical extraction and processing challenges, but reveals fundamental limitations in
distinguishing between confident factual knowledge and inferential reasoning. Future improvements
should focus on uncertainty quantification and appropriate abstention mechanisms rather than technical
pipeline enhancements.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Error Analysis of Divide-and-Conquer Strategy</title>
        <p>To understand the failure modes of our Divide-and-Conquer approach, we conduct detailed analysis
using the Max Planck Medal as a representative case study. We examine both the model’s reasoning
process and the systematic patterns across temporal slices.</p>
        <p>Analysis of the model’s internal reasoning for the 1950s query reveals pervasive uncertainty markers
throughout the extraction process. The model explicitly states “I’m not 100% certain” and uses 47
instances of self-correction (“Wait, I’m getting confused”), yet still generates four physicist names. This
behavior—acknowledging uncertainty while producing confident-seeming outputs—represents a critical
failure mode where decomposition provides more opportunities for plausible confabulation rather than
appropriate abstention.</p>
        <p>We evaluate outputs across three accuracy dimensions (detailed results in Table 14 in the Appendix):</p>
        <p>Layer 1 - Domain Coherence: The model maintains 100% domain accuracy across all decades,
consistently generating physicist names. This demonstrates that decomposition preserves conceptual
understanding of the award’s domain.</p>
        <p>Layer 2 - Award Association: Accuracy varies dramatically by era:
• Pre-1990: Mixed performance (50-100% are actual recipients, though often from wrong decades)
• Post-1990: Complete failure (0% are actual Max Planck Medal recipients)
• The model generates plausible physicists (John Bardeen, Steven Weinberg) who never received
this specific award</p>
        <p>Layer 3 - Temporal Precision: Even when identifying actual recipients, temporal placement is
severely compromised. Hans Bethe (1955) appears in both 1970s and 1980s queries; Niels Bohr (1930)
appears in the 1960s query. This suggests the model has fragmented knowledge of recipients but lacks
temporal grounding.</p>
        <p>The most striking pattern is the sharp knowledge degradation around 1990. For earlier decades, the
model retrieves some actual recipients despite temporal confusion. From the 1990s onwards, it generates
exclusively non-recipients, indicating a complete knowledge void rather than retrieval difficulty. This
boundary is consistent across all temporal slices, demonstrating that decomposition cannot compensate
for absent knowledge.</p>
        <p>For 2000s and 2010s queries, the model exceeded token limits by producing verbose explanations (“I
need to list all recipients of the Max Planck Medal in the 2000s...”) instead of the required name-only
format. This suggests that uncertainty triggers extended reasoning despite explicit format constraints,
leading to extraction failures even when the query structure is identical to successful cases.</p>
        <p>Our analysis reveals both the potential and limitations of Divide-and-Conquer:</p>
        <p>Strengths: The strategy successfully surfaces more information than single queries might achieve.
Different temporal prompts activate different memory patterns, helping retrieve recipients like Paul
Dirac and Enrico Fermi who might be missed in monolithic queries.</p>
        <p>Fundamental Limitation: Decomposition amplifies existing knowledge but cannot synthesize
absent information. When the model lacks knowledge (post-1990 recipients), it confidently generates
plausible but incorrect answers for each sub-query, potentially compounding errors through aggregation.</p>
        <p>Key Insight: The effectiveness of Divide-and-Conquer is bounded by the underlying knowledge
availability in the model’s parametric memory. It works best for fragmented knowledge that needs
assembly, not for complete knowledge voids.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We present a hybrid system for LM-only knowledge base construction that strategically combines
Self-RAG for general relations with a specialized divide-and-conquer module for awardWonBy. Our approach
addresses the core challenges of the LM-KBC 2025 setting: constructing disambiguated knowledge
bases from a fixed language model without fine-tuning or external retrieval augmentation.</p>
      <p>In the official evaluation, our hybrid system achieves substantial performance gains across all six
relations, improving macro F1 from 0.212 to 0.405, and securing 2nd place on the hidden test
leaderboard. We obtain consistent improvements with particularly strong gains on challenging relations:
companyTradesAtStockExchange (+0.339), personHasCityOfDeath (+0.330), and countryLandBordersCountry
(+0.162). Our precision increase (+0.296) with minimal recall reduction (0.024) indicates that our
approach successfully filters unreliable predictions while maintaining coverage.</p>
      <p>We make three key contributions to LM-based knowledge extraction. First, we demonstrate that
different relation types require fundamentally different extraction strategies: while Self-RAG’s
description-first approach excels for structured relations through targeted knowledge activation, divide-and-conquer
decomposition is essential for high-cardinality enumeration tasks like awardWonBy. Second, we show
that specification-driven prompt engineering can eliminate formatting errors entirely—our systematic
error analysis reveals 0% formatting failures across all sampled cases, with 96% of errors stemming from
knowledge gaps rather than technical processing issues. Third, our error analysis reveals a fundamental
insight: a primary bottleneck in LM-based knowledge extraction is not the extraction, but rather the
model’s tendency to provide confident inferential answers when facing knowledge gaps. Through direct
examination of model reasoning traces, we demonstrate that models construct plausible but incorrect
responses through sophisticated reasoning rather than admitting uncertainty.</p>
      <p>These findings suggest that uncertainty quantification and appropriate abstention mechanisms are
more critical than advanced post-processing techniques for improving knowledge extraction reliability.
Our hybrid approach demonstrates that strategic combination of complementary techniques can
significantly advance LM-only knowledge base construction, with the insights about inferential reasoning
providing a foundation for developing more reliable knowledge extraction systems that appropriately
handle uncertainty in language models’ parametric knowledge.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Model Reasoning Trace Examples</title>
      <p>To provide concrete evidence of the inferential reasoning patterns identified in our error analysis, we
present actual model reasoning traces captured during our experiments. These traces demonstrate how
the model constructs plausible but incorrect answers when facing knowledge gaps.</p>
      <sec id="sec-8-1">
        <title>A.1. Example: Hopen Island Area Query</title>
        <p>Query: What is the exact area of Hopen in square kilometers?</p>
        <p>Model’s Internal Reasoning Trace:
“I need to gather accurate data on its area. I remember that Hopen is one of the larger islands
in Svalbard... From what I can find, Hopen’s area is approximately 1,600 square kilometers...
Some sources might list it as 1,610 km²... The user wants the exact number, so I need to ensure
it’s the most up-to-date and accurate value... the consensus is 1,600 square kilometers.”
Model Output: 1600
Gold Standard: 47</p>
        <p>Analysis: This trace exemplifies the inferential reasoning pattern where the model: (1) acknowledges
uncertainty (“I need to gather accurate data”), (2) constructs seemingly plausible reasoning (“one of
the larger islands”), (3) simulates source consultation (“From what I can find”), and (4) expresses
false confidence (“the consensus is 1,600”). The 34× overestimation demonstrates how sophisticated
reasoning can lead to confident but dramatically incorrect answers when the model lacks precise factual
knowledge.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Example: Max Planck Medal 1950s Query (Divide-and-Conquer)</title>
        <p>Query: List all recipients of the Max Planck Medal in the 1950s. Names only, no years, no explanations.
Format: Name1, Name2, Name3</p>
        <p>Model’s Internal Reasoning Trace:
“I need to list all recipients of the Max Planck Medal in the 1950s... Wait, I’m getting confused...
I think the Max Planck Medal was awarded in the 1950s to: Paul Dirac (but he received it in
1952)... John von Neumann (1956)... Enrico Fermi (1950)... Edward Teller (1958)... Wait, but
I’m not entirely sure... I’m not 100% certain. However, based on my knowledge, these are the
names.”
Model Output: Paul Dirac, Enrico Fermi, John von Neumann, Edward Teller</p>
        <p>Gold Standard: Peter Debye, James Franck, Gustav Hertz, Paul Dirac, Walther Bothe, Enrico Fermi,
Hans Bethe, Victor Weisskopf, Carl Friedrich von Weizsäcker, Wolfgang Pauli, Oskar Klein</p>
        <p>Analysis: This example illustrates specific error patterns in temporal sub-queries:</p>
        <sec id="sec-8-1-1">
          <title>Error Pattern Identification</title>
          <p>• Partial correct recall: The model correctly identified 2 actual recipients (Paul Dirac, Enrico Fermi)
• Name confabulation: Generated John von Neumann and Edward Teller, who never received
this award
• Incomplete coverage: Retrieved only 2 of 10 actual recipients from the decade
• Uncertainty handling: Despite expressing significant uncertainty throughout reasoning, the
model still produced four names rather than abstaining</p>
          <p>Observed Behavior: The model appears to generate names based on “prominent physicists of the era”
when facing knowledge gaps, mixing correct recipients with plausible but incorrect candidates. The
internal reasoning shows the model attempting to reconstruct information through associative reasoning
(“1956... John von Neumann”) despite acknowledged uncertainty. This suggests potential improvements
through stricter confidence thresholds or additional validation steps within the decomposition pipeline.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Detailed Error Case Analysis</title>
      <p>This appendix provides the complete manual error analysis conducted on validation dataset samples. For
each incorrect case, we present the model’s direct output, the gold standard answer, and our systematic
error classification following the four-category framework described in Section 5.</p>
      <p>[Table (flattened in source): sampled error-case subjects, including DRDGOLD Limited, Iraq, Turkey,
Ethiopia, Burkina Faso, Serbia, Annobón Island, La Digue, Saint Kitts and Nevis, Flinders Island, Goli
otok, Jinshan Sports Centre, Estadio El Birichiche, Stevenson Field, Carrara Indoor Stadium, Estádio da
Gávea, Christoph Eschenbach, Erich Schleyer, Bolesław Zoń, Al Jarreau, and Souleymane Cissé, together
with model outputs, gold standards, and error classifications.]</p>
      <p>[Table 14 (flattened in source): per-decade Max Planck Medal analysis listing the model’s output names
against gold-standard recipients, with domain-coherence assessments (e.g., “All are physicists”) and two
format failures (the 2000s and 2010s queries exceeded the token limit).]</p>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>4th LM-KBC challenge</article-title>
          ,
          <source>in: LM-KBC Challenge @ ISWC</source>
          ,
          <year>2025</year>
          . URL: https://lm-kbc.github.io/challenge2025/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI, GPT-4
          <source>technical report, arXiv:2303.08774</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <source>Qwen3 technical report, arXiv:2505.09388</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2019</year>
          . URL: https://aclanthology.org/D19-1250/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <article-title>Enabling LLM knowledge analysis via extensive materialization</article-title>
          ,
          <source>ACL</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2411.04920.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <article-title>1st LM-KBC challenge</article-title>
          ,
          <source>in: LM-KBC challenge @ ISWC</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3274/paper1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>LM-KBC 2023: 2nd challenge on knowledge base construction from pre-trained language models</article-title>
          ,
          <source>in: LM-KBC Challenge @ ISWC</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3577/paper0.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>Preface: LM-KBC challenge 2024</article-title>
          ,
          <source>in: LM-KBC Challenge @ ISWC</source>
          ,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3853/paper0.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2201.11903.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2205.11916.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schärli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <article-title>Least-to-most prompting enables complex reasoning in large language models</article-title>
          ,
          <source>ICLR</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2205.10625.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Tree of thoughts: Deliberate problem solving with large language models</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2305.10601.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>arXiv:2312.10997</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Self-RAG: Learning to retrieve, generate, and critique through self-reflection</article-title>
          ,
          <source>ICLR</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2310.11511.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>An examination on the effectiveness of divide-and-conquer prompting in large language models</article-title>
          ,
          <source>arXiv:2402.05359</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2402.05359.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning</article-title>
          ,
          <source>ICML</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2505.19761.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Poerner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Waltinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>BERT is not a knowledge base (yet): On the barriers to probing facts</article-title>
          ,
          <source>ACL</source>
          (
          <year>2020</year>
          ). URL: https://aclanthology.org/2020.acl-main.328/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Alivanistos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Báez Santamaría</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Krieken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Thanapalasingam</surname>
          </string-name>
          ,
          <article-title>Prompting as probing: Using language models for knowledge base construction</article-title>
          ,
          <source>arXiv:2208.11057</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2208.11057.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>