<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Adaptive Knowledge Structuring by Multi-Agent Consensus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takahiro Kobayashi</string-name>
          <email>tkhr.kobayashi@ntt.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makoto Nakatsuji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Knowlede Graph, Large Language Models, LLM-based Agents, Human-AI Collaboration, Multi-Agent Systems</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NTT, Inc. Human Informatics Laboratories</institution>
          ,
          <addr-line>1-1, Hikari no Oka, Yokosuka city, Kanagawa</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>As LLM-based agents become integral to workflows, enabling accurate task execution through collaboration between multiple agents and humans has emerged as a critical research challenge. Knowledge graphs (KGs) support semantic consistency but are often incomplete in dynamic domains. We propose a method combining (1) autonomous extraction of structured knowledge from inter-agent discussions, (2) integration into a shared evolving KG, and (3) consensus-driven refinement during task execution. Preliminary experiments compared two configurations: one implementing all components and another omitting consensus. Incorporating consensus increased non-taxonomic relations by 15% and improved relation precision from 79% to 97%, confirming its effectiveness. Future work will examine how refined knowledge impacts overall task accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As AI agents powered by large language models (LLMs) become increasingly integrated into real-world
workflows, there is growing demand for accurate and reliable task execution. Collaborative frameworks
involving multiple agents [
        <xref ref-type="bibr" rid="ref1">1, 2, 3</xref>
        ] and human-in-the-loop approaches [4] have been proposed to align
task outputs with human intent. While many tasks require up-to-date domain knowledge, LLMs only
store knowledge learned during pretraining. Fine-tuning is one way to incorporate new knowledge, but
it demands enormous time and computational resources, making it impractical for frequent updates.
      </p>
      <p>Retrieval-Augmented Generation (RAG) [5] addresses this by combining LLMs with external
knowledge bases. Among candidates, vector-search-based document databases [6] and knowledge graphs
(KGs) [7] are promising. Vector-based stores retrieve semantically similar documents but cannot
guarantee accurate relationships. In contrast, KGs explicitly represent semantic relations, making them
suitable for complex reasoning and consistency checks.</p>
      <p>To utilize KGs in RAG, it is necessary to construct the graph itself. Public KGs such as Wikidata are
often insufficient for domains with fluid structures, where knowledge must adapt to emerging entities
and shifting relationships. Agents must therefore update and reorganize knowledge representations to
respond effectively to changing requirements.</p>
      <p>
        This study proposes a framework to enhance task accuracy by organizing and sharing knowledge
extracted from input materials. It consists of three components: (1) autonomous extraction of structured
knowledge; (2) construction of a shared KG; and (3) consensus-based refinement through “KG update
meetings.” To validate the approach, we conducted experiments using ACT-generated dialogue logs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
as input. Results show that consensus-based refinement increased the quantity of non-taxonomic
relations by 15% and improved precision from 79% to 97%.
      </p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>2.1. Terminology</title>
        <p>To ensure clarity, we define key terms used throughout this paper:
Task Resolution. The process of generating an output artifact that satisfies a user-specified task
instruction. Task resolution involves interpreting the input task and producing a coherent deliverable
aligned with the given requirements.</p>
        <p>Agent. In this study, an agent refers to an AI entity that performs a specific role based on a consistent
objective and persona. Each agent operates by selectively employing multiple LLM prompts to
accomplish its assigned responsibilities within the collaborative workflow.</p>
        <p>Input Materials. We use the term input materials to denote the data targeted for knowledge extraction
during KG updates. Typically, these materials consist of textual resources—structured or
unstructured—obtained through autonomous research activities such as web searches, database queries, or
interactions with external expert agents.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs4OL</title>
        <p>The LLMs4OL framework [8] investigates the hypothesis that LLMs, known for their ability to capture
complex linguistic patterns, can effectively support ontology learning (OL) tasks without extensive
domain-specific training. Specifically, LLMs4OL evaluates multiple LLM families under zero-shot
prompting for three core OL tasks:
• Term Typing: Assigning ontological types to extracted terms.
• Taxonomy Discovery: Identifying hierarchical relationships among concepts.
• Non-Taxonomic Relation Extraction: Detecting semantic relations beyond hierarchical
structures.</p>
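        <p>As an illustration of zero-shot prompting for these three OL tasks, one might construct prompts along the following lines. The wording is hypothetical, not the prompts used by LLMs4OL or by this paper:</p>
        <preformat>
```python
# Illustrative zero-shot prompt templates for the three LLMs4OL tasks.
# Prompt wording here is a sketch, not the prompts evaluated in the paper.

def term_typing_prompt(term: str) -> str:
    """Ask the model to assign an ontological type to an extracted term."""
    return (f"Assign the most appropriate ontological type to the term "
            f"'{term}'. Answer with a single type name.")

def taxonomy_prompt(a: str, b: str) -> str:
    """Ask whether concept a stands in an IS-A relation to concept b."""
    return f"Is '{a}' a subclass of '{b}'? Answer yes or no."

def relation_prompt(head: str, tail: str) -> str:
    """Ask for a non-taxonomic semantic relation between two terms."""
    return (f"State the semantic (non-hierarchical) relation, if any, "
            f"between '{head}' and '{tail}'.")
```
        </preformat>
        <p>Each template targets one task in isolation, mirroring the zero-shot setting: no in-context examples or domain-specific fine-tuning are assumed.</p>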
        <p>Empirical results show that LLMs generalize effectively across diverse ontological domains, including
lexical (WordNet), geographical (GeoNames), and biomedical (UMLS), offering a scalable alternative to
manual curation.</p>
        <p>
          Our research builds upon this paradigm by adopting the same three tasks as foundational components
of our methodology for building KGs. By leveraging LLMs in these processes, we aim to enhance semantic
consistency, aligning with the principles established by LLMs4OL.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. ACT</title>
        <p>
          Our research assumes a framework in which multiple agents autonomously conduct research and
collaboratively generate outputs, similar in spirit to ACT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. ACT operationalizes complex task
execution through a four-phase workflow:
1. Agent Generation: The input task is decomposed into subtasks, and agents are dynamically
instantiated and assigned responsibility for each subtask.
2. Team Meeting: Agents engage in a collaborative discussion to refine the overall task design
and their respective subtasks. A randomly selected leader fuses individual task proposals into
a coherent team-level plan, ensuring feasibility and alignment; agents also acquire structured
knowledge of others’ approaches.
3. Break Time: Each agent independently analyzes its contribution and deepens its expertise. To
that end, agents dynamically generate expert agents and conduct focused interviews to expand
task-relevant knowledge.
4. Production Meeting: Agents share accumulated insights and integrate their contributions to
produce the final output aligned with the team-level task.
        </p>
        <p>Inspired by this paradigm, our study proposes a method for consolidating the outcomes of agents’
discussions and the findings from their autonomous research into a formal KG. Rather than limiting
knowledge to ACT’s internal representation for coordination, we leverage agents’ interactions and
external knowledge sources to populate and refine a KG.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of Task Resolution</title>
        <p>Let T denote an input task specification and Y the desired final output. The framework resolves T
through four phases, while maintaining a persistent knowledge graph (KG) that evolves across tasks:
1. Agent Generation. The task T is decomposed into subtasks S = {s_1, …, s_n}. For each s_i, we
instantiate an agent a_i with a persona p_i and a role-specific task descriptor t_i, forming a team
A = {a_1, …, a_n}.
2. Individual Research. Each agent autonomously gathers input materials M_i (e.g., web/DB
documents or outputs from external agents) guided by (p_i, t_i) and a KG-conditioned view V_i of
the current graph G^(k). Formally,
V_i = Select(G^(k), p_i, t_i),
M_i = Research(p_i, t_i, V_i).
3. KG Update Meeting. Agents propose knowledge extractions from {M_i} under a consensus
protocol (Sec. 3.3), yielding a consolidated update ΔG and an updated graph G^(k+1).
4. Output Production Meeting. Referencing G^(k+1), agents iteratively co-author the final output
Y under the same consensus protocol.</p>
        <p>The KG is reused across tasks and updated monotonically with conflict handling, enabling long-horizon
knowledge reuse and consistent decision-making. This study focuses on KG generation and does not
specify details of individual research or output production beyond aspects relevant to KG construction.</p>
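        <p>The four phases above can be sketched as a single orchestration loop. This is an illustrative skeleton, not the authors’ implementation: the decomposition, Select, Research, and the two meetings are passed in as callables standing in for LLM-driven steps.</p>
        <preformat>
```python
# Orchestration sketch of the four-phase task-resolution loop (Sec. 3.1).
# All callables are hypothetical stand-ins for LLM-backed operators.

def resolve_task(task, kg, decompose, select, research,
                 kg_update_meeting, production_meeting):
    # 1. Agent Generation: decompose the task and instantiate agents,
    #    each represented by a (persona, task descriptor) pair.
    agents = decompose(task)
    # 2. Individual Research: each agent gathers materials M_i guided
    #    by its KG-conditioned view V_i = Select(G, p_i, t_i).
    materials = []
    for persona, descriptor in agents:
        view = select(kg, persona, descriptor)
        materials.append(research(persona, descriptor, view))
    # 3. KG Update Meeting: consensus-driven extraction yields G^(k+1).
    kg = kg_update_meeting(kg, materials)
    # 4. Output Production Meeting: agents co-author Y against G^(k+1).
    return production_meeting(kg, agents), kg
```
        </preformat>
        <p>Because the KG is threaded through the loop and returned alongside the output, it persists across tasks, matching the monotonic reuse described above.</p>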
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Knowledge Extraction Pipeline</title>
        <p>The knowledge graph at iteration k is represented as G^(k) = (C^(k), T^(k), R_NT^(k), R_TAX^(k)), where:
• C^(k): set of terms (entities/concepts),
• T^(k): set of types (the first-level classes of the taxonomy),
• R_NT^(k): set of non-taxonomic relations,
• R_TAX^(k): set of taxonomic (IS-A) relations.</p>
        <p>Given input materials M collected by agents, the update process for G^(k) consists of the following
steps:</p>
        <p>(i) Term Extraction. Extract domain-specific terms from M: C_cand = f_term(M). Merge with existing
terms: C^(k+1) = C^(k) ∪ C_cand.</p>
        <p>(ii) Non-Taxonomic Relation Extraction. Identify semantic relations among terms:
R_NT,cand = f_nt(C^(k+1), M). Merge with existing relations: R_NT^(k+1) = R_NT^(k) ∪ R_NT,cand.</p>
        <p>(iii) Type Extraction. Generate candidate types incrementally: T^(k+1) = f_type(C^(k+1), T^(k), M).
This design allows dynamic type induction without relying on a fixed schema.</p>
        <p>(iv) Type Assignment. Assign each term in C^(k+1) to exactly one type in T^(k+1). Formally,
τ : C^(k+1) → T^(k+1), where τ(c) returns the unique type t ∈ T^(k+1) that best represents term c. If no
suitable type exists, a new type candidate may be introduced during f_type in the next iteration.</p>
        <p>(v) Taxonomic Relation Extraction. For each type t ∈ T^(k+1), extract hierarchical relations using
R_TAX,t = f_tax(t, {c ∈ C^(k+1) | τ(c) = t}). Aggregate all per-type results:
R_TAX^(k+1) = ⋃_{t ∈ T^(k+1)} R_TAX,t.</p>
        <p>The updated knowledge graph is G^(k+1) = (C^(k+1), T^(k+1), R_NT^(k+1), R_TAX^(k+1)).
Conflict resolution and structural invariants (e.g., acyclicity for R_TAX) are enforced during merge.</p>
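        <p>Steps (i)–(v) can be expressed compactly over set-based containers. The sketch below assumes the extraction operators f_term, f_nt, f_type, f_tax and the type-assignment function τ are supplied externally; in the actual system these would be LLM calls, and the conflict-aware merge is elided:</p>
        <preformat>
```python
# One iteration of the knowledge extraction pipeline (Sec. 3.2), with the
# graph as a tuple (terms, types, non-taxonomic, taxonomic) of sets.
# The operators are parameters standing in for LLM-backed extraction.

def update_kg(kg, materials, f_term, f_nt, f_type, f_tax, assign_type):
    terms, types, r_nt, r_tax = kg
    # (i) Term extraction, merged by set union with existing terms.
    terms = terms | f_term(materials)
    # (ii) Non-taxonomic relation extraction, merged with existing relations.
    r_nt = r_nt | f_nt(terms, materials)
    # (iii) Incremental type extraction -- no fixed schema assumed.
    types = f_type(terms, types, materials)
    # (iv) Assign each term exactly one type (the map tau).
    tau = {t: assign_type(t, types) for t in terms}
    # (v) Per-type taxonomic relations, aggregated over all types.
    r_tax = set()
    for ty in types:
        r_tax |= f_tax(ty, {t for t in terms if tau[t] == ty})
    return (terms, types, r_nt, r_tax)
```
        </preformat>
        <p>Representing each component as a set makes the merge steps idempotent, so re-processing the same material cannot duplicate entries; conflict resolution and the acyclicity check would wrap the unions in a full implementation.</p>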
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-Agent Consensus Algorithm</title>
        <sec id="sec-3-3-1">
          <title>Section 3.2.</title>
          <p>The consensus mechanism operates as an internal control loop within each stage of the knowledge
extraction pipeline described in Section 3.2. Specifically, whenever a pipeline step—such as term
extraction, type induction, or relation identification—is executed, multiple agents collaborate to refine
the intermediate outputs before committing them to the evolving knowledge graph. Agent state.
Each agent   is parameterized by   = (  ,   ), where   encodes its persona and   specifies the
roledependent extraction task. The task descriptor determines which pipeline function the agent invokes,
e.g., Extract(  ,   ), where Extract corresponds to one of the operators  term,  type,  nt,  tax defined in
Iterative consensus protocol. For a given pipeline step, the process unfolds as follows:
1. Initial Proposal. Each agent   produces an initial candidate Δ</p>
          <p>designated extraction operator to its local input materials.
2. Opinion Statement. At iteration  , agent   emits an opinion
(0) = Extract(  ,   ), applying its


() denotes elements to add and  

the agent accepts the current state.</p>
          <p>() 
3. Opinion Aggregation. A designated leader ℓ integrates all opinions into a consolidated agenda
= Integrate(  =1 ), summarizing required modifications and additions.</p>
          <p>() denotes elements to modify in the current shared candidate set,</p>
          <p>() ∈ True, False denotes agreement flag indicating whether
indicated with asterisks(∗), and nodes with relationships changed are indicated with hashes(#).
4. Regeneration. Each agent revises its proposal conditioned on Ω() :
Δ

(+1) = Regenerate(  , Ω</p>
          <p>() ).</p>
          <p>The loop terminates when all agents agree (∀ ∶  () = True) or the maximum iteration count  max is
reached (three in our experiments). The finalized proposals are then merged into the global knowledge
graph using a conflict-aware operator. This design ensures that consensus-driven refinement occurs at
every pipeline stage, yielding a more coherent and structurally consistent  
(+1) .</p>
        </sec>
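        <p>The protocol reduces to a propose/opine/integrate/regenerate loop with an all-agree exit condition. The callables below are hypothetical stand-ins for the LLM-backed operators; only the control flow follows the description above, with r_max = 3 as in the experiments:</p>
        <preformat>
```python
# Sketch of the iterative consensus protocol (Sec. 3.3).
# extract, opine, integrate, and regenerate are stand-in callables.

def consensus(agents, extract, opine, integrate, regenerate, r_max=3):
    # Initial proposals Delta_i^(0) from each agent's extraction operator.
    proposals = {a: extract(a) for a in agents}
    for _ in range(r_max):
        # Each agent states an opinion on the current shared candidates.
        opinions = {a: opine(a, proposals) for a in agents}
        # Terminate early if every agent sets its agreement flag.
        if all(op["agree"] for op in opinions.values()):
            break
        # A designated leader consolidates opinions into an agenda Omega.
        agenda = integrate(opinions)
        # Each agent regenerates its proposal conditioned on the agenda.
        proposals = {a: regenerate(a, agenda) for a in agents}
    return proposals
```
        </preformat>
        <p>Capping the loop at r_max bounds the token cost per pipeline step while still allowing early exit as soon as unanimity is reached.</p>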
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets and settings</title>
        <p>
          We simulated a scenario where input materials arise from agents conducting individual research and
interacting with external agents. For this simulation, we employed the ACT framework to generate
input materials. ACT was executed using the Reddit Creative Writing dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which consists of 6,673
non-factoid QA pairs extracted from Reddit posts and categorized into five domains: Tea, Cafe, Design,
Architecture, and Fashion. From the Tea category, we randomly selected four questions and assigned
each as a subtask to one of four agents. The agents then performed the Team Meeting and Break Time
(cf. Section 2.3) phases of ACT. During Break Time, each agent instantiated a domain-specific expert
agent and conducted an interview of 24–26 turns, resulting in four dialogue logs. These logs were
treated as input materials for our framework (denoted as M_1 through M_4 in Section 3.1). We
conducted two experiments:
        </p>
        <p>Experiment 1: We varied the number of input materials |M| from 1 to 4 to examine how the KG evolves
as more materials are incorporated. In this setting, consensus-based refinement was omitted; initial
proposals Δ^(0) were directly applied to KG updates. Experiment 2: Using all four input materials, we
compared KG construction with and without consensus, focusing on the number and precision of
non-taxonomic relations. Precision was computed as TP/(TP + FP). The authors manually inspected
each input material: a relation was labeled true positive if its semantics were supported by the material,
and false positive otherwise.</p>
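        <p>The metric is the usual precision, TP/(TP + FP). The counts in the example below are illustrative, chosen only to be consistent with the reported 79% and 97% figures rather than taken from the paper’s labeled data:</p>
        <preformat>
```python
# Precision over manually labeled non-taxonomic relations (Sec. 4.1).

def precision(tp: int, fp: int) -> float:
    """Fraction of extracted relations whose semantics the material supports."""
    return tp / (tp + fp)

# Illustrative: 26 of 33 relations supported gives ~0.79;
# 37 of 38 gives ~0.97.
```
        </preformat>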
        <p>All experimental processes were implemented using GPT-4o mini.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment 1: Progressive KG Development</title>
        <p>This experiment aimed to assess the feasibility of incremental knowledge graph (KG) expansion.
We constructed four KGs using one to four input materials to analyze graph evolution. As input
increased, the KG expanded into more detailed structures with non-taxonomic relationships. Ultimately,
nine types were extracted: BrewingFactors, HealthImpact, OdorManagement, QualityAssessment,
SellerCategory, StorageMethod, TeaOrigin, TeaStoragePractices, and TeaType.</p>
        <p>Findings from Structural Evolution. Figure 1 presents the evolution of the hierarchical structure
with respect to the QualityAssessment type as input materials are incrementally incorporated. Our
analysis yields the following three findings: (1) Despite being independently generated, the four graphs
exhibit recurring structural patterns. This indicates a degree of logical reproducibility in how agents
organize knowledge, suggesting that the process is not entirely arbitrary. (2) In contrast, nodes such as
FlavorProfile show high positional variability across graphs. This suggests possible inconsistencies in
terminology or semantic boundaries, highlighting areas where the schema may require refinement. (3)
These observations underscore the importance of evaluating the robustness of the knowledge structure
when incorporating new information. Such evaluation serves as a diagnostic tool for identifying unstable
areas and assessing the overall coherence and validity of the resulting graph.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiment 2: Refining the KG via Multi-Agent Consensus</title>
        <p>We compared KG construction with and without the consensus mechanism. The number of
relations increased from 33 to 38, and the precision improved from 79% to 97%. Agents’ consensus
corrected errors such as reversed entities (e.g., ('BakingSoda', 'uses', 'OdorNeutralization')).
Challenges included concretizing general relations and simplifying overly descriptive types. For
instance, (Darjeeling, hasBrewingTechnique, BrewingTechnique) lacked specific attributes, and
(Tea, benefitsFromStorageTechnique, CoolAndDryPlace) could be better structured. Such
improvements are a subject for future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study proposed a framework in which multiple AI agents collaboratively execute tasks while
constructing and refining a knowledge graph (KG) in domains with fluid knowledge structures. Using
LLM-based knowledge extraction, we built a KG incorporating both taxonomic and non-taxonomic
relations and analyzed its evolution quality through repeated task execution. Our evaluation implies
that the KG structure exhibits varying stability: some components remain consistent across tasks,
while others require refinement. This suggests a strategy for improving reliability by focusing on
unstable regions. We also demonstrated that multi-agent consensus enhances the quality of extracted
knowledge. Specifically, the proportion of logically valid non-taxonomic relationships improved from
79% to 97%, confirming the effectiveness of collaborative refinement. Future work will explore
algorithmic improvements to further enhance accuracy in creating KGs and in resolving tasks using
created KGs.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Microsoft 365 Copilot in order to: drafting content,
grammar and spelling check and content enhancement. After using this service, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.
(Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), Association for Computational Linguistics, 2025, pp. 16831–16861.
[2] I. Abbasnejad, X. Liu, A. Roy, Deciding the path: Leveraging multi-agent systems for solving
complex tasks, in: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)
Workshops, 2025, pp. 4216–4225.
[3] R. Barbarroxa, L. Gomes, Z. Vale, Benchmarking large language models for multi-agent systems:
A comparative analysis of autogen, crewai, and taskweaver, in: P. Mathieu, F. De la Prieta (Eds.),
Advances in Practical Applications of Agents, Multi-Agent Systems, and Digital Twins: The PAAMS
Collection, Springer Nature Switzerland, 2025, pp. 39–48.
[4] W. Takerngsaksiri, J. Pasuksmit, P. Thongtanunam, C. Tantithamthavorn, R. Zhang, F. Jiang, J. Li,
E. Cook, K. Chen, M. Wu, Human-in-the-loop software development agents, in: Proceedings of the
47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, ACM, 2025.
[5] T. T. Procko, O. Ochoa, Graph retrieval-augmented generation for large language models: A survey,
in: 2024 Conference on AI, Science, Engineering, and Technology (AIxSET), 2024, pp. 166–169.
[6] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with gpus, IEEE Transactions on Big</p>
      <p>Data 7 (2021) 535–547.
[7] A. Hogan, E. Blomqvist, M. Cochez, C. D’amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo,
R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen,
J. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, ACM Comput. Surv. 54 (2021).
[8] H. Babaei Giglou, J. D’Souza, S. Auer, Llms4ol: Large language models for ontology learning, in:
T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos, L. Hollink, Z. Kaoudi, G. Cheng, J. Li
(Eds.), The Semantic Web – ISWC 2023, Springer Nature Switzerland, 2023, pp. 408–427.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakatsuji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tateishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujiwara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nomoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <article-title>ACT: Knowledgeable agents to design and perform complex tasks</article-title>
          , in: W. Che,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nabende</surname>
          </string-name>
          , E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2025, pp. 16831–16861.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>