<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SPHOTA: Knowledge Graph Structure Prediction with a Hybrid Orientation of Textual Alignment using K-BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>R. P. Bharath Chand</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanju Tiwari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Futures Studies, University of Kerala</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sharda University</institution>
          ,
          <addr-line>Delhi-NCR</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link prediction in domain-specific knowledge graphs presents unique challenges that require both structured data representation and contextual understanding. Existing approaches typically focus either on topological embeddings or language models, but rarely integrate both in a unified framework. This work addresses that gap by proposing a hybrid approach that integrates Knowledge Graph Embedding techniques with Language Models to enhance link prediction performance. Our aim is to enhance link prediction performance in the context of the Second International Biochemical Knowledge Extraction Challenge. The methodology involves evaluating the knowledge integration strategy of K-BERT, combined with the regularization model of the EPHEN method, applied to the NuBBE dataset. The approach focuses on designing an optimal knowledge integration pipeline to improve predictive accuracy. The study anticipates that this integrated framework will advance domain-relevant link prediction and set a foundation for future research into tighter KG-LLM coupling strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Injection</kwd>
        <kwd>K-BERT</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Knowledge Graph Embedding</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Link Prediction</kwd>
        <kwd>Bio-Chemical</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Techniques used for Knowledge Graph Embedding include topology-based methods and language
model-based embedding approaches, each with distinct advantages as shown in Table 1.
Topology-based embeddings offer structured data representation but lack the contextual richness of language
models. Conversely, language models provide context awareness, which structured knowledge graphs
often miss, sometimes leading to erroneous predictions. Therefore, both approaches are complementary
and can enhance tasks such as link prediction when used together.</p>
      <p>
        Traditional embedding methods predominantly focus on network structure, but the EPHEN [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] model
takes a different approach by leveraging a language model to propagate embeddings. It integrates both
event descriptions and their intricate relationships into a low-dimensional vector space, allowing for
smooth and adaptive embedding updates [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This model is currently the only one of its kind applied to
the Knowledge Extraction task in the Biochemical Knowledge Extraction Challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] using the NuBBE
dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Biochemical compounds are central to pharmaceutical innovation, yet their discovery remains
hindered by the fragmented and unstructured nature of textual data in academic publications. Predicting
chemical compounds or other related information from literature requires a nuanced understanding of
contextual relationships between factors such as the plant species, compound name, bioactivity, collection
site, and isolation type. By leveraging unsupervised Knowledge Graph Embedding techniques and
Language Models, we can infer missing information and predict novel associations from semantic
and structural patterns in existing data. In this context, our proposed method investigates the
feasibility of a hybrid model and applies these insights to the BiKE challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to achieve
improved performance through the necessary modifications.
      </p>
      <p>Table 1 outlines how Language Models and Knowledge Graphs offer distinct but complementary
strengths when applied to link prediction tasks. Language Models excel in semantic understanding and
can suggest plausible links by leveraging context. However, they often operate as black boxes with
limited explainability and are prone to hallucinating links that do not exist. In contrast, Knowledge
Graphs use structured, graph-based approaches such as embeddings or path-based methods to predict
links. They tend to be less prone to noise or hallucination, especially when carefully curated.</p>
      <p>Despite these strengths, KGs typically lack contextual awareness unless explicitly modeled. This is
where Language Models can provide added value by incorporating real-world knowledge and contextual
information. Conversely, KGs can help validate or ground the predictions made by Language Models,
making the combined use of both approaches particularly powerful for improving the accuracy and
trustworthiness of link prediction tasks.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Complementary strengths of Language Models and Knowledge Graphs for link prediction.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Aspects</th>
              <th>Language Models</th>
              <th>Knowledge Graphs</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Link Prediction</td>
              <td>Predict links based on semantic understanding and context from training data.</td>
              <td>Predict links using graph-based approaches (e.g., embeddings, path-based methods).</td>
            </tr>
            <tr>
              <td>Explainability</td>
              <td>Low – often black-box predictions.</td>
              <td>High – based on graph paths, schema, ontologies.</td>
            </tr>
            <tr>
              <td>Context Awareness</td>
              <td>Strong: can tell X is likely related to Y because of a contextual specification.</td>
              <td>Weak, unless explicitly modeled.</td>
            </tr>
            <tr>
              <td>Noise / Hallucination</td>
              <td>Prone to hallucination (non-existent links).</td>
              <td>Low if curated properly.</td>
            </tr>
            <tr>
              <td>Complementarity</td>
              <td>Can suggest plausible links beyond existing graph patterns and complement KGs with real-world contextual reasoning.</td>
              <td>Ground those suggestions in facts, help explain or validate predicted links through graph paths, and can act as a filter to validate Language Model predictions.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>This section outlines our approach to enhancing link prediction in a heterogeneous Knowledge Graph.
The main strategy leverages recent advances in knowledge injection, embedding propagation, and
similarity-based retrieval to unify semantic and topological signals within the graph, enabling more
accurate and interpretable predictions.</p>
      <sec id="sec-2-1">
        <title>2.1. Data Input</title>
        <p>
          The dataset is derived from the NuBBE DB [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] natural product database, which was constructed through
manual curation of over 2,000 peer-reviewed scientific articles by domain experts. These articles were
annotated with key biochemical properties including compound name, bioactivity, species of origin,
collection site, and isolation type. The dataset was then transformed into a knowledge graph (KG),
where each scientific paper is represented by a central node (using its DOI), linked to various extracted
properties as well as to topic nodes derived using BERTopic [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], a transformer-based topic modeling
technique. The NatUKE benchmark paper [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] evaluates multiple unsupervised graph embedding
techniques such as DeepWalk, Node2Vec, Metapath2Vec, and EPHEN for their effectiveness in completing
such KGs via similarity-based link prediction, assessing how well each model predicts missing properties
of natural products.
        </p>
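        <p>As a small illustration of the graph layout described above, each paper’s DOI becomes a central node linked to its annotated property and topic nodes. The sketch below uses invented records, not actual NuBBE DB entries:</p>

```python
# Sketch of the paper-centric KG layout (hypothetical records, not real
# NuBBE DB data): each DOI node links to its extracted property nodes.
def build_kg(records):
    """Return the KG as a set of (subject, predicate, object) triples."""
    triples = set()
    for rec in records:
        doi = rec["doi"]
        for prop in ("compound", "bioactivity", "species", "site", "isolation", "topic"):
            if rec.get(prop):
                triples.add((doi, prop, rec[prop]))
    return triples

records = [{
    "doi": "10.1000/example1", "compound": "compound_A",
    "bioactivity": "antifungal", "species": "species_X",
    "site": "site_1", "isolation": "chromatography", "topic": "topic_3",
}]
kg = build_kg(records)  # six triples, all centred on the DOI node
```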
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Knowledge Injection</title>
        <p>Several methodologies exist for integrating KG content into language models, with the majority being
Retrieval-Augmented Generation (RAG) based. RAG methods operate by modifying the input rather
than altering the internal structure of the Language Model. The Language Model remains intact, while
RAG retrieves external knowledge and incorporates it into the input prior to processing. This renders
RAG a non-invasive integration strategy. However, if RAG fails to retrieve pertinent documents that
facilitate prediction, the model is unable to infer missing links independently, particularly when the
relevant knowledge is distributed across multiple documents that exceed the token limit. Consequently,
the efficacy of prediction is limited by the quality of knowledge retrieval prior to Language Model
processing.</p>
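        <p>The non-invasive RAG pattern described above can be sketched as follows; the toy token-overlap retriever and all function names are our own illustration, not any specific library’s API:</p>

```python
# Toy sketch of the RAG pattern: retrieve external knowledge and prepend
# it to the input, leaving the language model itself untouched.
def retrieve(query, documents, k=1):
    """Rank documents by naive token overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def augment_input(query, documents):
    """Prepend retrieved context to the query; the LM is not modified."""
    context = " ".join(retrieve(query, documents))
    return f"Context: {context}\nQuery: {query}"

docs = ["compound_A was isolated from species_X",
        "unrelated text about the weather"]
prompt = augment_input("which species produces compound_A", docs)
```

        <p>If the retriever misses the relevant document, the prepended context is useless and the model cannot recover the link, which is exactly the limitation discussed above.</p>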
        <p>Building on prior research, Cadeddu et al. [6] conducted a comprehensive analysis of different
strategies for injecting structured knowledge into transformer-based language models. Their work
focused on classification tasks; our intention, however, is to repurpose these strategies for
link prediction tasks within hybrid models. All experiments in their study were conducted using BERT
[7] as the base model. Nevertheless, Cadeddu et al. [6] emphasized the importance of exploring the
applicability of these knowledge injection techniques across other contemporary large language models,
such as LLaMA 2 [8]. Here we consider only the BERT Language Model, used through the
K-BERT [9] approach. Table 2 provides an overview of the other approaches, whose combinations
can be tested for feasibility in future work. It is divided into two subtables: one listing the knowledge
injection approaches covered in the analysis, and one listing the candidate Language Models that are
yet to be tested [6]. For our purpose of embedding text from the collected papers, we chose the
K-BERT [9] method and added those embeddings as an attribute for each node of the knowledge graph,
in a similar way to how the EPHEN method used SentenceBERT [10].</p>
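        <p>Attaching the text embeddings as node attributes can be sketched as follows; the deterministic hash-based encoder is only a stand-in for K-BERT, and all names are illustrative:</p>

```python
import hashlib

def stub_embed(text, dim=8):
    """Deterministic stand-in for the K-BERT text encoder (illustrative only)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def attach_embeddings(node_texts):
    """Map each textual node (e.g. a DOI) to an embedding attribute."""
    return {node: {"embedding": stub_embed(text)}
            for node, text in node_texts.items()}

attrs = attach_embeddings(
    {"10.1000/example1": "Isolation of compound_A from species_X"})
```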
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>(a) Knowledge integration methods and (b) Language Models to be applied with them.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Model</th>
                <th>Method</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>DTI</td>
                <td>Direct Text Injection</td>
              </tr>
              <tr>
                <td>PT</td>
                <td>Pretraining on triple text</td>
              </tr>
              <tr>
                <td>MLP</td>
                <td>Using Multilayer Perceptron</td>
              </tr>
              <tr>
                <td>K-BERT</td>
                <td>Integrating triples</td>
              </tr>
            </tbody>
          </table>
          <table>
            <thead>
              <tr>
                <th>Model</th>
                <th>Status</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>BERT</td>
                <td>Tested</td>
              </tr>
              <tr>
                <td>LLaMA 2</td>
                <td>Not yet</td>
              </tr>
              <tr>
                <td>GPT-2</td>
                <td>Not yet</td>
              </tr>
              <tr>
                <td>GPT-J</td>
                <td>Not yet</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Regularization</title>
        <p>
          We adopted the regularization-based embedding propagation method proposed by do Carmo and
Marcacini [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], designed for heterogeneous information networks (HINs) with both textual and non-textual
nodes. This approach employs contextual text embeddings generated using K-BERT for nodes
containing textual data, and propagates these embeddings across the network using a regularization
function. Specifically, it minimizes the distance between connected node embeddings while also
preserving the original semantic representation of text nodes through a tunable regularization term.
This allowed us to unify the representation space across different types of entities in our dataset, thereby
enabling improved downstream learning and inference tasks. The ability to incorporate both topological
and semantic information was effective in enhancing the performance and interpretability of our model.
        </p>
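        <p>The propagation step can be sketched as an iterative update: every node moves toward the average of its neighbours, while textual nodes are additionally anchored to their original K-BERT vectors by a tunable weight. This is a simplified sketch of the regularization idea; the parameter name mu and the update schedule are our assumptions, not the exact formulation of [1]:</p>

```python
def propagate(adj, init, mu=1.0, iters=50):
    """Regularized embedding propagation over a heterogeneous graph.

    adj:  node -> list of neighbour ids
    init: node -> initial embedding (only textual nodes have one)
    Textual nodes are pulled toward neighbours AND their initial vector;
    non-textual nodes simply take the neighbour average.
    """
    dim = len(next(iter(init.values())))
    emb = {n: list(init.get(n, [0.0] * dim)) for n in adj}
    for _ in range(iters):
        new = {}
        for n, nbrs in adj.items():
            avg = [sum(emb[m][d] for m in nbrs) / len(nbrs) for d in range(dim)]
            if n in init:  # textual node: stay close to its original vector
                new[n] = [(avg[d] + mu * init[n][d]) / (1.0 + mu) for d in range(dim)]
            else:          # non-textual node: neighbour average only
                new[n] = avg
        emb = new
    return emb

# Two papers that share a species node; the species inherits a blend.
adj = {"doi1": ["species_X"], "doi2": ["species_X"],
       "species_X": ["doi1", "doi2"]}
init = {"doi1": [1.0, 0.0], "doi2": [0.0, 1.0]}
emb = propagate(adj, init)
```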
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Link Prediction</title>
        <p>After node embeddings are learned, either through contextual embedding propagation (as in EPHEN
and our proposed method) or structural path-based embedding (as in DeepWalk, Node2Vec and
Metapath2Vec), link prediction is performed using a K-nearest neighbors (KNN) approach in the embedding
space. Each node in the knowledge graph is represented as a dense vector that encodes its semantics
and structure. To predict links or retrieve associated properties (like species or bioactivity), the system
computes cosine similarity between the embedding of the query node and all other candidate nodes
of the target type. These candidates are then ranked by similarity, and the top-k most similar
nodes are selected as predictions. This is essentially a nearest-neighbor retrieval operation in a
high-dimensional vector space, where similarity corresponds to semantic and relational closeness in the
original heterogeneous information network (Knowledge Graph).</p>
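        <p>The retrieval step reduces to cosine-similarity ranking in the embedding space. A minimal sketch with toy vectors (in the real pipeline these would be the propagated node embeddings):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_links(query_emb, candidates, k=2):
    """Rank candidate nodes of the target type by similarity to the query."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_emb, candidates[c]),
                    reverse=True)
    return ranked[:k]

candidates = {"species_X": [1.0, 0.1],
              "species_Y": [0.0, 1.0],
              "species_Z": [0.9, 0.2]}
top = predict_links([1.0, 0.0], candidates, k=2)  # nearest species nodes
```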
        <p>This KNN-based ranking mechanism enables evaluation using standard metrics such as hits@k
(whether the correct link is among the top-k predictions) or MRR@k (Mean Reciprocal Rank), which
emphasizes the rank position of the first correct prediction.</p>
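        <p>These metrics can be computed directly from the ranked candidate lists; a small sketch on invented rankings:</p>

```python
def hits_at_k(rankings, truths, k):
    """Fraction of queries whose true node appears in the top-k predictions."""
    return sum(t in r[:k] for r, t in zip(rankings, truths)) / len(truths)

def mrr_at_k(rankings, truths, k):
    """Mean reciprocal rank of the first correct prediction within the top k."""
    total = 0.0
    for r, t in zip(rankings, truths):
        if t in r[:k]:
            total += 1.0 / (r.index(t) + 1)
    return total / len(truths)

# Three toy queries; the true answer is "a" each time.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
```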
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>The key aspect of the evaluation strategy is deliberately perturbing the Knowledge Graph during testing,
specifically by removing the links to the ground-truth property values for selected test nodes. This
simulates a real-world knowledge extraction scenario where information is incomplete or missing. The
model’s task is to restore these missing links by ranking candidate nodes based on their embedding
similarity to the test node. This setup is particularly meaningful for evaluating unsupervised graph
embeddings, as it tests the model’s ability to infer correct relationships purely from the remaining
structure and contextual cues (like co-occurring topics or known relationships in the graph). The use
of hits@k here measures how often the correct, previously removed value reappears within the top-k
predicted candidates, thereby reflecting the model’s capacity for knowledge graph completion. This
approach enables a rigorous and realistic evaluation of how well each embedding method performs.</p>
      <p>
        The evaluation was performed using the official BiKE challenge benchmark NatUKE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the
results are shown in Table 3. The table compares the final results of our proposed approach with the
benchmark results reported by the official BiKE challenge. It provides a comparative evaluation
of different embedding and extraction methods, DeepWalk, Node2Vec, Metapath2Vec, EPHEN, and the
proposed K-BERT regularization method, across five information extraction tasks: compound name
(C), bioactivity (B), species (S), collection site (L), and isolation type (T). The metric used is the hits@k
score for k = 50, 5, 20, 1, with the best results for each task and method in bold.
      </p>
      <p>The proposed K-BERT with regularization approach consistently outperformed the baseline
models across four extraction tasks: bioactivity, species, collection site, and isolation type. In particular,
the most notable gains were observed in bioactivity and collection site prediction, where the method
achieved hits@k scores significantly higher than all the baseline models. In the case of species and
isolation type, the proposed method does not uniformly outperform all baselines across every evaluation
step. However, it still delivers competitive results in the first evaluation stage for species and the
first three evaluation stages for isolation type. Overall, these findings suggest that the integration
of contextual embeddings via K-BERT, along with embedding propagation through regularization,
provides a richer and more discriminative representation of heterogeneous node types.</p>
      <p>An open repository containing instructions on how to verify the claimed results is publicly available
at https://github.com/bharathchand10/BiKE-SPHOTA.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Related Work</title>
      <p>Several recent works have aimed to improve the extraction of biochemical knowledge from scientific
literature, particularly within the context of the NatUKE benchmark. One such effort, ‘Leveraging
ChatGPT API for Enhanced Data Preprocessing in NatUKE’ [11], integrates OpenAI’s ChatGPT into the
data preprocessing pipeline to extract structured information, including compound names, bioactivity,
species, collection site, and isolation type, from the PDFs. This use of a general-purpose language model
serves to enrich the quality of input data before downstream processing.</p>
      <p>Another contribution, ’Improving Natural Product Automatic Extraction With Named Entity
Recognition’ [12], enhances the traditional NatUKE pipeline by incorporating Named Entity Recognition
(NER). Instead of basic token slicing, this approach extracts entire sentences that include target entities,
thus preserving contextual integrity and improving entity-level accuracy in extraction.</p>
      <p>
        The work ’Enhancing Biochemical Extraction with BFS-driven Knowledge Graph Embedding
approach’ [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduces a hybrid method combining Breadth-First Search (BFS) traversal of a knowledge
graph with Word2Vec-based embedding. By treating BFS-generated paths as sentences, the model
learns enriched vector representations for graph nodes, improving the capture of relational patterns in
the literature.
      </p>
      <p>Building upon these ideas, our work further investigates the synergy of structural embeddings and
language models for improved link prediction in biochemical knowledge graphs. The following sections
present the main works we directly adapt and build upon in our proposed approach.</p>
      <sec id="sec-4-1">
        <title>4.1. Embedding Propagation over Heterogeneous Information Networks (EPHEN)</title>
        <p>
          We have used EPHEN [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as the upstream model to inspire and shape our downstream
implementation. EPHEN is designed to integrate contextual text embeddings with the structural information
of heterogeneous networks. It first generates embeddings for nodes containing textual data using a
language model like SBERT and then propagates these embeddings to non-textual nodes through a
graph-based regularization framework. This regularization minimizes the distance between connected
node embeddings while preserving the original semantic representation of textual nodes. The resulting
unified latent space allows effective comparison and analysis of both textual and non-textual entities.
Building on this methodology, we adapted and deployed a downstream version of the model with
K-BERT.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. A comparative analysis of knowledge injection strategies for large language models in the scholarly domain</title>
        <p>The method of using K-BERT [9] for text embedding is adapted from the paper “A Comparative Analysis
of Knowledge Injection Strategies for Large Language Models in the Scholarly Domain” by Cadeddu
et al. [6]. This paper presents a thorough evaluation of different approaches for injecting structured
knowledge into transformer models, particularly for scientific article classification. Among the strategies
examined, K-BERT stands out for its sophisticated mechanism of knowledge injection through triple
augmentation, appending relevant triples from a knowledge graph directly into the input text. K-BERT
enhances token representations by constructing a sentence tree, which expands the sentence structure
without overwhelming it, and uses a visible matrix to selectively control which tokens are allowed to
influence each other during attention calculation. This ensures that only the pertinent triples affect the
corresponding entities in the original sentence, preserving contextual integrity. The combination of
the embedding layer, seeing layer, and mask self-attention mechanism allows K-BERT to learn rich,
contextualised embeddings that incorporate both linguistic and structured knowledge efficiently.</p>
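        <p>The visible-matrix idea can be illustrated with a toy mask: original sentence tokens attend to each other, while tokens of an injected triple branch attend only to their own branch and its anchor entity. This is a simplified sketch of the K-BERT masking rule, with invented indices:</p>

```python
def visible_matrix(sent_len, branches):
    """Build a 0/1 visibility mask for sent_len sentence tokens plus
    injected triple branches.

    branches: list of (anchor_index, branch_len); branch tokens are
    appended after the sentence and see only their branch and anchor.
    """
    n = sent_len + sum(blen for _, blen in branches)
    vis = [[0] * n for _ in range(n)]
    for i in range(sent_len):          # sentence tokens see each other
        for j in range(sent_len):
            vis[i][j] = 1
    pos = sent_len
    for anchor, blen in branches:
        group = list(range(pos, pos + blen)) + [anchor]
        for i in group:                # branch tokens + anchor see each other
            for j in group:
                vis[i][j] = 1
        pos += blen
    return vis

# A 3-token sentence with one 2-token triple branch anchored at token 1.
vis = visible_matrix(3, [(1, 2)])
```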
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion &amp; Future Works</title>
      <p>Among the evaluated methods, K-BERT with regularization emerges as one of the most consistent and
accurate, leveraging both structural and semantic information to outperform others across various
tasks. By leveraging the complementary strengths of structured topological embeddings and contextual
language model representations, the proposed framework seeks to overcome the limitations inherent in
using either approach independently. The development of an optimal knowledge integration architecture
tailored to the chosen Language Model is expected to significantly improve predictive performance and
application relevance.</p>
      <p>Future work will focus on extending this framework through the design of novel, domain-sensitive
loss functions that account for the varying impact of prediction errors in this specialized domain.
Additionally, further exploration of advanced integration strategies, such as graph-augmented transformers
and alternative architectures for tighter coupling between KGs and Language Models, will also be
pursued.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT (https://chat.openai.com/) for assistance
in refining academic language and enhancing grammatical accuracy.</p>
      <p>[6] A. Cadeddu, A. Chessa, V. De Leo, G. Fenu, E. Motta, F. Osborne, D. R. Recupero, A. Salatino,
L. Secchi, A comparative analysis of knowledge injection strategies for large language models in
the scholarly domain, Engineering Applications of Artificial Intelligence 133 (2024) 108166.</p>
      <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), 2019, pp. 4171–4186.</p>
      <p>[8] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
arXiv:2307.09288 (2023).</p>
      <p>[9] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, P. Wang, K-BERT: Enabling language
representation with knowledge graph, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 34, 2020, pp. 2901–2908.</p>
      <p>[10] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv
preprint arXiv:1908.10084 (2019).</p>
      <p>[11] P. Fröhlich, J. Gwozdz, M. Jooß, Leveraging ChatGPT API for enhanced data preprocessing in NatUKE,
in: TEXT2KG/BiKE@ESWC, 2023, pp. 244–255.</p>
      <p>[12] S. Schmidt-Dichte, I. J. Mócsy, Improving natural product automatic extraction with named entity
recognition, in: TEXT2KG/BiKE@ESWC, 2023, pp. 226–234.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>P. do Carmo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marcacini</surname>
          </string-name>
          ,
          <article-title>Embedding propagation over heterogeneous event networks for link prediction</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>4812</fpage>
          -
          <lpage>4821</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <article-title>Enhancing biochemical extraction with bfs-driven knowledge graph embedding approach</article-title>
          ., in: TEXT2KG/BiKE@ ESWC,
          <year>2023</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Do Carmo</surname>
          </string-name>
          , E. Marx,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marcacini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V. S. e</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pilon</surname>
          </string-name>
          ,
          <article-title>Natuke: A benchmark for natural product knowledge extraction from academic literature</article-title>
          ,
          <source>in: 2023 IEEE 17th International Conference on Semantic Computing (ICSC)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Pilon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Dametto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E. F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Castro-Gamboa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Andricopulo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Bolzani</surname>
          </string-name>
          ,
          <article-title>Nubbedb: an updated database to uncover chemical and biological information from brazilian biodiversity</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>7</volume>
          (
          <year>2017</year>
          )
          <fpage>7215</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          , Bertopic:
          <article-title>Leveraging bert and c-tf-idf to create easily interpretable topics</article-title>
          ,
          <source>Zenodo, Version v0 9</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>