<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Enhancing Biochemical Extraction with a BFS-driven Knowledge Graph Embedding Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhushan Zope</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sashikala Mishra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanju Tiwari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Symbiosis Institute of Technology, Symbiosis International (Deemed University) (SIU)</institution>
          ,
          <addr-line>Lavale, Pune 412115</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade Autonoma de Tamaulipas</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Knowledge Graph (KG) embedding represents the nodes and edges of a graph in a lower-dimensional space. It has many applications, including knowledge graph completion. Extracting the knowledge trapped in thousands of research papers in the biochemical domain is one such application. This work proposes a model that combines the Breadth-First Search (BFS) technique with the Word2Vec algorithm to generate an embedding for each node. First, the knowledge graph is explored using BFS to construct various paths. The Word2Vec model is then trained on these paths to obtain the embeddings of the respective nodes. Results show that this unsupervised approach produces reasonably good knowledge embeddings: hits@50 results for the edge types 'compound name' and 'specie' are 0.83 and 0.81, which are 415% and 184% better than the existing best method, respectively. For other edge types, such as 'bio-activity' and 'collection-site', results are reasonably close to the best.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Embedding Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>By harnessing the power of graph embedding algorithms, this approach aims to provide a comprehensive and structured representation of biochemical knowledge. The BFS-driven technique facilitates exploring relationships within the knowledge graph, enabling efficient and effective extraction of biochemical information.</p>
      <p>Researchers and professionals working in bioinformatics and drug discovery can benefit from a more thorough and organized representation of biochemical knowledge by adopting this BFS-driven Knowledge Graph Embedding approach. It can speed up data analysis, hypothesis generation, and decision-making, leading to faster scientific breakthroughs and progress in the biomedical field.</p>
      <p>Overall, this study aims to fill the gap between conventional biochemical extraction techniques and the increasing complexity of biomedical data, providing a promising method for extracting important knowledge from the vast amount of information already available and enabling researchers to pursue novel insights and discoveries.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>The term "embedding" is popular right now. The number of studies on the subject has exploded in recent years, particularly those dealing with word embeddings. Word embeddings are vector representations of words that preserve their meaning, generally in a Euclidean space. Following the introduction of the Word2Vec model [3], word embeddings have gained enormous popularity. Word2Vec and other language models have been extended to graph structures, as demonstrated by DeepWalk [4]. Word2Vec trains a neural network to predict nearby words in a text; in DeepWalk, the "sentences" of the text are the sequences of nodes visited during random walks. Word embedding models such as Word2Vec can then be used to find the embeddings of nodes by treating them as words in sentences. Although DeepWalk employs a uniform random walk, each network has unique connection patterns that must be considered when creating node representations. Based on this understanding, node2vec [5] introduced a more sophisticated random-walk technique that outperformed DeepWalk and can be more readily adapted to various graph connection patterns.</p>
      <p>A challenge for representation learning is the variety of node and link types, which makes it difficult to use traditional network embedding approaches. The metapath2vec model [6] constructs meta-path-based random walks to build a node’s heterogeneous neighborhood and then uses a heterogeneous skip-gram model to compute node embeddings. In numerous heterogeneous network mining tasks, metapath2vec can beat state-of-the-art embedding models and identify the structural and semantic linkages between different network objects.</p>
      <p>The majority of embedding techniques used today focus only on network topology. In contrast, EPHEN [7] uses a language-model-based embedding propagation method that combines textual information about events with the complex relationships between events to map them into a low-dimensional vector space. This makes gradual and adaptive embedding updates possible.</p>
      <p>Our research builds on these earlier investigations and uses a BFS-driven Knowledge Graph Embedding technique to improve biochemical extraction. The proposed method aims to capture biological entities’ structural and semantic context by combining the benefits of semantic embedding and graph traversal techniques. This comprehensive strategy has the potential to enhance the extraction process’s precision, effectiveness, and interpretability, allowing researchers to draw important conclusions from vast biological information repositories.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed BFS-driven Knowledge Graph Embedding Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the Approach</title>
        <p>Word2Vec is a popular algorithm for creating word embeddings in natural language processing. It represents words as numerical vectors. By looking at the surrounding words, it finds the context of a word and in this way learns the relationships and meanings of words.</p>
        <p>The general idea of the proposed method, as shown in Figure 1, is to utilize the Word2Vec approach for node embedding generation. However, Word2Vec relies on sentences to find word embeddings. Hence, in the knowledge graph context, the sequence of nodes appearing along a particular path can be treated as a sentence. Multiple such paths can then be given to Word2Vec for node embedding generation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>The dataset given for the BiKE challenge [8] is used for experimentation. The dataset [9] was generated by extracting information from peer-reviewed scientific articles, which served as the primary source of information for natural product extraction. It focuses on five NuBBE properties: Compound Name, Bioactivity, Species from Extraction, Collection Site, and Isolation Type. Figure 2 shows the number of distinct values for each property type. The dataset comes in four different split ratios, viz. 20/80, 40/60, 60/40, and 80/20 percent for testing and training, respectively. For each split ratio, ten randomly split knowledge graphs were given.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. BFS Exploration for Graph Embedding</title>
        <p>Breadth-First Search (BFS) is a simple algorithm for exploring and navigating a graph or network. Here, the knowledge graph is traversed in an all-source BFS manner: BFS explores all the nodes at the same level from the starting point before moving on to nodes one level further away. The proposed method uses this BFS approach to generate the node sequences.</p>
        <p>BFS ensures that it explores the graph layer by layer. It guarantees that nodes at a shallower
level (closer to the starting point) are visited before moving on to nodes at deeper levels. This
leads to structural awareness, which is the main advantage of BFS. As shown in Figure 3,
neighborhood nodes are closer in the sequence, contributing more to the node’s context in the
knowledge graph.</p>
        <p>However, BFS paths suffer from an important problem: nodes adjacent in the sequence may not be immediate neighbors of each other in the graph. For example, as shown in Figure 3, nodes 2, 1, and 3 are adjacent in the given BFS sequence (0 -&gt; 5 -&gt; 4 -&gt; 2 -&gt; 1 -&gt; 3) but are a few hops apart. If only nodes 2 and 3 were considered when finding the embedding for node 1, the embedding would not be appropriate. To mitigate this problem, a large window size of 10 is used during embedding generation; the large window enables the model to handle long-range dependencies, resulting in meaningful node representations. Additionally, five walks are constructed from each node by visiting four neighbors in each BFS iteration, for four iterations. This gives each node more opportunities to be surrounded by relevant, adjacent nodes, so the constructed walks capture the context of the neighborhood. These walks are then used to train the Word2Vec model, giving the embeddings for each node.</p>
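        <p>The walk construction described above can be sketched as follows (a stdlib-only sketch under stated assumptions: the graph is an adjacency-list dict, walk parameters follow the five-walks / four-neighbors / four-iterations scheme, and the toy graph is illustrative):</p>

```python
# BFS-style walk construction: from each start node, each walk visits up to
# `fanout` not-yet-visited neighbors per BFS iteration, for `iterations`
# iterations, and `walks_per_node` such walks are generated per node.
import random
from collections import deque

def bfs_walks(graph, walks_per_node=5, fanout=4, iterations=4, seed=42):
    rng = random.Random(seed)
    walks = []
    for start in graph:                         # all-source BFS
        for _ in range(walks_per_node):
            walk, visited = [start], {start}
            frontier = deque([start])
            for _ in range(iterations):
                next_frontier = deque()
                while frontier:
                    node = frontier.popleft()
                    neighbors = [n for n in graph[node] if n not in visited]
                    rng.shuffle(neighbors)      # randomize neighbor order
                    for n in neighbors[:fanout]:
                        visited.add(n)
                        walk.append(n)          # level-by-level node sequence
                        next_frontier.append(n)
                frontier = next_frontier
            walks.append(walk)
    return walks

# Toy graph matching the 6-node example of Figure 3 (adjacency lists).
graph = {0: [5, 4], 5: [0, 2], 4: [0, 1], 2: [5, 3], 1: [4], 3: [2]}
walks = bfs_walks(graph)
```

The resulting lists of node identifiers are the "sentences" later fed to Word2Vec.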
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussions</title>
      <p>The evaluation was performed using the official BiKE challenge benchmark NatUKE [9], and the results are listed in Table 1. The results for the proposed method are compared to those of DeepWalk, Node2Vec, Metapath2Vec, and EPHEN, which are taken from [9].</p>
      <p>We used the property prediction task and the hits@k performance metric on the dataset for experimentation. Not all properties have the same number of unique values; therefore, a different value of k in hits@k is used for different property predictions.</p>
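      <p>For clarity, the following is a minimal illustrative implementation of hits@k (an assumption about the metric's usual form, not the official NatUKE evaluation code): a prediction counts as a hit when the true property value appears among the top-k ranked candidates.</p>

```python
# hits@k: fraction of test cases whose true value is in the top-k predictions.
def hits_at_k(ranked_candidates, true_values, k):
    hits = sum(1 for ranked, true in zip(ranked_candidates, true_values)
               if true in ranked[:k])
    return hits / len(true_values)

# Hypothetical example: 2 of 3 test cases have the true value in the top-2.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "c", "b"]
score = hits_at_k(ranked, truth, k=2)  # 2/3
```

Larger k makes the metric more forgiving, which is why properties with many unique values (such as Compound Name, with 446) are evaluated at a larger k than properties with few.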
      <p>Evidently, the proposed method gives excellent results for the ’Compound Name’ and ’Specie’ properties. For the ’Compound Name’ property, hits@50 at the first evaluation stage is 0.9, slightly less than the result for Metapath2Vec. However, the results improve progressively in the subsequent evaluation stages, ending up 415% better than the existing best method. Similarly, the results for ’Specie’ are the best among all four other models at every evaluation stage.</p>
      <p>Furthermore, the results for the other two properties, i.e., ’Bioactivity’ and ’Collection Site’, are also encouraging. However, the results for EPHEN stand distinguishably apart from all other methods, while the results for the proposed method are similar to those of the remaining methods.</p>
      <p>On the other hand, results for the ’Isolation Type’ property are not encouraging and are very similar to those of Node2Vec, the method closest to the proposed one. There are only six unique values for the ’Isolation Type’ property, compared to 446 for the ’Compound Name’ property. Since there are few unique values, one value may appear with many different types of nodes in a path, making it very difficult to discriminate the context. Thus, limited diversity in distinct values contributes to relatively poor outcomes. Conversely, with high diversity in distinct values, each value appears with a specific set of nodes in a path, resulting in a precise understanding of context. Code and result files are kept at the GitHub repository: https://github.com/bhushan-zope/BiKE.</p>
      <p>(Figure: (a) hits@50 results for the ’Compound Name’ property; (b) hits@5 results for the ’Bioactivity’ property; (d) hits@20 results for the ’Collection Site’ property.)</p>
      <p>As shown in Figure 2, Bioactivity, Collection Site, and Isolation Type have very limited diversity. Results for these properties are exceptionally good for the EPHEN method, whereas results for the Compound Name and Specie properties, which have more unique values, are outstanding for the proposed method. It follows that the proposed approach is better suited to properties with more distinct values, whereas EPHEN is better suited to properties with fewer distinct values.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This research paper presented an approach for enhancing biochemical knowledge extraction through a BFS-driven Knowledge Graph Embedding method. The approach offers several advantages: the knowledge graph is traversed using a Breadth-First Search algorithm to capture the context of, and relationships between, biochemical entities. The results of our experiments showcase the potential of the BFS-driven Knowledge Graph Embedding approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Q.-C. Bui, P. M. Sloot, A robust approach to extract biomedical events from literature, Bioinformatics 28 (2012) 2654-2661. doi:10.1093/bioinformatics/bts487.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] L. Tari, S. Anwar, S. Liang, J. Hakenberg, C. Baral, Synthesis of pharmacokinetic pathways through knowledge acquisition and automated reasoning, in: Biocomputing 2010, World Scientific, 2010, pp. 465-476.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, ArXiv abs/1310.4546 (2013).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Perozzi, R. Al-Rfou, S. S. Skiena, DeepWalk: online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Dong, N. Chawla, A. Swami, metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] P. V. D. Carmo, R. M. Marcacini, Embedding propagation over heterogeneous event networks for link prediction, in: 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 4812-4821.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] BiKE challenge, 2023. URL: https://aksw.org/bike/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] P. V. do Carmo, E. Marx, R. Marcacini, M. Valli, J. V. S. e Silva, A. Pilon, NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature, in: 17th IEEE International Conference on Semantic Computing, IEEE, 2023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>