<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Enhancing Biochemical Extraction with a BFS-driven Knowledge Graph Embedding Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhushan Zope</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sashikala Mishra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanju Tiwari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Symbiosis Institute of Technology, Symbiosis International (Deemed University) (SIU)</institution>
          ,
          <addr-line>Lavale, Pune 412115</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade Autonoma de Tamaulipas</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Knowledge Graph (KG) embedding represents the nodes and edges of a graph in a lower-dimensional space. It has many applications, including knowledge graph completion. Extracting the knowledge trapped in thousands of research papers in the biochemical domain is one such application. This work proposes a model that combines the Breadth-First Search (BFS) technique with the Word2Vec algorithm to generate an embedding for each node. First, the knowledge graph is explored using BFS to construct various paths. The Word2Vec model is then trained on these paths to obtain the embeddings of the respective nodes. Results show that this unsupervised approach produces reasonably good knowledge embeddings: hits@50 results for the edge types 'compound name' and 'specie' are 0.83 and 0.81, which are 415% and 184% better than the existing best method, respectively. For other edge types, such as 'bio-activity' and 'collection-site', results are reasonably close to the best.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Embedding Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>By harnessing the power of graph embedding algorithms, this approach aims to provide a comprehensive and structured representation of biochemical knowledge. The BFS-driven technique facilitates exploring relationships within the knowledge graph, enabling efficient and effective extraction of biochemical information.</p>
      <p>Researchers and professionals working in bioinformatics and drug discovery can benefit from a more thorough and organized representation of biochemical knowledge by adopting this BFS-driven Knowledge Graph Embedding approach. It can speed up data analysis, hypothesis generation, and decision-making, leading to faster scientific breakthroughs and progress in the biomedical field.</p>
      <p>Overall, this study aims to fill the gap between conventional biochemical extraction techniques and the increasing complexity of biomedical data, providing a promising method for extracting important knowledge from the vast amount of information already available and enabling researchers to pursue novel insights and discoveries.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>The term "embedding" is popular right now. The number of studies on the subject has exploded in recent years, particularly those dealing with word embeddings. Word embeddings are vector representations of words that preserve their meaning, generally in a Euclidean space. Following the introduction of the Word2Vec model [3], word embeddings have gained enormous popularity. Word2Vec and other language models have been extended to graph structures, as demonstrated by DeepWalk [4]. Word2Vec trains a neural network to predict nearby words in a text; in DeepWalk, the "sentences" of the text are the sequences of nodes visited during random walks. Word embedding models such as Word2Vec can then be used to find the embeddings of nodes by treating them as words in sentences. Although DeepWalk employs a uniform random walk, each network has unique connection patterns that must be considered when creating node representations. Based on this understanding, node2vec [5] introduced a more sophisticated random-walk technique that outperformed DeepWalk and can be more readily adapted to various graph connection patterns.</p>
      <p>A challenge for representation learning is the variety of node and link types, which makes it difficult to use traditional network embedding approaches. The metapath2vec model [6] constructs meta-path-based random walks to build a node’s heterogeneous neighborhood and then uses a heterogeneous skip-gram model to compute node embeddings. In numerous heterogeneous network mining tasks, metapath2vec can beat state-of-the-art embedding models and identify the structural and semantic linkages between different network objects.</p>
      <p>The majority of embedding techniques used today focus only on network topology. In contrast, EPHEN [7] uses a language-model-based embedding propagation method that combines textual information about events with the complex relationships between events to map them into a low-dimensional vector space. This makes gradual and adaptive embedding updates possible.</p>
      <p>Our research builds on these earlier investigations and uses a BFS-driven Knowledge Graph Embedding technique to improve biochemical extraction. The proposed method aims to capture biological entities’ structural and semantic context by combining the benefits of semantic embedding and graph traversal techniques. This comprehensive strategy has the potential to enhance the extraction process’s precision, effectiveness, and interpretability, allowing researchers to draw important conclusions from vast biological information repositories.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed BFS-driven Knowledge Graph Embedding Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the Approach</title>
        <p>Word2Vec is a popular algorithm for creating word embeddings in natural language processing. It represents words as numerical vectors. By looking at the surrounding words, it finds the context of a word and in this way learns the relationships and meanings of words.</p>
        <p>The general idea of the proposed method, as shown in Figure 1, is to utilize the Word2Vec approach for node embedding generation. However, Word2Vec relies on sentences to find word embeddings. Hence, in the knowledge graph context, the sequence of nodes appearing along a particular path can be treated as a sentence. Multiple such paths can then be given to Word2Vec for node embedding generation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>The dataset given for the BiKE challenge [8] is used for experimentation. The dataset [9] was generated by extracting information from peer-reviewed scientific articles, which served as the primary source of information for natural product extraction. It focuses on five NuBBE properties: Compound Name, Bioactivity, Species from Extraction, Collection Site, and Isolation Type. Figure 2 shows the number of distinct values for each property type. The dataset comes in four different split ratios, viz. 20/80, 40/60, 60/40, and 80/20 percent for testing and training, respectively. For each split ratio, ten randomly split knowledge graphs were given.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. BFS Exploration for Graph Embedding</title>
        <p>Breadth-First Search (BFS) is a simple algorithm for exploring and navigating a graph or network. Here, the knowledge graph is traversed in an all-source BFS manner: BFS explores all the nodes at the same level from the starting point before moving on to nodes one level further away. The proposed method uses this BFS approach to generate the node sequences.</p>
        <p>BFS ensures that it explores the graph layer by layer. It guarantees that nodes at a shallower
level (closer to the starting point) are visited before moving on to nodes at deeper levels. This
leads to structural awareness, which is the main advantage of BFS. As shown in Figure 3,
neighborhood nodes are closer in the sequence, contributing more to the node’s context in the
knowledge graph.</p>
        <p>However, BFS paths suffer from an important problem: nodes adjacent in the sequence may not be immediate neighbors of each other in the graph. For example, as shown in Figure 3, nodes 2, 1, and 3 are adjacent in the given BFS sequence (0 -&gt; 5 -&gt; 4 -&gt; 2 -&gt; 1 -&gt; 3) but are a few hops apart. If only nodes 2 and 3 were considered when finding the embedding for node 1, the embedding would not be appropriate. To mitigate this problem, a large window size of 10 is used during embedding generation; the large window enables the model to handle long-range dependencies, resulting in meaningful node representations. Additionally, five walks are constructed from each node by visiting four neighbors in each BFS iteration, for four iterations. This gives each node more opportunities to be surrounded by relevant, adjacent nodes, so the constructed walks capture the context of the neighborhood. These walks are then used to train the Word2Vec model, giving the embeddings for each node.</p>
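        <p>The walk construction described above can be sketched as follows (a stdlib-only sketch under stated assumptions: the graph is an adjacency-list dict, walk parameters follow the five-walks / four-neighbors / four-iterations scheme, and the toy graph is illustrative):</p>

```python
# BFS-style walk construction: from each start node, each walk visits up to
# `fanout` not-yet-visited neighbors per BFS iteration, for `iterations`
# iterations, and `walks_per_node` such walks are generated per node.
import random
from collections import deque

def bfs_walks(graph, walks_per_node=5, fanout=4, iterations=4, seed=42):
    rng = random.Random(seed)
    walks = []
    for start in graph:                         # all-source BFS
        for _ in range(walks_per_node):
            walk, visited = [start], {start}
            frontier = deque([start])
            for _ in range(iterations):
                next_frontier = deque()
                while frontier:
                    node = frontier.popleft()
                    neighbors = [n for n in graph[node] if n not in visited]
                    rng.shuffle(neighbors)      # randomize neighbor order
                    for n in neighbors[:fanout]:
                        visited.add(n)
                        walk.append(n)          # level-by-level node sequence
                        next_frontier.append(n)
                frontier = next_frontier
            walks.append(walk)
    return walks

# Toy graph matching the 6-node example of Figure 3 (adjacency lists).
graph = {0: [5, 4], 5: [0, 2], 4: [0, 1], 2: [5, 3], 1: [4], 3: [2]}
walks = bfs_walks(graph)
```

The resulting lists of node identifiers are the "sentences" later fed to Word2Vec.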
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussions</title>
      <p>The evaluation was performed using the official BiKE challenge benchmark NatUKE [9], and the results are listed in Table 1. The results for the proposed method are compared to those of DeepWalk, Node2Vec, Metapath2Vec, and EPHEN, which are taken from [9].</p>
      <p>We used the property prediction task and the hits@k performance metric on the dataset for experimentation. Not all properties have the same number of unique values; therefore, a different value of k in hits@k is used for different property predictions.</p>
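      <p>For clarity, the following is a minimal illustrative implementation of hits@k (an assumption about the metric's usual form, not the official NatUKE evaluation code): a prediction counts as a hit when the true property value appears among the top-k ranked candidates.</p>

```python
# hits@k: fraction of test cases whose true value is in the top-k predictions.
def hits_at_k(ranked_candidates, true_values, k):
    hits = sum(1 for ranked, true in zip(ranked_candidates, true_values)
               if true in ranked[:k])
    return hits / len(true_values)

# Hypothetical example: 2 of 3 test cases have the true value in the top-2.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "c", "b"]
score = hits_at_k(ranked, truth, k=2)  # 2/3
```

Larger k makes the metric more forgiving, which is why properties with many unique values (such as Compound Name, with 446) are evaluated at a larger k than properties with few.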
      <p>Evidently, the proposed method gives excellent results for the ’Compound Name’ and ’Specie’ properties. For the ’Compound Name’ property, hits@50 at the first evaluation stage is 0.9, slightly less than the result for Metapath2Vec. However, the results improve progressively in the subsequent evaluation stages, ending up 415% better than the existing best method. Similarly, the results for ’Specie’ are the best among all four other models at every evaluation stage.</p>
      <p>Furthermore, the results for the other two properties, i.e., ’Bioactivity’ and ’Collection Site’, are also encouraging. However, the results for EPHEN stand distinguishably apart from all other methods, while the results for the proposed method are similar to those of the remaining methods.</p>
      <p>On the other hand, results for the ’Isolation Type’ property are not encouraging and are very similar to those of Node2Vec, the method closest to the proposed one. There are only six unique values for the ’Isolation Type’ property, compared to 446 for the ’Compound Name’ property. Since there are few unique values, one value may appear with many different types of nodes in a path, making it very difficult to discriminate the context. Thus, limited diversity in distinct values contributes to relatively poor outcomes. Conversely, with high diversity in distinct values, each value appears with a specific set of nodes in a path, resulting in a precise understanding of context. Code and result files are kept at the GitHub repository: https://github.com/bhushan-zope/BiKE.</p>
      <p>(Figure: (a) hits@50 results for the ’Compound Name’ property; (b) hits@5 results for the ’Bioactivity’ property; (d) hits@20 results for the ’Collection Site’ property.)</p>
      <p>As shown in Figure 2, Bioactivity, Collection Site, and Isolation Type have very limited diversity. Results for these properties are exceptionally good for the EPHEN method, whereas results for the Compound Name and Specie properties, which have more unique values, are outstanding for the proposed method. It follows that the proposed approach is better suited to properties with more distinct values, whereas EPHEN is better suited to properties with fewer distinct values.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This research paper presented an approach for enhancing biochemical knowledge extraction through a BFS-driven Knowledge Graph Embedding method. The approach offers several advantages: the knowledge graph is traversed using a Breadth-First Search algorithm to capture the context of, and relationships between, biochemical entities. The results of our experiments showcase the potential of the BFS-driven Knowledge Graph Embedding approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Q.-C. Bui, P. M. Sloot, A robust approach to extract biomedical events from literature, Bioinformatics 28 (2012) 2654-2661. doi:10.1093/bioinformatics/bts487.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] L. Tari, S. Anwar, S. Liang, J. Hakenberg, C. Baral, Synthesis of pharmacokinetic pathways through knowledge acquisition and automated reasoning, in: Biocomputing 2010, World Scientific, 2010, pp. 465-476.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, ArXiv abs/1310.4546 (2013).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Perozzi, R. Al-Rfou, S. S. Skiena, DeepWalk: online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Dong, N. Chawla, A. Swami, metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] P. V. D. Carmo, R. M. Marcacini, Embedding propagation over heterogeneous event networks for link prediction, in: 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 4812-4821.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] BiKE challenge, 2023. URL: https://aksw.org/bike/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] P. V. do Carmo, E. Marx, R. Marcacini, M. Valli, J. V. S. e Silva, A. Pilon, NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature, in: 17th IEEE International Conference on Semantic Computing, IEEE, 2023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>