
BERTKG-DDI: Towards Incorporating Entity-specific Knowledge Graph Information in Predicting Drug-Drug Interactions

Ishani Mondal

Microsoft Research Lab, Lavelle Road, Bengaluru, India

ishani @gmail.com


Off-the-shelf biomedical embeddings obtained from recently released pre-trained language models (such as BERT and XLNet) have demonstrated state-of-the-art accuracy on various natural language understanding (NLU) tasks in the biomedical domain. Relation Classification (RC) is one of the most critical of these tasks. In this paper, we explore how to incorporate domain knowledge about biomedical entities (such as drugs, diseases and genes), obtained from Knowledge Graph (KG) embeddings, for predicting Drug-Drug Interactions from a textual corpus. We propose a new method, BERTKG-DDI, which combines drug embeddings derived from each drug's interactions with other biomedical entities with a domain-specific BioBERT embedding-based RC architecture. Experiments conducted on the DDIExtraction 2013 corpus indicate that this strategy improves over other baseline architectures by 4.1% macro F1-score.

Introduction

When multiple drugs are administered concurrently to a patient, the combination may cure an ailment or it may lead to serious side-effects. These types of interactions are known as Drug-Drug Interactions (DDIs). Predicting DDIs is a difficult task because it requires understanding the underlying mechanism of action of the interacting drugs. Researchers have recently made numerous efforts towards the automatic extraction of DDIs from textual corpora (Sahu and Anand 2018), (Liu et al. 2016), (Sun et al. 2019), (Li and Ji 2019), (Mondal 2020) and towards predicting unknown DDIs from KGs (Purkayastha et al. 2019). Automatic extraction of DDIs from text helps to maintain large-scale databases and thereby assists medical experts in their diagnosis.

In parallel to the progress of DDI extraction from textual corpora, researchers have recently proposed various strategies for augmenting the chemical structure information and the textual descriptions of drugs (Zhu et al. 2020) to improve Drug-Drug Interaction prediction performance from corpora and Knowledge Graphs. Earlier work framed DDI prediction from a textual corpus as a relation classification problem (Sahu and Anand 2018), (Liu et al. 2016), (Sun et al. 2019), (Li and Ji 2019) using CNN- or RNN-based neural networks.

Following the massive success of pre-trained language models (Devlin et al. 2019), (Yang et al. 2019) in many NLP classification tasks, we formulate DDI classification as a relation classification task that leverages both entity and contextual information. We propose a model that leverages both domain-specific contextual embeddings (BioBERT) (Lee et al. 2019) of the target entities (drugs) and external information about them. In recent years, representation learning has played a pivotal role in solving various machine learning tasks.

In this work, we explore the direction of augmenting the text-based model with graph embeddings to predict the relation between two drugs from a textual corpus. We use an in-house Knowledge Graph (Bio-KG), built by curating the interactions among drugs, diseases and genes from multiple ontologies. To capture the complex underlying mechanism of interactions among the biomedical entities, we employ translation-based and semantics-preserving heterogeneous graph embeddings on Bio-KG and augment the entity representations jointly while training the relation classification model. Experiments conducted on the DDIExtraction 2013 corpus (Herrero-Zazo et al. 2013) reveal that this method outperforms the existing baseline models and is in line with the new research direction of fusing various sources of information for DDI prediction. In a nutshell, the major contributions of this work are summarized as follows:

1. We propose a novel method that jointly leverages textual and external Knowledge information to classify the relation type between the drug pairs mentioned in the text, showing the efficacy of external entity-specific information.

2. Our method achieves new state-of-the-art performance on the DDIExtraction 2013 corpus.

Problem Statement

Given an input instance or sentence $s$ with two target drug entities $d_1$ and $d_2$, the task is to classify the type of relation $y$ that holds between the two drugs, where $y \in \{y_1, \ldots, y_N\}$. Here $N$ denotes the number of relation types.

Methodology

Text-based Relation Classification

Our model for extracting DDIs from text is based on the pre-trained BERT-based relation classification model of (Wu and He 2019). Given a sentence $s$ with drugs $d_1$ and $d_2$, let $H$ be the final hidden state output of the BERT module. Let the vectors $H_i$ to $H_j$ be the final hidden state vectors from BERT for entity $d_1$, and $H_k$ to $H_m$ those for entity $d_2$. An average operation is applied to obtain a vector representation for each of the drug entities. A tanh activation followed by a fully connected layer is then applied to each of the two vectors, and the outputs for $d_1$ and $d_2$ are $H'_1$ and $H'_2$ respectively:

$$H'_1 = W\left[\tanh\left(\frac{1}{j - i + 1}\sum_{t=i}^{j} H_t\right)\right] + b \tag{1}$$

$$H'_2 = W\left[\tanh\left(\frac{1}{m - k + 1}\sum_{t=k}^{m} H_t\right)\right] + b \tag{2}$$

The weight ($W$) and bias ($b$) parameters are shared between the two entities. For the final hidden state vector of the first token ('[CLS]'), we also add an activation operation and a fully connected layer, which is formally expressed as:

$$H'_0 = W_0(\tanh(H_0)) + b_0 \tag{3}$$

Matrices $W_0$, $W_1$, $W_2$ have the same dimensions, i.e. $W_0 \in \mathbb{R}^{d \times d}$, $W_1 \in \mathbb{R}^{d \times d}$, $W_2 \in \mathbb{R}^{d \times d}$, where $d$ is the hidden state size from BERT and $W_1 = W_2 = W$ because the entity layer is shared. We concatenate $H'_0$, $H'_1$ and $H'_2$ and then add a fully connected layer and a softmax layer, which is expressed as:

$$h'' = W_3[\mathrm{concat}(H'_0; H'_1; H'_2)] + b_3 \tag{4}$$

$$y'_t = \mathrm{softmax}(h'') \tag{5}$$

Here $W_3 \in \mathbb{R}^{N \times 3d}$, and $y'_t$ is the softmax probability output over the $N$ relation types. In Equations (1), (2), (3), (4) the bias vectors are $b$ (shared, i.e. $b_1 = b_2 = b$), $b_0$ and $b_3$. We use cross entropy as the loss function. We denote this text-based architecture as BERT-Text-DDI.
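The classification head described above can be illustrated with a minimal plain-Python sketch. It is not the author's implementation: the hidden states are random stand-ins for BERT outputs, the sizes (`d = 4`, five classes) and the names `text_head`, `params` and `random_matrix` are assumptions made only for this example.

```python
import math
import random

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(weight, bias, x):
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def tanh_vec(x):
    return [math.tanh(v) for v in x]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def text_head(hidden, span1, span2, params):
    """hidden: token vectors H_0..H_T; spans are inclusive (start, end) indices."""
    i, j = span1
    k, m = span2
    # Entity branches share W and b.
    h1 = linear(params["W"], params["b"], tanh_vec(mean_vector(hidden[i:j + 1])))
    h2 = linear(params["W"], params["b"], tanh_vec(mean_vector(hidden[k:m + 1])))
    # [CLS] branch uses its own W0 and b0.
    h0 = linear(params["W0"], params["b0"], tanh_vec(hidden[0]))
    # List concatenation gives [H'_0; H'_1; H'_2] of size 3d.
    logits = linear(params["W3"], params["b3"], h0 + h1 + h2)
    return softmax(logits)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, n_classes, seq_len = 4, 5, 8  # toy sizes, assumed for illustration
params = {
    "W": random_matrix(d, d, rng), "b": [0.0] * d,
    "W0": random_matrix(d, d, rng), "b0": [0.0] * d,
    "W3": random_matrix(n_classes, 3 * d, rng), "b3": [0.0] * n_classes,
}
hidden = random_matrix(seq_len, d, rng)  # stand-in for BERT final hidden states
probs = text_head(hidden, (2, 3), (5, 6), params)
```

In practice the hidden states would come from a fine-tuned transformer encoder and the parameters would be learned with the cross-entropy loss; the sketch only shows the shape of the computation.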

Entity Representation from KG

To infuse external information about the entities into the relation classification task, we obtain representations of the two drug entities mentioned in each input instance. We use an in-house heterogeneous biomedical Knowledge Graph (Bio-KG) consisting of target-target, drug-drug, drug-disease, drug-target, disease-disease and disease-target interactions collected from a large number of ontologies such as DrugBank (https://go.drugbank.com/), BioSNAP (http://snap.stanford.edu/biodata/) and UniProt (https://www.uniprot.org/) (The UniProt Consortium 2016). The overall statistics of Bio-KG are enumerated in Table 1. The real-world facts observed in Bio-KG are stored as a collection of triples of the form (h, r, t). Each triple is composed of a head entity $h \in E$, a tail entity $t \in E$, and a relation $r \in R$ between them, e.g., (paracetamol, treats, fever), which stores in Bio-KG the fact that paracetamol is effective in curing fever. Here, $E$ denotes the set of entities and $R$ denotes the set of relations. There are three different types of $E$ in Bio-KG (drugs, diseases and targets) and five different types of $R$ (target-target, drug-disease, drug-target, disease-disease and disease-target interactions).

Table 1: Overall statistics of Bio-KG.
Node type | Drug | Target | Disease
Count | 6512 | 30098 | 23458
Edge type | Drug-Target | Target-Target | Drug-Disease | Disease-Disease | Disease-Target | Total Edges
Count | not recoverable
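As a small illustration of the (h, r, t) storage described above, the following sketch indexes a handful of triples and maps entities and relations to integer ids, which is the usual input format for KG embedding training. Only the paracetamol triple comes from the text; every other entity name, the relation labels and the function `index_triples` are hypothetical.

```python
from collections import defaultdict

# Toy triples: only the first one appears in the text, the rest are hypothetical.
TRIPLES = [
    ("paracetamol", "treats", "fever"),
    ("drug_a", "drug-target", "target_x"),
    ("target_x", "target-target", "target_y"),
    ("disease_p", "disease-target", "target_y"),
]

def index_triples(triples):
    """Return entity ids, relation ids and an adjacency list keyed by head entity."""
    entity_ids, relation_ids = {}, {}
    outgoing = defaultdict(list)
    for head, rel, tail in triples:
        for ent in (head, tail):
            entity_ids.setdefault(ent, len(entity_ids))
        relation_ids.setdefault(rel, len(relation_ids))
        outgoing[head].append((rel, tail))
    return entity_ids, relation_ids, outgoing

entity_ids, relation_ids, outgoing = index_triples(TRIPLES)

# Integer-encoded triples, ready to be fed to an embedding model.
encoded = [(entity_ids[h], relation_ids[r], entity_ids[t]) for h, r, t in TRIPLES]
```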

The aim of Knowledge Graph embedding is to embed the entities and relations into a low-dimensional continuous vector space, so as to simplify computations on the KG. Most embedding methods use the facts in the KG to perform the embedding task, enforcing the embeddings to be compatible with those facts. The embeddings provide a generalizable context about the overall Knowledge Graph (KG) that can be used to infer relations. In this work, we employ some off-the-shelf KG embeddings to encode the representation of each drug (in terms of its relationships with other entities). The knowledge graph embeddings (KGEs) are computed so that they satisfy certain properties, i.e., they follow a given KGE model. These KGE models define different score functions that measure the distance between two entities relative to their relation type in the low-dimensional embedding space. The score functions are used to train the KGE models so that entities connected by relations are close to each other while entities that are not connected are far apart. The KGEs used in our experiments are explained below:

• TransE (Bordes et al. 2013): Given a fact (h, r, t), the relation in TransE is interpreted as a translation vector $\mathbf{r}$ so that the embedded entities $\mathbf{h}$ and $\mathbf{t}$ can be connected by $\mathbf{r}$, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ when (h, r, t) holds. The scoring function is defined as the (negative) distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$, i.e.,

$$f_r(h, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert \tag{6}$$

• TransR (Lin et al. 2015): Given a fact (h, r, t), TransR first projects the entity representations $\mathbf{h}$ and $\mathbf{t}$ into the space specific to relation r:

$$\mathbf{h}_\perp = M_r \mathbf{h}, \quad \mathbf{t}_\perp = M_r \mathbf{t} \tag{7}$$

Here $M_r$ is a projection matrix from the entity space to the relation space of r. The projected entities are then scored as in TransE, by the (negative) distance between $\mathbf{h}_\perp + \mathbf{r}$ and $\mathbf{t}_\perp$.

• RESCAL (Nickel, Tresp, and Kriegel 2011): Each relation in RESCAL is represented as a matrix which models pairwise interactions between latent factors. The score of a fact (h, r, t) is defined by a bilinear function, where $\mathbf{h}$, $\mathbf{t}$ are vector representations of the entities and $M_r$ is a matrix associated with the relation. This score captures pairwise interactions between all components of $\mathbf{h}$ and $\mathbf{t}$:

$$f_r(h, t) = \mathbf{h}^\top M_r \mathbf{t} = \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} [M_r]_{ij}\,[\mathbf{h}]_i\,[\mathbf{t}]_j \tag{8}$$

• DistMult (Yang et al. 2015): DistMult simplifies RESCAL by restricting $M_r$ to diagonal matrices. For each relation r, it introduces a vector embedding $\mathbf{r}$ and requires $M_r = \mathrm{diag}(\mathbf{r})$. The scoring function is defined as:

$$f_r(h, t) = \mathbf{h}^\top \mathrm{diag}(\mathbf{r})\,\mathbf{t} = \sum_{i=0}^{d-1} [\mathbf{r}]_i\,[\mathbf{h}]_i\,[\mathbf{t}]_i \tag{9}$$

This score captures pairwise interactions only between the components of $\mathbf{h}$ and $\mathbf{t}$ along the same dimension, and reduces the number of parameters to O(d) per relation.

From Bio-KG, we train these KG embeddings and obtain the representations of all the nodes. In our case, we are only interested in the representations of the drug nodes. We denote the KG representation of drug d as kge.
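The four score functions listed above can be written out directly. The sketch below is a plain-Python illustration rather than the training code used in the experiments; the L2 norm for the translation-based scores is an assumption (the text only says "distance"), and all function names are chosen for this example.

```python
import math

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def transe_score(h, r, t):
    """Negative distance between h + r and t (L2 norm assumed here)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def transr_score(h, r, t, m_r):
    """Project h and t into the relation space with M_r, then score as in TransE."""
    return transe_score(matvec(m_r, h), r, matvec(m_r, t))

def rescal_score(h, m_r, t):
    """Bilinear score h^T M_r t over all pairs of components."""
    return sum(m_r[i][j] * h[i] * t[j]
               for i in range(len(h)) for j in range(len(t)))

def distmult_score(h, r, t):
    """RESCAL restricted to a diagonal relation matrix diag(r)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

# DistMult agrees with RESCAL when M_r is the diagonal matrix built from r.
h, r, t = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
diag_r = [[3.0, 0.0], [0.0, 4.0]]
same = distmult_score(h, r, t) == rescal_score(h, diag_r, t)
```

A higher score means a more plausible fact under each model, so a triple that satisfies h + r = t exactly receives the maximum TransE score of zero.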

BERTKG-DDI

From the input instance $s$ with two tagged target drug entities $d_1$ and $d_2$, we obtain the KG embedding representations of the two drugs, $kge_1$ and $kge_2$ respectively, using Bio-KG. We concatenate these two embeddings and pass them through a fully connected layer, as represented below:

$$kge = W[\mathrm{concat}(kge_1; kge_2)] + b \tag{10}$$

$W$ and $b$ are the parameters of the fully connected layer over the KG representations $kge_1$ and $kge_2$. The final layer of the BERTKG-DDI model concatenates all the previous text-based outputs with the drug representation from the KG, as expressed below:

$$o' = W_3[\mathrm{concat}(H'_0; H'_1; H'_2; kge)] + b_3 \tag{11}$$

$$y'_t = \mathrm{softmax}(o') \tag{12}$$

Finally, the model is trained by minimizing the cross-entropy loss.
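The fusion step can be sketched as follows, under stated assumptions: the text-side vectors and KG embeddings are hand-picked toy values, the output size of the KG fully connected layer (here 2) is not given in the text and is chosen arbitrarily, and the names `fuse_and_classify`, `W_kg` and `b_kg` are introduced only for this example.

```python
import math

def linear(weight, bias, x):
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(h0, h1, h2, kge1, kge2, params):
    """Combine text-based outputs with KG drug embeddings and classify."""
    # Project the concatenated drug embeddings through a fully connected layer.
    kge = linear(params["W_kg"], params["b_kg"], kge1 + kge2)
    # Concatenate text outputs with the KG representation, then classify.
    logits = linear(params["W3"], params["b3"], h0 + h1 + h2 + kge)
    return softmax(logits)

# Toy sizes (assumed): d = 2, KG embedding size = 3, KG layer output = 2, N = 3.
params = {
    "W_kg": [[0.1] * 6 for _ in range(2)],
    "b_kg": [0.0, 0.0],
    "W3": [[0.1 * (row + 1)] * 8 for row in range(3)],
    "b3": [0.0, 0.0, 0.0],
}
probs = fuse_and_classify(
    h0=[0.5, 0.1], h1=[0.2, 0.3], h2=[0.4, 0.6],
    kge1=[1.0, 0.0, 1.0], kge2=[0.0, 1.0, 0.0],
    params=params,
)
```

Note that the final weight matrix must accept the widened input: with a text representation of size 3d and a KG projection of size k, it has 3d + k columns in this sketch.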

Experimental Setup

Dataset and Pre-processing

We follow the task setting of Task 9.2 of the DDIExtraction 2013 shared task (Herrero-Zazo et al. 2013) for evaluation. The corpus consists of MEDLINE documents annotated with drug mentions and five types of interactions: Mechanism, Effect, Advice, Interaction and Other. The task is a multi-class classification problem in which each drug pair in a sentence is classified into one of these types. We evaluate using three standard metrics: Precision (P), Recall (R) and F1-score (F1).
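For reference, per-class precision, recall and F1 and their macro average can be computed as in the sketch below. The gold and predicted labels are invented toy data, and the choice of which classes enter the macro average is an assumption; the official shared-task scorer may differ.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall and F1 for a single relation type."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    return sum(per_class_prf(gold, pred, lab)[2] for lab in labels) / len(labels)

# Toy example with invented labels.
gold = ["Effect", "Effect", "Advice", "Other"]
pred = ["Effect", "Advice", "Advice", "Other"]
score = macro_f1(gold, pred, ["Effect", "Advice", "Other"])
```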

During pre-processing, we obtain the DRUG mentions in the corpus and map them to unique DrugBank (https://go.drugbank.com/) identifiers. This converts each drug mention into its respective DrugBank ID, a step of entity linking (Mondal et al. 2019), (Leaman, Dogan, and Lu 2013). The mention normalization is performed based on the longest overlap between a drug mention and the drug names in DrugBank, and the normalized drugs are then mapped to the different knowledge sources used to construct Bio-KG.

Table 2: Embeddings on BERT-Text-DDI. Rows: bert-base-cased, scibert-scivocab-uncased, biobert v1.0 pubmed pmc, biobert v1.1 pubmed (scores not recoverable).
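One way to read the longest-overlap normalization is sketched below: each mention is compared against every lexicon name and assigned the identifier of the name that shares the longest common substring with it. This is an interpretation, not the author's procedure; the lexicon entries, the placeholder identifiers (not real DrugBank IDs), the `min_overlap` threshold and the function names are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: drug name -> placeholder identifier (not real DrugBank IDs).
LEXICON = {
    "aspirin": "ID_0001",
    "warfarin": "ID_0002",
    "warfarin sodium": "ID_0002",
}

def longest_overlap(a, b):
    """Length of the longest common contiguous substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def normalize_mention(mention, lexicon, min_overlap=4):
    """Return the id of the best-overlapping name, or None if the overlap is too short."""
    mention = mention.lower().strip()
    best_id, best_size = None, 0
    for name, drug_id in lexicon.items():
        size = longest_overlap(mention, name.lower())
        if size > best_size:
            best_id, best_size = drug_id, size
    return best_id if best_size >= min_overlap else None

linked = normalize_mention("Aspirin tablets", LEXICON)
unlinked = normalize_mention("xyz", LEXICON)
```

Mentions that return `None` correspond to drugs that could not be normalized and would need a fallback representation.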

Training Details

For our experiments, we initialize the model with various pre-trained contextual embeddings. Specifically, we use bert-base-cased (https://huggingface.co/bert-base-cased), scibert-scivocab-uncased (Beltagy, Lo, and Cohan 2019) (https://github.com/allenai/scibert) and the domain-specific biobert v1.0 pubmed pmc and biobert v1.0 pubmed (https://github.com/dmis-lab/biobert) as the initialization of the transformer encoder in BERTKG-DDI. We keep the maximum sequence length uniformly at 300 for all the embedding ablations and train for 5 epochs. For the KG embeddings, we set the embedding dimension to 200. Stochastic Gradient Descent (SGD) is used for optimization with an initial learning rate of 0.0001, and the model is trained for 300 epochs. After training the embeddings, we obtain the final representation of each drug. For the drugs mentioned in an input instance, we use the obtained embeddings as shown in Equation (11). We initialize the drugs that could not be normalized using pre-trained word2vec embeddings (of dimension 200, the same as the KG embeddings) trained on PubMed (http://evexdb.org/pmresources/ngrams/PubMed/).
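The fallback for non-normalized drugs can be pictured as a simple two-level lookup. The sketch below uses toy 3-dimensional vectors instead of the 200-dimensional ones described above; the dictionary contents, the placeholder identifier, the zero-vector default for mentions missing from both tables, and the function name `drug_vector` are assumptions made for illustration.

```python
def drug_vector(mention, drug_id, kg_embeddings, word_vectors, dim):
    """Prefer the KG embedding of a normalized drug, else fall back to word2vec."""
    if drug_id is not None and drug_id in kg_embeddings:
        return kg_embeddings[drug_id]
    # Assumed default: a zero vector when the mention is in neither table.
    return word_vectors.get(mention.lower(), [0.0] * dim)

# Toy tables with hypothetical contents.
kg_embeddings = {"ID_0001": [0.1, 0.2, 0.3]}
word_vectors = {"somedrug": [0.9, 0.8, 0.7]}

from_kg = drug_vector("Aspirin", "ID_0001", kg_embeddings, word_vectors, dim=3)
from_w2v = drug_vector("SomeDrug", None, kg_embeddings, word_vectors, dim=3)
unknown = drug_vector("unseen", None, kg_embeddings, word_vectors, dim=3)
```

Keeping both tables at the same dimensionality, as the text specifies, lets either vector feed the same downstream fully connected layer.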

Table 5 (methods compared): (Zhang et al. 2017), (Vivian et al. 2017), (Asada, Miwa, and Sasaki 2018), (Sun et al. 2019), (Zhu et al. 2020), Our method (BERTKG-DDI) (scores not recoverable).

Results and Discussion

In this section, we provide a detailed analysis of the various results and findings that we have observed during experiments. We show empirical results based on BERTKG-DDI for both text and KG information.

Ablation of Embeddings on BERT-Text-DDI: During ablation analysis, we observe that incorporating domain-specific information through biobert v.1 pubmed boosts the predictive performance in terms of macro-F1 score (across all relation types) by 2.3% compared to bert-base-cased. Moreover, the scibert-scivocab-uncased embeddings, owing to the scientific details obtained during fine-tuning, achieve a reasonable boost in performance. The biobert v.1 pubmed based BERT-Text-DDI is the best-performing text-based relation classification model. The results are enumerated in Table 2.

Ablation analysis of KG Embeddings on BERTKG-DDI: We compare the different KG embeddings for drugs obtained from Bio-KG after augmenting them with the BERT-Text-DDI model in Table 3. Semantic-matching models such as RESCAL and DistMult measure the plausibility of facts by matching the latent semantics of both relations and entities in their vector space. In our experiments, they outperform translation-based KGEs such as TransE and TransR by an average of 1% macro F1-score.

Advantage of KG information on BERTKG-DDI: During empirical analysis of the BERTKG-DDI model, we observe how much performance gain can be achieved by augmenting KG embeddings. From the results enumerated in terms of macro F1-score on all the relation types in Table 4, we observe that the best-performing BERT-Text-DDI model achieves a performance boost of 1.8% after augmenting KG information in BERTKG-DDI.

Comparison with the existing baselines: We compare our best-performing model with some of the best-performing existing baselines. (Asada, Miwa, and Sasaki 2018) proposed a neural method to extract drug-drug interactions (DDIs) from texts using external drug molecular structure information. They encode textual drug pairs with convolutional neural networks and the corresponding molecular pairs with graph convolutional networks (GCNs), and then concatenate the outputs of these two networks. (Vivian et al. 2017) proposed a model that classifies DDIs from the literature by combining an attention mechanism with a recurrent neural network with long short-term memory (LSTM) units. (Zhang et al. 2017) presented a hierarchical recurrent neural network (RNN)-based method that integrates the shortest dependency path (SDP) and the sentence sequence for the DDI extraction task. (Sun et al. 2019) proposed a recurrent hybrid convolutional neural network (RHCNN) for DDI extraction from biomedical literature. In the embedding layer, the texts mentioning two entities are represented as a sequence of semantic embeddings and position embeddings. In particular, the complete semantic embedding is obtained by fusing a word embedding with its contextual information, which is learnt by a recurrent structure. Recently, (Zhu et al. 2020) proposed multiple entity-aware attentions with various entity information to strengthen the representations of drug entities in sentences. They integrate drug descriptions from Wikipedia and DrugBank into their model to enhance the semantic information of drug entities. They also modified the output of the BioBERT model, and their results show that this is better than using the BioBERT model directly. In comparison, our method achieves state-of-the-art performance on the DDIExtraction 2013 corpus (in terms of the F1-scores of all the relation types), as shown in Table 5.

Conclusion

In this paper, we propose an approach, BERTKG-DDI, for DDI relation classification based on pre-trained language models and Knowledge Graph embeddings of the drug entities. Experiments conducted on a benchmark DDI dataset demonstrate the effectiveness of our proposed method. Possible directions for further research include exploring other external drug representations, such as chemical structure and textual descriptions, for predicting DDIs from a textual corpus.

References

Leaman, R.; Dogan, R.; and Lu, Z. 2013. DNorm: Disease Name Normalization with Pairwise Learning to Rank. Bioinformatics (Oxford, England) 29. doi:10.1093/bioinformatics/btt474.

Li, D.; and Ji, H. 2019. Syntax-aware Multi-task Graph Convolutional Networks for Biomedical Relation Extraction. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 28–33. Hong Kong: Association for Computational Linguistics. doi:10.18653/v1/D19-6204. URL https://www.aclweb.org/anthology/D19-6204.

Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, 2181–2187. AAAI Press. ISBN 0262511290.

Mondal, I. 2020. BERTChem-DDI: Improved Drug-Drug Interaction Prediction from text using Chemical Structure Information. In Proceedings of Knowledgeable NLP: the First Workshop on Integrating Structured Knowledge and Neural Networks for NLP, 27–32. Suzhou, China: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.knlp-1.4.

Purkayastha, S.; Mondal, I.; Sarkar, S.; Goyal, P.; and Pillai, J. K. 2019. Drug-Drug Interactions Prediction Based on Drug Embedding and Graph Auto-Encoder. In 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), 547–552.

Sun, X.; Dong, K.; Ma, L.; Sutcliffe, R.; He, F.; Chen, S.; and Feng, J. 2019. Drug-Drug Interaction Extraction via Recurrent Hybrid Convolutional Neural Networks with an Improved Focal Loss. Entropy 21(1): 37. ISSN 1099-4300. doi:10.3390/e21010037. URL http://dx.doi.org/10.3390/e21010037.

Wu, S.; and He, Y. 2019. Enriching Pre-trained Language Model with Entity Information for Relation Classification. CoRR abs/1905.08284. URL http://arxiv.org/abs/1905.08284.

Yang, B.; Yih, W.-t.; He, X.; Gao, J.; and Deng, L. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. CoRR abs/1412.6575.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.

Asada, M.; Miwa, M.; and Sasaki, Y. 2018. Enhancing Drug-Drug Interaction Extraction from Texts by Molecular Structure Information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 680–685. Melbourne, Australia: Association for Computational Linguistics. doi:10.18653/v1/P18-2108. URL https://www.aclweb.org/anthology/P18-2108.

Beltagy, I.; Lo, K.; and Cohan, A. 2019.

Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.