The Impact of Self-Interaction Attention on the Extraction of Drug-Drug Interactions

Luca Putelli1,2, Alfonso E. Gerevini1, Alberto Lavelli2, Ivan Serina1
1 Università degli Studi di Brescia, 2 Fondazione Bruno Kessler
{alfonso.gerevini, ivan.serina}@unibs.it, {l.putelli, lavelli}@fbk.eu

Abstract

Since many medical treatments require the intake of multiple drugs, the discovery of how these drugs interact with each other, potentially causing health problems to patients, is the subject of a huge quantity of documents. In order to obtain this information from free text, several methods involving deep learning have been proposed over the years. In this paper we introduce a Recurrent Neural Network-based method combined with the Self-Interaction Attention Mechanism. The method is applied to the DDIExtraction-2013 task, a popular challenge concerning the extraction and classification of drug-drug interactions. Our focus is to show its effect on the tendency to predict the majority class and how it differs from other types of attention mechanisms.

1 Introduction

Given the increase of publications regarding side effects, adverse drug reactions and, more in general, how taking drugs can put patients' health at risk, a large quantity of free text containing crucial information has become available. For doctors and researchers, accessing this information is a very demanding task, given the number and the complexity of such documents.

Hence, the automatic extraction of Drug-Drug Interactions (DDI), i.e. situations where the simultaneous intake of drugs can cause adverse drug reactions, is the goal of the DDIExtraction-2013 task (Segura-Bedmar et al., 2014). DDIs have to be extracted from a corpus of free-text sentences, combining machine learning with natural language processing (NLP).

Starting from the introduction of word embedding techniques like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) for word representation, Recurrent Neural Networks (RNN), and in particular Long Short-Term Memory networks (LSTM), have become the state-of-the-art technology for most natural language processing tasks, such as text classification and relation extraction.

The main idea behind the attention mechanism (Bahdanau et al., 2014) is that the model "pays attention" only to the parts of the input where the most relevant information is present. In our case, this mechanism assigns a higher weight to the most influential words, i.e. the ones which describe an interaction between drugs.

Several attention mechanisms have been proposed in the last few years (Hu, 2018). In particular, the self-interaction mechanism (Zheng et al., 2018) applies attention with a different weight vector for each word in the sequence, producing a matrix that represents the influence between all word pairs. We consider this information very meaningful, especially in a task like this one, where we need to discover connections between pairs of words.

In this paper we show how self-interaction attention improves the results on the DDI-2013 task, comparing it to other types of attention mechanisms. Given that this dataset is strongly unbalanced, the main focus of the analysis is how each attention mechanism deals with the tendency to predict the majority class.
2 Related work

The best performing teams in the original DDI-2013 challenge (Segura-Bedmar et al., 2014) used SVMs (Björne et al., 2013) but, more recently, Convolutional Neural Networks (CNN) (Liu et al., 2016; Quan et al., 2016) and, mostly, Recurrent Neural Networks (RNN) have proved to be the new state of the art.

Kumar and Anand (2017) propose a double LSTM. The sentences are processed by two different bidirectional LSTM layers: one followed by a max-pooling layer and the other by a custom-made attention-pooling layer that assigns weights to words. Furthermore, Zhang et al. (2018) design a multi-path LSTM neural network. Three parallel bidirectional LSTM layers process the sentence sequence and a fourth one processes the shortest path between the two candidate drugs in the dependency tree. The output of these four layers is merged and handled by another bidirectional LSTM layer.

Zheng et al. (2017) apply attention directly to word vectors, creating a "candidate-drugs-oriented" input which is processed by a single LSTM layer.

Yi et al. (2017) use an RNN with Gated Recurrent Units (GRU) (Cho et al., 2014) instead of LSTM units, followed by a standard attention mechanism, and exploit the information contained in other sentences with a custom-made sentence attention mechanism.

Putelli et al. (2019) introduce an LSTM model followed by a self-interaction attention mechanism which computes, for each pair of words, a vector representing how much one is related to the other. These vectors are concatenated into a single one which is passed to a classification layer. In this paper, starting from the results reported in Putelli et al. (2019), we improve the input representation and the negative instance filtering, and we extend the analysis of self-interaction attention, comparing it to more standard attention mechanisms.

3 Dataset description

This dataset was released for the shared challenge SemEval 2013 - Task 9 (Segura-Bedmar et al., 2014) and contains annotated documents from the biomedical literature. In particular, there are two different sources: abstracts from MEDLINE research articles and texts from DrugBank.

Every document is divided into sentences and, for each sentence, the dataset provides annotations of every drug mentioned. The task requires to classify all the possible n(n-1)/2 pairs of the n drugs mentioned in the given sentences. The dataset provides the instances with their classification value.

There are five different classes: unrelated: there is no relation between the two drugs mentioned; effect: the text describes the effect of the drug-drug interaction; advise: the text recommends avoiding the simultaneous intake of the two drugs; mechanism: the text describes an anomaly in the absorption of a drug if it is taken simultaneously with another one; int: the text states a generic interaction between the drugs.

4 Pre-processing

The pre-processing phase exploits the "en_core_web_sm" model of spaCy (https://spacy.io), a Python tool for Natural Language Processing, and is composed of the following steps:

Substitution: after tokenization and POS-tagging, the drug mention tokens are replaced by the standard terms PairDrug1 and PairDrug2. In the particular case when the pair is composed of two mentions of the same drug, these are replaced by NoPair. Every other drug mentioned in the sentence is replaced with the generic name Drug.

Shortest dependency path: spaCy produces the dependency tree associated to the sentence, with tokens as nodes and dependency relations between the words as edges. Then, we calculate the shortest path in the dependency tree between PairDrug1 and PairDrug2.

Offset features: given a word w in the sentence, D1 is calculated as the distance (in terms of words) from the first drug mention, divided by the length of the sentence. Similarly, D2 is calculated as the distance from the second drug mention.

4.1 Negative instance filtering

The DDI-2013 dataset contains many "negative instances", i.e. instances that belong to the unrelated class. In an unbalanced dataset, machine learning algorithms are more likely to classify a new instance as the majority class, leading to poor performance for the minority classes (Weiss and Provost, 2001). Given that previous studies (Chowdhury and Lavelli, 2013; Kumar and Anand, 2017; Zheng et al., 2017) have demonstrated a positive effect of reducing the number of negative instances on this dataset, we have filtered out some instances from the training set relying only on the structure of the sentence, starting from the pairs of drugs with the same name. In addition to this case, we can filter out a candidate pair if the two drug mentions appear in a coordinate structure, checking the shortest dependency path between the two drug mentions. If they are not connected by a path, i.e. there is no grammatical relation between them, the candidate pair is filtered out.

While other works like Kumar and Anand (2017) and Liu et al. (2016) apply custom-made rules for this dataset (such as regular expressions), our choice is to keep the pre-processing phase as general as possible, defining an approach that can be applied to other relation extraction tasks.
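To make the steps of Section 4 and the filtering of Section 4.1 concrete, the following is a minimal sketch using spaCy and networkx. It is illustrative only and makes simplifying assumptions: single-token drug mentions, a hypothetical function name (preprocess_pair), and a coordinate structure approximated as a dependency path made only of conjunction-like relations; it is not the authors' released code.

```python
# Minimal sketch of the pre-processing of one candidate drug pair (illustrative only).
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess_pair(sentence, drug1, drug2):
    """Substitute drug mentions, compute the shortest dependency path,
    the offset features and the negative-filtering decision for one pair."""
    doc = nlp(sentence)
    same_name = drug1.lower() == drug2.lower()
    tokens = []
    for tok in doc:
        low = tok.text.lower()
        if low == drug1.lower() and same_name:
            tokens.append("NoPair")
        elif low == drug1.lower():
            tokens.append("PairDrug1")
        elif low == drug2.lower():
            tokens.append("PairDrug2")
        else:
            tokens.append(tok.text)

    # Dependency tree as an undirected graph: tokens are nodes, relations are edges.
    graph = nx.Graph()
    for tok in doc:
        for child in tok.children:
            graph.add_edge(tok.i, child.i, dep=child.dep_)

    try:
        d1 = tokens.index("PairDrug1")
        d2 = tokens.index("PairDrug2")
        sdp = nx.shortest_path(graph, d1, d2)
    except (ValueError, nx.NetworkXNoPath, nx.NodeNotFound):
        sdp = None  # same-name pair or no grammatical relation between the mentions

    # Negative instance filtering: same-name pairs, disconnected pairs and pairs
    # linked only by conjunction-like relations (our approximation of coordination).
    coordinate = sdp is not None and all(
        graph.edges[a, b]["dep"] in ("conj", "cc", "appos")
        for a, b in zip(sdp, sdp[1:]))
    keep = not same_name and sdp is not None and not coordinate

    # Offset features: relative distances from the two candidate drug mentions.
    n = len(tokens)
    offsets = [(abs(i - d1) / n, abs(i - d2) / n) for i in range(n)] if keep else None
    return tokens, sdp, offsets, keep
```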
5 Model description

In this section we present the LSTM-based model (Figure 1), the self-interaction attention mechanism and how it is used for relation extraction.

[Figure 1: Model architecture]

5.1 Embedding

Each word in our corpus is represented with a vector of length 200. These vectors are obtained by fine-tuning Word2Vec (Mikolov et al., 2013): we initialized a Word2Vec model with the vectors released by McDonald et al. (2018), who applied the same algorithm to PubMed abstracts and PMC texts, and then trained our Word2Vec model on the DDI-2013 corpus.

PoS tags are represented with vectors of length 4. These are obtained by applying the Word2Vec method to the sequence of PoS tags in our corpus.

5.2 Bidirectional LSTM layer

A Recurrent Neural Network is a deep learning model for processing sequential data, like natural language sentences. Its issues with vanishing gradients are avoided using LSTM cells (Hochreiter and Schmidhuber, 1997; Gers et al., 2000), which allow processing longer and more complex sequences. Given x_1, x_2, ..., x_m, h_{t-1} and c_{t-1}, where m is the length of the sentence, x_i ∈ R^d is the vector obtained by concatenating the embedded features, and h_{t-1} and c_{t-1} are the hidden state and the cell state of the previous LSTM cell (h_0 and c_0 are initialized as zero vectors), the new hidden state and cell state values are computed as follows:

ĉ_t = tanh(W_c [h_{t-1}, x_t] + b_c)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
c_t = i_t ∗ ĉ_t + f_t ∗ c_{t-1}
h_t = tanh(c_t) ∗ o_t

with σ being the sigmoid activation function and ∗ denoting the element-wise product. W_f, W_i, W_o, W_c ∈ R^{(N+d)×N} are weight matrices and b_f, b_i, b_o, b_c ∈ R^N are bias vectors. Weight matrices and bias vectors are randomly initialized and learned by the neural network during the training phase. N is the LSTM layer size and d is the dimension of the feature vector of each input word. The vectors in square brackets are concatenated.

A bidirectional LSTM processes the input sequence not only in sentence order but also backwards (Schuster and Paliwal, 1997). Hence, we can compute hr_t using the same equations described above but reversing the word sequence. Given h_t computed in sentence order and hr_t in reversed order, the output of the t-th bidirectional LSTM cell, hb_t, is the concatenation of h_t and hr_t.
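As a concrete reading of the equations above, the following is a minimal NumPy sketch of a single LSTM step (illustrative only; the model itself uses the Keras LSTM implementation, and the sketch stores the weight matrices transposed with respect to the dimensions given in the text, so that a left matrix-vector product yields a vector of size N).

```python
# Minimal NumPy sketch of one LSTM step implementing the equations of Section 5.2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update.
    x_t: input of size d; h_prev, c_prev: previous states of size N.
    W: dict of matrices of shape (N, N + d); b: dict of bias vectors of size N."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    c_t = i_t * c_tilde + f_t * c_prev       # new cell state
    h_t = np.tanh(c_t) * o_t                 # new hidden state
    return h_t, c_t

# A bidirectional layer runs the same recurrence on the reversed sequence and
# concatenates the two hidden states, i.e. hb_t = concat(h_t, hr_t).
d, N = 204, 64                               # e.g. 200 word dims + 4 PoS dims
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(N, N + d)) for k in "cifo"}
b = {k: np.zeros(N) for k in "cifo"}
h, c = np.zeros(N), np.zeros(N)              # h_0 and c_0 are zero vectors
for x in rng.normal(size=(10, d)):           # a toy sequence of 10 words
    h, c = lstm_step(x, h, c, W, b)
```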
5.3 Sentence representation and attention mechanisms

The LSTM layers produce, for each input word w_i, a vector h_i ∈ R^N which is the result of processing every word from the start of the sentence up to w_i. Hence, given a sentence of length m, h_m can be considered as the sentence representation produced by the LSTM layer. So, for a sentence classification task, h_m can be used as the input to a fully connected layer that provides the classification.

Even if they perform better than simple RNNs, LSTM neural networks have difficulties preserving dependencies between distant words (Raffel and Ellis, 2015) and, especially for long sentences, h_m may not be influenced by the first words or may be affected by less relevant words. The attention mechanism (Bahdanau et al., 2014; Kadlec et al., 2016) deals with these problems by taking into consideration each h_i, computing a weight α_i for each word contribution:

u_i = tanh(W_a h_i + b_a)
α_i = softmax(u_i) = exp(u_i) / Σ_{k=1}^{m} exp(u_k)

where W_a ∈ R^{N×N} and b_a ∈ R^N. The attention mechanism outputs the sentence representation

s = Σ_{i=1}^{m} α_i h_i

The Context Attention mechanism (Yang et al., 2016) is more complex. In order to enhance the importance of the words for the meaning of the sentence, it uses a word-level context vector u_w of additional weights for the calculation of α_i:

α_i = softmax(u_w^T u_i)

As proposed by Zheng et al. (2018), the Self-Interaction Attention mechanism uses multiple weight vectors v_i, one for each word w_i, instead of a single one. This way, we can extract the influence (called action) between the action controller w_i and the rest of the sentence, i.e. each w_k for k ∈ {1, ..., m}. The action of w_i is calculated as follows:

s_i = Σ_{k=1}^{m} α_{i,k} u_k
α_{i,k} = exp(v_k^T u_i) / Σ_{j=1}^{m} exp(v_j^T u_i)

with u_i defined in the same way as in the standard attention mechanism.

5.4 Model architecture

In order to obtain also in this case a single context vector representing the sentence, in Zheng et al. (2018) the vectors s_i are aggregated into a single vector s by taking their average, their maximum, or even by applying another standard attention layer. In our model we choose to avoid any pooling operation and instead concatenate all the s_i, creating a flattened representation (Du et al., 2018) which is passed to the classification layer.

The model designed (see Figure 1) and tested for the DDI-2013 relation extraction task includes the following layers: three parallel embedding layers, one with pre-trained word vectors, one with pre-trained PoS tag vectors and one that computes the embedding of the offset features; two bidirectional LSTM layers that process the word sequence; the self-interaction attention mechanism; and a fully connected layer with 5 neurons (one for each class) and softmax activation function that provides the classification.

In our experiments, we compare this model with similar configurations obtained by substituting the self-interaction attention with the standard attention layer introduced by Bahdanau et al. (2014) and with the context attention of Yang et al. (2016). A sketch of the self-interaction layer and of the overall architecture is given below.
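The following is a minimal Keras/TensorFlow sketch of a self-interaction attention layer and of the overall architecture of Sections 5.3 and 5.4. It is a sketch under stated assumptions, not the authors' released implementation: the class name SelfInteractionAttention, the hyper-parameters (MAX_LEN, the vocabulary sizes, the LSTM size of 64) and the dense projection used to embed the offset features are our placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class SelfInteractionAttention(layers.Layer):
    """Sketch of self-interaction attention: u_i = tanh(W_a h_i + b_a),
    alpha_{i,k} = softmax_k(v_k^T u_i) with one trainable vector v_k per
    position, and action s_i = sum_k alpha_{i,k} u_k."""

    def build(self, input_shape):
        m, n = int(input_shape[1]), int(input_shape[2])
        self.w_a = self.add_weight(name="w_a", shape=(n, n), initializer="glorot_uniform")
        self.b_a = self.add_weight(name="b_a", shape=(n,), initializer="zeros")
        self.v = self.add_weight(name="v", shape=(m, n), initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):
        u = tf.tanh(tf.matmul(h, self.w_a) + self.b_a)   # (batch, m, N)
        scores = tf.matmul(u, self.v, transpose_b=True)  # scores[:, i, k] = v_k^T u_i
        alpha = tf.nn.softmax(scores, axis=-1)           # normalise over k
        return tf.matmul(alpha, u)                       # s_i = sum_k alpha_{i,k} u_k

# Illustrative assembly of the architecture of Section 5.4; the sizes below
# are placeholders, not the values used in the paper.
MAX_LEN, VOCAB, POS_VOCAB = 100, 20000, 50

words = layers.Input(shape=(MAX_LEN,), name="words")
pos_tags = layers.Input(shape=(MAX_LEN,), name="pos_tags")
offsets = layers.Input(shape=(MAX_LEN, 2), name="offsets")   # D1 and D2

w_emb = layers.Embedding(VOCAB, 200)(words)       # would hold the fine-tuned Word2Vec vectors
p_emb = layers.Embedding(POS_VOCAB, 4)(pos_tags)  # would hold the PoS tag vectors
o_emb = layers.Dense(4)(offsets)                  # one way to embed the continuous offsets

x = layers.Concatenate()([w_emb, p_emb, o_emb])
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = SelfInteractionAttention()(x)
x = layers.Flatten()(x)                           # flattened (concatenated) s_i vectors
out = layers.Dense(5, activation="softmax")(x)    # one neuron per class

model = models.Model([words, pos_tags, offsets], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The final Flatten layer reproduces the design choice of Section 5.4: the action vectors s_i are concatenated rather than averaged or max-pooled before classification.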
6 Results and discussion

Our models are implemented using the Keras library with the TensorFlow backend. We perform a simple random hyper-parameter search (Bergstra and Bengio, 2012) in order to optimize the learning phase and avoid overfitting, using a subset of the sentences as validation set.

6.1 Evaluation

We have tested our two models with different input configurations: using only word vectors, using word and PoS tag vectors, or adding also the offset features.

In Table 1 we show the recall measure for each input configuration. The effect of self-interaction is also verified through the Friedman test (Friedman, 1937): for all input configurations, the model with self-interaction attention performs better than the other configurations with a confidence level of 99%. Similarly, the simple attention mechanism obtains better performance than the context attention with a confidence of 99% (see Figure 2).

In Table 2 we show the F-score for each class of the dataset. The overall performance of the configuration including word vectors, PoS tags and offset features as input is also reported in Table 3.

In Table 3 we compare our results with other state-of-the-art methods and compare the overall performance of the three attention mechanisms. Context-Att obtains results similar to those of most of the approaches based on Convolutional Neural Networks and worse than most of the LSTM-based models.

In terms of F-score, Word Attention LSTM (Zheng et al., 2017) outperforms our approach and the other LSTM-based models by more than 4%. As we discussed in Putelli et al. (2019), we have tried to replicate their model but we could not obtain the same results. Furthermore, their attention mechanism, aimed at creating a "candidate-drugs-oriented" input, did not improve the performance.

[Figure 2: Recall comparison for models with different attention mechanisms for Word+Tag+Offset. A solid arrow means 99% confidence, while a dashed arrow means 95%.]

Table 1: Overall recall (%) comparison with different attention mechanisms and input configurations. For each input configuration, the best recall is marked in bold.

Input           | No Attention | Context-Att | Attention | Self-Int-Att
Word            | 64.44        | 65.32       | 66.60     | 69.72
Word+Tag        | 65.37        | 65.20       | 67.57     | 68.95
Word+Tag+Offset | 60.67        | 65.82       | 69.47     | 70.88

Table 2: Detailed F-score comparison with different configurations and attention mechanisms. For each class, the best F-score is marked in bold.

                | Effect                           | Mechanism
Input           | No Att | C-Att | Att  | Self-Int | No Att | C-Att | Att  | Self-Int
Word            | 0.68   | 0.71  | 0.72 | 0.70     | 0.69   | 0.72  | 0.72 | 0.70
Word+Tag        | 0.67   | 0.70  | 0.70 | 0.69     | 0.71   | 0.73  | 0.74 | 0.70
Word+Tag+Offset | 0.65   | 0.70  | 0.70 | 0.69     | 0.68   | 0.73  | 0.74 | 0.76

                | Advise                           | Int
Input           | No Att | C-Att | Att  | Self-Int | No Att | C-Att | Att  | Self-Int
Word            | 0.77   | 0.71  | 0.74 | 0.78     | 0.53   | 0.49  | 0.45 | 0.45
Word+Tag        | 0.78   | 0.73  | 0.77 | 0.77     | 0.55   | 0.50  | 0.45 | 0.43
Word+Tag+Offset | 0.74   | 0.75  | 0.79 | 0.78     | 0.50   | 0.52  | 0.50 | 0.49

Table 3: Comparison with the overall precision (P), recall (R) and F-score (F) of other state-of-the-art methods, ordered by F. Our models are marked in bold; results higher than ours are marked in red.

Method          | P(%) | R(%) | F(%)
UTurku (SVM)    | 73.2 | 49.9 | 59.4
FBK-irst (SVM)  | 64.6 | 65.6 | 65.1
Zhao SCNN       | 72.5 | 65.1 | 68.6
Liu CNN         | 75.7 | 64.7 | 69.8
Multi-Channel   | 76.0 | 65.3 | 70.2
Context-Att     | 75.9 | 65.8 | 70.5
Joint-LSTMs     | 73.4 | 69.7 | 71.5
Self-Int        | 73.0 | 70.9 | 71.9
GRU             | 73.7 | 70.8 | 72.2
Attention       | 75.6 | 69.5 | 72.4
SDP-LSTM        | 74.1 | 71.8 | 72.9
Word-Att LSTM   | 78.4 | 76.2 | 77.3
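As an aside on the statistical comparison reported in Section 6.1, the following is a minimal SciPy sketch of the Friedman test, assuming one recall measurement per run is available for each attention variant. The numbers below are made up for illustration only and are not the paper's results.

```python
# Minimal sketch of the significance test of Section 6.1 (illustrative; the
# per-run recall values are hypothetical, not the paper's measurements).
from scipy.stats import friedmanchisquare

no_att   = [0.61, 0.64, 0.63, 0.60, 0.62]
ctx_att  = [0.65, 0.66, 0.64, 0.65, 0.66]
att      = [0.69, 0.68, 0.70, 0.69, 0.68]
self_int = [0.71, 0.70, 0.72, 0.70, 0.71]

stat, p_value = friedmanchisquare(no_att, ctx_att, att, self_int)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
# A p-value below 0.01 corresponds to the 99% confidence level quoted in the text.
```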
7 Conclusions and future work

We have compared the self-interaction attention model with alternative configurations using the standard attention mechanism introduced by Bahdanau et al. (2014) and the context-attention mechanism of Yang et al. (2016).

Our experiments show that the self-interaction mechanism improves the performance with respect to the other versions, in particular reducing the tendency to predict the majority class and hence decreasing the number of false negatives. The standard attention mechanism produces better results than the context attention.

As future work, our objective is to exploit or adapt the Transformer architecture (Vaswani et al., 2017), which has become quite popular for machine translation tasks and relies almost exclusively on attention mechanisms, and apply it to relation extraction tasks like DDI-2013.

Another direction includes the exploitation of a different pre-trained language model. For example, BioBERT (Lee et al., 2019) obtains good results for several NLP tasks like Named Entity Recognition and Question Answering, and we plan to apply it to our task.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473; accepted at ICLR 2015 as oral presentation.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February.

Jari Björne, Suwisa Kaewphan, and Tapio Salakoski. 2013. UTurku: Drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 651–659, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Md. Faisal Mahbub Chowdhury and Alberto Lavelli. 2013. FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, Georgia, USA, June 14-15, 2013, pages 351–355.

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. CoRR, abs/1809.00699.

Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701.

Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12:2451–2471.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780, December.

Dichao Hu. 2018. An introductory survey on attention mechanisms in NLP problems. CoRR, abs/1811.05544.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. CoRR, abs/1603.01547.

Sunil Kumar and Ashish Anand. 2017. Drug-drug interaction extraction from biomedical text using long short term memory network. CoRR, abs/1701.08303.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.

Shengyu Liu, Buzhou Tang, Qingcai Chen, and Xiaolong Wang. 2016. Drug-drug interaction extraction via convolutional neural networks. Computational and Mathematical Methods in Medicine, 2016.

Ryan McDonald, Georgios-Ioannis Brokos, and Ion Androutsopoulos. 2018. Deep relevance ranking using enhanced document-query interactions. CoRR, abs/1809.01682.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Luca Putelli, Alfonso E. Gerevini, Alberto Lavelli, and Ivan Serina. 2019. Applying self-interaction attention for extracting drug-drug interactions. In Proceedings of the 18th International Conference of the Italian Association for Artificial Intelligence.

Chanqin Quan, Lei Hua, Xiao Sun, and Wenjun Bai. 2016. Multichannel convolutional neural network for biological relation extraction. BioMed Research International, 2016.

Colin Raffel and Daniel P. W. Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. CoRR, abs/1512.08756.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2014. Lessons learnt from the DDIExtraction-2013 shared task. Journal of Biomedical Informatics, 51:152–164.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Gary Weiss and Foster Provost. 2001. The effect of class distribution on classifier learning: An empirical study. Technical report, Department of Computer Science, Rutgers University.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In HLT-NAACL.

Zibo Yi, Shasha Li, Jie Yu, Yusong Tan, Qingbo Wu, Hong Yuan, and Ting Wang. 2017. Drug-drug interaction extraction via recurrent neural network with multiple attention layers. In International Conference on Advanced Data Mining and Applications, pages 554–566. Springer.

Yijia Zhang, Wei Zheng, Hongfei Lin, Jian Wang, Zhihao Yang, and Michel Dumontier. 2018. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics, 34(5):828–835.

Wei Zheng, Hongfei Lin, Ling Luo, Zhehuan Zhao, Zhengguang Li, Yijia Zhang, Zhihao Yang, and Jian Wang. 2017. An attention-based effective neural model for drug-drug interactions extraction. BMC Bioinformatics, 18, December.
Jianming Zheng, Fei Cai, Taihua Shao, and Honghui Chen. 2018. Self-interaction attention mechanism- based text representation for document classifica- tion. Applied Sciences, 8(4).