The Impact of Self-Interaction Attention on the Extraction of Drug-Drug Interactions

Luca Putelli1,2, Alfonso E. Gerevini1, Alberto Lavelli2, Ivan Serina1
1 Università degli Studi di Brescia, 2 Fondazione Bruno Kessler
{alfonso.gerevini, ivan.serina}@unibs.it, {l.putelli, lavelli}@fbk.eu

Abstract

Since many medical treatments require the intake of multiple drugs, the discovery of how these drugs interact with each other, potentially causing health problems to patients, is the subject of a huge quantity of documents. In order to obtain this information from free text, several methods involving deep learning have been proposed over the years. In this paper we introduce a Recurrent Neural Network-based method combined with the Self-Interaction Attention Mechanism. The method is applied to the DDIExtraction-2013 task, a popular challenge concerning the extraction and classification of drug-drug interactions. Our focus is to show its effect on the tendency to predict the majority class and how it differs from other types of attention mechanisms.

1 Introduction

Given the increase of publications regarding side effects, adverse drug reactions and, more in general, how taking drugs can put patients' health at risk, a large quantity of free text containing crucial information has become available. For doctors and researchers, accessing this information is a very demanding task, given the number and the complexity of such documents.

Hence, the automatic extraction of Drug-Drug Interactions (DDI), i.e. situations where the simultaneous intake of drugs can cause adverse drug reactions, is the goal of the DDIExtraction-2013 task (Segura-Bedmar et al., 2014). DDIs have to be extracted from a corpus of free-text sentences, combining machine learning with natural language processing (NLP).

Starting from the introduction of word embedding techniques like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) for word representation, Recurrent Neural Networks (RNN), and in particular Long Short-Term Memory networks (LSTM), have become the state-of-the-art technology for most natural language processing tasks, such as text classification and relation extraction.

The main idea behind the attention mechanism (Bahdanau et al., 2014) is that the model "pays attention" only to the parts of the input where the most relevant information is present. In our case, this mechanism assigns a higher weight to the most influential words, i.e. the ones which describe an interaction between drugs.

Several attention mechanisms have been proposed in the last few years (Hu, 2018). In particular, the self-interaction mechanism (Zheng et al., 2018) applies attention with a different weight vector for each word in the sequence, producing a matrix that represents the influence between all word pairs. We consider this information very meaningful, especially in a task like this one, where we need to discover connections between pairs of words.

In this paper we show how self-interaction attention improves the results on the DDI-2013 task, comparing it to other types of attention mechanisms. Given that this dataset is strongly unbalanced, the main focus of the analysis is how each attention mechanism deals with the tendency to predict the majority class.
2 Related work

The best performing teams in the original DDI-2013 challenge (Segura-Bedmar et al., 2014) used SVMs (Björne et al., 2013) but, more recently, Convolutional Neural Networks (CNN) (Liu et al., 2016; Quan et al., 2016) and, mostly, Recurrent Neural Networks (RNN) have proved to be the new state of the art.

Kumar and Anand (2017) propose a double LSTM. The sentences are processed by two different bidirectional LSTM layers: one followed by a max-pooling layer and the other by a custom-made attention-pooling layer that assigns weights to words. Furthermore, Zhang et al. (2018) design a multi-path LSTM neural network. Three parallel bidirectional LSTM layers process the sentence sequence and a fourth one processes the shortest path between the two candidate drugs in the dependency tree. The output of these four layers is merged and handled by another bidirectional LSTM layer.

Zheng et al. (2017) apply attention directly to word vectors, creating a "candidate-drugs-oriented" input which is processed by a single LSTM layer.

Yi et al. (2017) use an RNN with Gated Recurrent Units (GRU) (Cho et al., 2014) instead of LSTM units, followed by a standard attention mechanism, and exploit the information contained in other sentences with a custom-made sentence attention mechanism.

Putelli et al. (2019) introduce an LSTM model followed by a self-interaction attention mechanism which computes, for each pair of words, a vector representing how much one is related to the other. These vectors are concatenated into a single one which is passed to a classification layer. In this paper, starting from the results reported in Putelli et al. (2019), we improve the input representation and the negative instance filtering, and we extend the analysis of self-interaction attention, comparing it to more standard attention mechanisms.

3 Dataset description

This dataset was released for the shared challenge SemEval 2013 - Task 9 (Segura-Bedmar et al., 2014) and contains annotated documents from the biomedical literature. In particular, there are two different sources: abstracts from MEDLINE research articles and texts from DrugBank.

Every document is divided into sentences and, for each sentence, the dataset provides annotations of every drug mentioned. The task requires to classify all the possible n(n-1)/2 pairs of the n drugs mentioned in the given sentences. The dataset provides the instances with their classification value.

There are five different classes: unrelated: there is no relation between the two drugs mentioned; effect: the text describes the effect of the drug-drug interaction; advise: the text recommends avoiding the simultaneous intake of the two drugs; mechanism: the text describes an anomaly in the absorption of a drug if it is taken simultaneously with another one; int: the text states a generic interaction between the drugs.

4 Pre-processing

The pre-processing phase exploits the "en_core_web_sm" model of spaCy (https://spacy.io), a Python tool for Natural Language Processing, and is composed of the following steps:

Substitution: after tokenization and POS-tagging, the drug mention tokens are replaced by the standard terms PairDrug1 and PairDrug2. In the particular case when the pair is composed of two mentions of the same drug, these are replaced by NoPair. Every other drug mentioned in the sentence is replaced with the generic name Drug.

Shortest dependency path: spaCy produces the dependency tree associated to the sentence, with tokens as nodes and dependency relations between the words as edges. Then, we calculate the shortest path in the dependency tree between PairDrug1 and PairDrug2.

Offset features: given a word w in the sentence, D1 is calculated as the distance (in terms of words) from the first drug mention, divided by the length of the sentence. Similarly, D2 is calculated as the distance from the second drug mention.

4.1 Negative instance filtering

The DDI-2013 dataset contains many "negative instances", i.e. instances that belong to the unrelated class. In an unbalanced dataset, machine learning algorithms are more likely to classify a new instance as the majority class, leading to poor performance for the minority classes (Weiss and Provost, 2001). Given that previous studies (Chowdhury and Lavelli, 2013; Kumar and Anand, 2017; Zheng et al., 2017) have demonstrated a positive effect of reducing the number of negative instances on this dataset, we have filtered out some instances from the training set relying only on the structure of the sentence, starting from the pairs of drugs with the same name. In addition to this case, we can filter out a candidate pair if the two drug mentions appear in a coordinate structure, checking the shortest dependency path between the two drug mentions. If they are not connected by a path, i.e. there is no grammatical relation between them, the candidate pair is filtered out.

While other works like Kumar and Anand (2017) and Liu et al. (2016) apply custom-made rules for this dataset (such as regular expressions), our choice is to keep the pre-processing phase as general as possible, defining an approach that can be applied to other relation extraction tasks.
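To make the steps of Section 4 and the filtering of Section 4.1 concrete, the following is a minimal sketch using spaCy and networkx. It is illustrative only and makes simplifying assumptions: single-token drug mentions, a hypothetical function name (preprocess_pair), and a coordinate structure approximated as a dependency path made only of conjunction-like relations; it is not the authors' released code.

```python
# Minimal sketch of the pre-processing of one candidate drug pair (illustrative only).
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess_pair(sentence, drug1, drug2):
    """Substitute drug mentions, compute the shortest dependency path,
    the offset features and the negative-filtering decision for one pair."""
    doc = nlp(sentence)
    same_name = drug1.lower() == drug2.lower()
    tokens = []
    for tok in doc:
        low = tok.text.lower()
        if low == drug1.lower() and same_name:
            tokens.append("NoPair")
        elif low == drug1.lower():
            tokens.append("PairDrug1")
        elif low == drug2.lower():
            tokens.append("PairDrug2")
        else:
            tokens.append(tok.text)

    # Dependency tree as an undirected graph: tokens are nodes, relations are edges.
    graph = nx.Graph()
    for tok in doc:
        for child in tok.children:
            graph.add_edge(tok.i, child.i, dep=child.dep_)

    try:
        d1 = tokens.index("PairDrug1")
        d2 = tokens.index("PairDrug2")
        sdp = nx.shortest_path(graph, d1, d2)
    except (ValueError, nx.NetworkXNoPath, nx.NodeNotFound):
        sdp = None  # same-name pair or no grammatical relation between the mentions

    # Negative instance filtering: same-name pairs, disconnected pairs and pairs
    # linked only by conjunction-like relations (our approximation of coordination).
    coordinate = sdp is not None and all(
        graph.edges[a, b]["dep"] in ("conj", "cc", "appos")
        for a, b in zip(sdp, sdp[1:]))
    keep = not same_name and sdp is not None and not coordinate

    # Offset features: relative distances from the two candidate drug mentions.
    n = len(tokens)
    offsets = [(abs(i - d1) / n, abs(i - d2) / n) for i in range(n)] if keep else None
    return tokens, sdp, offsets, keep
```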
5 Model description

In this section we present the LSTM-based model (Figure 1), the self-interaction attention mechanism and how it is used for relation extraction.

[Figure 1: Model architecture]

5.1 Embedding

Each word in our corpus is represented with a vector of length 200. These vectors are obtained by fine-tuning Word2Vec (Mikolov et al., 2013): we initialized a Word2Vec model with the vectors released by McDonald et al. (2018), who applied the same algorithm to PubMed abstracts and PMC texts, and then trained our Word2Vec model on the DDI-2013 corpus.

PoS tags are represented with vectors of length 4. These are obtained by applying the Word2Vec method to the sequence of PoS tags in our corpus.

5.2 Bidirectional LSTM layer

A Recurrent Neural Network is a deep learning model for processing sequential data, like natural language sentences. Its issues with vanishing gradients are avoided using LSTM cells (Hochreiter and Schmidhuber, 1997; Gers et al., 2000), which allow processing longer and more complex sequences. Given x_1, x_2, ..., x_m, h_{t-1} and c_{t-1}, where m is the length of the sentence, x_i ∈ R^d is the vector obtained by concatenating the embedded features, and h_{t-1} and c_{t-1} are the hidden state and the cell state of the previous LSTM cell (h_0 and c_0 are initialized as zero vectors), the new hidden state and cell state values are computed as follows:

ĉ_t = tanh(W_c [h_{t-1}, x_t] + b_c)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
c_t = i_t ∗ ĉ_t + f_t ∗ c_{t-1}
h_t = tanh(c_t) ∗ o_t

with σ being the sigmoid activation function and ∗ denoting the element-wise product. W_f, W_i, W_o, W_c ∈ R^{(N+d)×N} are weight matrices and b_f, b_i, b_o, b_c ∈ R^N are bias vectors. Weight matrices and bias vectors are randomly initialized and learned by the neural network during the training phase. N is the LSTM layer size and d is the dimension of the feature vector of each input word. The vectors in square brackets are concatenated.

A bidirectional LSTM processes the input sequence not only in sentence order but also backwards (Schuster and Paliwal, 1997). Hence, we can compute hr_t using the same equations described above but reversing the word sequence. Given h_t computed in sentence order and hr_t in reversed order, the output of the t-th bidirectional LSTM cell, hb_t, is the concatenation of h_t and hr_t.
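As a concrete reading of the equations above, the following is a minimal NumPy sketch of a single LSTM step (illustrative only; the model itself uses the Keras LSTM implementation, and the sketch stores the weight matrices transposed with respect to the dimensions given in the text, so that a left matrix-vector product yields a vector of size N).

```python
# Minimal NumPy sketch of one LSTM step implementing the equations of Section 5.2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update.
    x_t: input of size d; h_prev, c_prev: previous states of size N.
    W: dict of matrices of shape (N, N + d); b: dict of bias vectors of size N."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    c_t = i_t * c_tilde + f_t * c_prev       # new cell state
    h_t = np.tanh(c_t) * o_t                 # new hidden state
    return h_t, c_t

# A bidirectional layer runs the same recurrence on the reversed sequence and
# concatenates the two hidden states, i.e. hb_t = concat(h_t, hr_t).
d, N = 204, 64                               # e.g. 200 word dims + 4 PoS dims
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(N, N + d)) for k in "cifo"}
b = {k: np.zeros(N) for k in "cifo"}
h, c = np.zeros(N), np.zeros(N)              # h_0 and c_0 are zero vectors
for x in rng.normal(size=(10, d)):           # a toy sequence of 10 words
    h, c = lstm_step(x, h, c, W, b)
```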
5.3 Sentence representation and attention mechanisms

The LSTM layers produce, for each input word w_i, a vector h_i ∈ R^N which is the result of processing every word from the start of the sentence up to w_i. Hence, given a sentence of length m, h_m can be considered as the sentence representation produced by the LSTM layer. So, for a sentence classification task, h_m can be used as the input to a fully connected layer that provides the classification.

Even if they perform better than simple RNNs, LSTM neural networks have difficulties preserving dependencies between distant words (Raffel and Ellis, 2015) and, especially for long sentences, h_m may not be influenced by the first words or may be affected by less relevant words. The attention mechanism (Bahdanau et al., 2014; Kadlec et al., 2016) deals with these problems by taking into consideration each h_i, computing a weight α_i for each word contribution:

u_i = tanh(W_a h_i + b_a)
α_i = softmax(u_i) = exp(u_i) / Σ_{k=1}^{m} exp(u_k)

where W_a ∈ R^{N×N} and b_a ∈ R^N. The attention mechanism outputs the sentence representation

s = Σ_{i=1}^{m} α_i h_i

The Context Attention mechanism (Yang et al., 2016) is more complex. In order to enhance the importance of the words for the meaning of the sentence, it uses a word-level context vector u_w of additional weights for the calculation of α_i:

α_i = softmax(u_w^T u_i)

As proposed by Zheng et al. (2018), the Self-Interaction Attention mechanism uses multiple weight vectors v_i, one for each word w_i, instead of a single one. This way, we can extract the influence (called action) between the action controller w_i and the rest of the sentence, i.e. each w_k for k ∈ {1, ..., m}. The action of w_i is calculated as follows:

s_i = Σ_{k=1}^{m} α_{i,k} u_k
α_{i,k} = exp(v_k^T u_i) / Σ_{j=1}^{m} exp(v_j^T u_i)

with u_i defined in the same way as in the standard attention mechanism.

5.4 Model architecture

In order to obtain also in this case a single context vector representing the sentence, in Zheng et al. (2018) the vectors s_i are aggregated into a single vector s by taking their average, their maximum, or even by applying another standard attention layer. In our model we choose to avoid any pooling operation and instead concatenate all the s_i, creating a flattened representation (Du et al., 2018) which is passed to the classification layer.

The model designed (see Figure 1) and tested for the DDI-2013 relation extraction task includes the following layers: three parallel embedding layers, one with pre-trained word vectors, one with pre-trained PoS tag vectors and one that computes the embedding of the offset features; two bidirectional LSTM layers that process the word sequence; the self-interaction attention mechanism; and a fully connected layer with 5 neurons (one for each class) and softmax activation function that provides the classification.

In our experiments, we compare this model with similar configurations obtained by substituting the self-interaction attention with the standard attention layer introduced by Bahdanau et al. (2014) and with the context attention of Yang et al. (2016). A sketch of the self-interaction layer and of the overall architecture is given below.
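The following is a minimal Keras/TensorFlow sketch of a self-interaction attention layer and of the overall architecture of Sections 5.3 and 5.4. It is a sketch under stated assumptions, not the authors' released implementation: the class name SelfInteractionAttention, the hyper-parameters (MAX_LEN, the vocabulary sizes, the LSTM size of 64) and the dense projection used to embed the offset features are our placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class SelfInteractionAttention(layers.Layer):
    """Sketch of self-interaction attention: u_i = tanh(W_a h_i + b_a),
    alpha_{i,k} = softmax_k(v_k^T u_i) with one trainable vector v_k per
    position, and action s_i = sum_k alpha_{i,k} u_k."""

    def build(self, input_shape):
        m, n = int(input_shape[1]), int(input_shape[2])
        self.w_a = self.add_weight(name="w_a", shape=(n, n), initializer="glorot_uniform")
        self.b_a = self.add_weight(name="b_a", shape=(n,), initializer="zeros")
        self.v = self.add_weight(name="v", shape=(m, n), initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):
        u = tf.tanh(tf.matmul(h, self.w_a) + self.b_a)   # (batch, m, N)
        scores = tf.matmul(u, self.v, transpose_b=True)  # scores[:, i, k] = v_k^T u_i
        alpha = tf.nn.softmax(scores, axis=-1)           # normalise over k
        return tf.matmul(alpha, u)                       # s_i = sum_k alpha_{i,k} u_k

# Illustrative assembly of the architecture of Section 5.4; the sizes below
# are placeholders, not the values used in the paper.
MAX_LEN, VOCAB, POS_VOCAB = 100, 20000, 50

words = layers.Input(shape=(MAX_LEN,), name="words")
pos_tags = layers.Input(shape=(MAX_LEN,), name="pos_tags")
offsets = layers.Input(shape=(MAX_LEN, 2), name="offsets")   # D1 and D2

w_emb = layers.Embedding(VOCAB, 200)(words)       # would hold the fine-tuned Word2Vec vectors
p_emb = layers.Embedding(POS_VOCAB, 4)(pos_tags)  # would hold the PoS tag vectors
o_emb = layers.Dense(4)(offsets)                  # one way to embed the continuous offsets

x = layers.Concatenate()([w_emb, p_emb, o_emb])
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = SelfInteractionAttention()(x)
x = layers.Flatten()(x)                           # flattened (concatenated) s_i vectors
out = layers.Dense(5, activation="softmax")(x)    # one neuron per class

model = models.Model([words, pos_tags, offsets], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The final Flatten layer reproduces the design choice of Section 5.4: the action vectors s_i are concatenated rather than averaged or max-pooled before classification.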
6 Results and discussion

Our models are implemented using the Keras library with the TensorFlow backend. We perform a simple random hyper-parameter search (Bergstra and Bengio, 2012) in order to optimize the learning phase and avoid overfitting, using a subset of the sentences as validation set.

6.1 Evaluation

We have tested our two models with different input configurations: using only word vectors, using word and PoS tag vectors, or adding also the offset features.

In Table 1 we show the recall measure for each input configuration. The effect of self-interaction is also verified through the Friedman test (Friedman, 1937): for all input configurations, the model with self-interaction attention performs better than the other configurations with a confidence level of 99%. Similarly, the simple attention mechanism obtains better performance than the context attention with a confidence of 99% (see Figure 2).

In Table 2 we show the F-score for each class of the dataset. The overall performance of the configuration including word vectors, PoS tags and offset features as input is also reported in Table 3.

In Table 3 we compare our results with other state-of-the-art methods and compare the overall performance of the three attention mechanisms. Context-Att obtains results similar to those of most of the approaches based on Convolutional Neural Networks and worse than most of the LSTM-based models.

In terms of F-score, Word Attention LSTM (Zheng et al., 2017) outperforms our approach and the other LSTM-based models by more than 4%. As we discussed in Putelli et al. (2019), we have tried to replicate their model but we could not obtain the same results. Furthermore, their attention mechanism, aimed at creating a "candidate-drugs-oriented" input, did not improve the performance.

[Figure 2: Recall comparison for models with different attention mechanisms for Word+Tag+Offset. A solid arrow means 99% confidence, while a dashed arrow means 95%.]

Table 1: Overall recall (%) comparison with different attention mechanisms and input configurations. For each input configuration, the best recall is marked in bold.

Input           | No Attention | Context-Att | Attention | Self-Int-Att
Word            | 64.44        | 65.32       | 66.60     | 69.72
Word+Tag        | 65.37        | 65.20       | 67.57     | 68.95
Word+Tag+Offset | 60.67        | 65.82       | 69.47     | 70.88

Table 2: Detailed F-score comparison with different configurations and attention mechanisms. For each class, the best F-score is marked in bold.

                | Effect                           | Mechanism
Input           | No Att | C-Att | Att  | Self-Int | No Att | C-Att | Att  | Self-Int
Word            | 0.68   | 0.71  | 0.72 | 0.70     | 0.69   | 0.72  | 0.72 | 0.70
Word+Tag        | 0.67   | 0.70  | 0.70 | 0.69     | 0.71   | 0.73  | 0.74 | 0.70
Word+Tag+Offset | 0.65   | 0.70  | 0.70 | 0.69     | 0.68   | 0.73  | 0.74 | 0.76

                | Advise                           | Int
Input           | No Att | C-Att | Att  | Self-Int | No Att | C-Att | Att  | Self-Int
Word            | 0.77   | 0.71  | 0.74 | 0.78     | 0.53   | 0.49  | 0.45 | 0.45
Word+Tag        | 0.78   | 0.73  | 0.77 | 0.77     | 0.55   | 0.50  | 0.45 | 0.43
Word+Tag+Offset | 0.74   | 0.75  | 0.79 | 0.78     | 0.50   | 0.52  | 0.50 | 0.49

Table 3: Comparison with the overall precision (P), recall (R) and F-score (F) of other state-of-the-art methods, ordered by F. Our models are marked in bold; results higher than ours are marked in red.

Method          | P(%) | R(%) | F(%)
UTurku (SVM)    | 73.2 | 49.9 | 59.4
FBK-irst (SVM)  | 64.6 | 65.6 | 65.1
Zhao SCNN       | 72.5 | 65.1 | 68.6
Liu CNN         | 75.7 | 64.7 | 69.8
Multi-Channel   | 76.0 | 65.3 | 70.2
Context-Att     | 75.9 | 65.8 | 70.5
Joint-LSTMs     | 73.4 | 69.7 | 71.5
Self-Int        | 73.0 | 70.9 | 71.9
GRU             | 73.7 | 70.8 | 72.2
Attention       | 75.6 | 69.5 | 72.4
SDP-LSTM        | 74.1 | 71.8 | 72.9
Word-Att LSTM   | 78.4 | 76.2 | 77.3
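As an aside on the statistical comparison reported in Section 6.1, the following is a minimal SciPy sketch of the Friedman test, assuming one recall measurement per run is available for each attention variant. The numbers below are made up for illustration only and are not the paper's results.

```python
# Minimal sketch of the significance test of Section 6.1 (illustrative; the
# per-run recall values are hypothetical, not the paper's measurements).
from scipy.stats import friedmanchisquare

no_att   = [0.61, 0.64, 0.63, 0.60, 0.62]
ctx_att  = [0.65, 0.66, 0.64, 0.65, 0.66]
att      = [0.69, 0.68, 0.70, 0.69, 0.68]
self_int = [0.71, 0.70, 0.72, 0.70, 0.71]

stat, p_value = friedmanchisquare(no_att, ctx_att, att, self_int)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
# A p-value below 0.01 corresponds to the 99% confidence level quoted in the text.
```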
7 Conclusions and future work

We have compared the self-interaction attention model with alternative configurations using the standard attention mechanism introduced by Bahdanau et al. (2014) and the context-attention mechanism of Yang et al. (2016).

Our experiments show that the self-interaction mechanism improves the performance with respect to the other versions, in particular reducing the tendency to predict the majority class and hence decreasing the number of false negatives. The standard attention mechanism produces better results than the context attention.

As future work, our objective is to exploit or adapt the Transformer architecture (Vaswani et al., 2017), which has become quite popular for machine translation tasks and relies almost exclusively on attention mechanisms, and apply it to relation extraction tasks like DDI-2013.

Another direction includes the exploitation of a different pre-trained language model. For example, BioBERT (Lee et al., 2019) obtains good results for several NLP tasks like Named Entity Recognition and Question Answering, and we plan to apply it to our task.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473; accepted at ICLR 2015 as oral presentation.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February.

Jari Björne, Suwisa Kaewphan, and Tapio Salakoski. 2013. UTurku: Drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 651–659, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Md. Faisal Mahbub Chowdhury and Alberto Lavelli. 2013. FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, Georgia, USA, June 14-15, 2013, pages 351–355.

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. CoRR, abs/1809.00699.

Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701.

Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12:2451–2471.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780, December.

Dichao Hu. 2018. An introductory survey on attention mechanisms in NLP problems. CoRR, abs/1811.05544.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. CoRR, abs/1603.01547.

Sunil Kumar and Ashish Anand. 2017. Drug-drug interaction extraction from biomedical text using long short term memory network. CoRR, abs/1701.08303.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.

Shengyu Liu, Buzhou Tang, Qingcai Chen, and Xiaolong Wang. 2016. Drug-drug interaction extraction via convolutional neural networks. Computational and Mathematical Methods in Medicine, 2016.

Ryan McDonald, Georgios-Ioannis Brokos, and Ion Androutsopoulos. 2018. Deep relevance ranking using enhanced document-query interactions. CoRR, abs/1809.01682.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Luca Putelli, Alfonso E. Gerevini, Alberto Lavelli, and Ivan Serina. 2019. Applying self-interaction attention for extracting drug-drug interactions. In Proceedings of the 18th International Conference of the Italian Association for Artificial Intelligence.

Chanqin Quan, Lei Hua, Xiao Sun, and Wenjun Bai. 2016. Multichannel convolutional neural network for biological relation extraction. BioMed Research International, 2016.

Colin Raffel and Daniel P. W. Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. CoRR, abs/1512.08756.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2014. Lessons learnt from the DDIExtraction-2013 shared task. Journal of Biomedical Informatics, 51:152–164.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Gary Weiss and Foster Provost. 2001. The effect of class distribution on classifier learning: An empirical study. Technical report, Department of Computer Science, Rutgers University.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In HLT-NAACL.

Zibo Yi, Shasha Li, Jie Yu, Yusong Tan, Qingbo Wu, Hong Yuan, and Ting Wang. 2017. Drug-drug interaction extraction via recurrent neural network with multiple attention layers. In International Conference on Advanced Data Mining and Applications, pages 554–566. Springer.

Yijia Zhang, Wei Zheng, Hongfei Lin, Jian Wang, Zhihao Yang, and Michel Dumontier. 2018. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics, 34(5):828–835.

Wei Zheng, Hongfei Lin, Ling Luo, Zhehuan Zhao, Zhengguang Li, Yijia Zhang, Zhihao Yang, and Jian Wang. 2017. An attention-based effective neural model for drug-drug interactions extraction. BMC Bioinformatics, 18, December.
Jianming Zheng, Fei Cai, Taihua Shao, and Honghui Chen. 2018. Self-interaction attention mechanism- based text representation for document classifica- tion. Applied Sciences, 8(4).