<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Impact of Self-Interaction Attention on the Extraction of Drug-Drug Interactions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Putelli</string-name>
          <email>l.putelli@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfonso E. Gerevini</string-name>
          <email>alfonso.gerevini@unibs.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Lavelli</string-name>
          <email>lavelli@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
          <email>ivan.serina@unibs.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Brescia</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Since many medical treatments require the intake of multiple drugs, the discovery of how these drugs interact with each other, potentially causing health problems to patients, is the subject of a huge number of documents. In order to obtain this information from free text, several methods involving deep learning have been proposed over the years. In this paper we introduce a Recurrent Neural Network-based method combined with the Self-Interaction Attention mechanism. This method is applied to the DDIExtraction2013 task, a popular challenge concerning the extraction and the classification of drug-drug interactions. Our focus is to show its effect on the tendency to predict the majority class and how it differs from other types of attention mechanisms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Given the increasing number of publications regarding side
effects, adverse drug reactions and, more
generally, how taking drugs can put the health of
patients at risk, a large
quantity of free text containing crucial
information has become available. For doctors and
researchers, accessing this information is a very
demanding task, given the number and the
complexity of such documents.</p>
      <p>
        Hence, the automatic extraction of Drug-Drug
Interactions (DDI), i.e. situations where the
simultaneous intake of drugs can cause adverse
drug reactions, is the goal of the
DDIExtraction2013 task
        <xref ref-type="bibr" rid="ref21">(Segura-Bedmar et al., 2014)</xref>
        . DDIs have to be extracted from a corpus of free-text
sentences, combining machine learning with natural
language processing (NLP).
      </p>
      <p>
        Starting from the introduction of word
embedding techniques like Word2Vec
        <xref ref-type="bibr" rid="ref15">(Mikolov et al.,
2013)</xref>
        and GloVe
        <xref ref-type="bibr" rid="ref16">(Pennington et al., 2014)</xref>
        for
word representation, Recurrent Neural Networks
(RNN) and in particular Long Short-Term
Memory networks (LSTM) have become the
state-of-the-art technology for most natural language
processing tasks, such as text classification and relation
extraction.
      </p>
      <p>
        The main idea behind the attention mechanism
        <xref ref-type="bibr" rid="ref1">(Bahdanau et al., 2014)</xref>
        is that the model “pays
attention” only to the parts of the input where
the most relevant information is present. In our
case, this mechanism assigns a higher weight to
the most influential words, i.e. the ones which
describe an interaction between drugs.
      </p>
      <p>
        Several attention mechanisms have been
proposed in the last few years
        <xref ref-type="bibr" rid="ref10">(Hu, 2018)</xref>
        ; in
particular, the self-interaction attention mechanism
        <xref ref-type="bibr" rid="ref27 ref29">(Zheng et al., 2018)</xref>
        applies attention with a different weight vector for
each word in the sequence, producing a matrix that
represents the influence between all word pairs.
We consider this information very meaningful,
especially in a task like this one where we need to
discover connections between pairs of words.
      </p>
      <p>In this paper we show how self-interaction
attention improves the results in the DDI-2013 task,
comparing it to other types of attention
mechanisms. Given that this dataset is strongly
unbalanced, the main focus of the analysis is how each
attention mechanism deals with the tendency to
predict the majority class.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        The best performing teams in the DDI-2013
original challenge
        <xref ref-type="bibr" rid="ref21">(Segura-Bedmar et al., 2014)</xref>
        used
SVM (Björne et al., 2013) but, more recently,
Convolutional Neural Networks (CNN)
        <xref ref-type="bibr" rid="ref13">(Liu et al.,
2016)</xref>
        ,
        <xref ref-type="bibr" rid="ref18">(Quan et al., 2016)</xref>
        and especially Recurrent
Neural Networks (RNN) have proved to be the
new state of the art.
      </p>
      <p>Kumar and Anand (2017) propose a double
LSTM. The sentences are processed by two
different bidirectional LSTM layers: one followed by a
max-pooling layer and the other one by a custom-made
attention-pooling layer that assigns weights
to words. Furthermore, Zhang et al. (2018) design
a multi-path LSTM neural network. Three
parallel bidirectional LSTM layers process the sentence
sequence and a fourth one processes the shortest
dependency path between the two candidate drugs
in the dependency tree. The output of these four
layers is merged and handled by another
bidirectional LSTM layer.</p>
      <p>Zheng et al. (2017) apply attention directly
to word vectors, creating a
“candidate-drugs-oriented” input which is processed by a single
LSTM layer.</p>
      <p>
        Yi et al. (2017) use a RNN with Gated
Recurrent Units (GRU)
        <xref ref-type="bibr" rid="ref1 ref4">(Cho et al., 2014)</xref>
          instead
of LSTM units, followed by a standard attention
mechanism, and exploit information contained in
other sentences with a custom-made sentence
attention mechanism.
      </p>
      <p>Putelli et al. (2019) introduce an LSTM model
followed by a self-interaction attention
mechanism which computes, for each pair of words, a
vector representing how much one word is related to the
other. These vectors are concatenated into a
single one which is passed to a classification layer.
In this paper, starting from the results reported in
Putelli et al. (2019), we improve the input
representation and the negative instance filtering, and extend the
analysis of self-interaction attention, comparing it
to more standard attention mechanisms.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset description</title>
      <p>
        This dataset was released for the shared challenge
SemEval 2013 - Task 9
        <xref ref-type="bibr" rid="ref21">(Segura-Bedmar et al.,
2014)</xref>
        and contains annotated documents from the
biomedical literature. In particular, there are two
different sources: abstracts from MEDLINE
research articles and texts from DrugBank.
      </p>
      <p>Every document is divided into sentences and,
for each sentence, the dataset provides annotations
of every drug mentioned. The task requires
classifying all the possible $n(n-1)/2$ pairs of the $n$ drugs mentioned
in a given sentence. The dataset provides the
instances with their classification value.</p>
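      <p>As an illustration of this setting, the following sketch (with a hypothetical, simplified annotation format) enumerates the candidate pairs of a sentence using Python's standard library:</p>
      <preformat>
from itertools import combinations

# Hypothetical, simplified annotation: each sentence carries the list
# of drug mentions found in it, as in the DDI-2013 XML files.
sentence = {
    "text": "Aspirin may increase the effect of warfarin.",
    "drugs": ["Aspirin", "warfarin"],
}

# Every unordered pair of the n mentions is a candidate instance,
# i.e. n(n-1)/2 pairs, each to be classified into one of the five
# classes described below.
candidate_pairs = list(combinations(sentence["drugs"], 2))
print(candidate_pairs)  # [('Aspirin', 'warfarin')]
      </preformat>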
      <p>There are five different classes: unrelated:
there is no relation between the two drugs
mentioned; effect: the text describes the effect of
the drug-drug interaction; advise: the text
recommends avoiding the simultaneous intake
of two drugs; mechanism: the text describes an
anomaly in the absorption of a drug, if taken
simultaneously with another one; int: the text states
a generic interaction between the drugs.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Pre-processing</title>
      <p>The pre-processing phase exploits the
“en_core_web_sm” model of spaCy (https://spacy.io), a Python
tool for Natural Language Processing, and it is
composed of the following steps:</p>
      <p>Substitution: after tokenization and
PoS tagging, the drug mention tokens are
replaced by the standard terms PairDrug1 and
PairDrug2. In the particular case when the pair
is composed of two mentions of the same drug,
these are replaced by NoPair. Every other drug
mentioned in the sentence is replaced with the
generic name Drug.</p>
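      <p>A minimal sketch of this substitution step, assuming the drug mentions are given as character spans as in the DDI-2013 annotations (the function and the example below are illustrative):</p>
      <preformat>
def substitute(text, pair, other_mentions):
    """Replace the two candidate mentions with PairDrug1/PairDrug2
    (or NoPair if they mention the same drug), and every other
    mention with the generic token Drug. Spans are (start, end)."""
    same_name = (text[pair[0][0]:pair[0][1]].lower()
                 == text[pair[1][0]:pair[1][1]].lower())
    labels = ["NoPair", "NoPair"] if same_name else ["PairDrug1", "PairDrug2"]
    replacements = list(zip(pair, labels)) + [(m, "Drug") for m in other_mentions]
    # Apply right-to-left so earlier character offsets stay valid.
    for (start, end), label in sorted(replacements, reverse=True):
        text = text[:start] + label + text[end:]
    return text

print(substitute("Aspirin may increase the effect of warfarin.",
                 [(0, 7), (35, 43)], []))
# PairDrug1 may increase the effect of PairDrug2.
      </preformat>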
      <sec id="sec-4-1">
        <title>Shortest dependency path: spaCy produces</title>
        <p>the dependency tree associated to the sentence,
with tokens as nodes and dependency relations
between the words as edges. Then, we calculate
the shortest path in the dependency tree between
PairDrug1 and PairDrug2.</p>
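      <p>The paper does not prescribe an implementation for this step; a common way to compute it is to turn the spaCy dependency tree into an undirected graph and run a shortest-path search, for instance with networkx:</p>
      <preformat>
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence, source, target):
    """Shortest path between two tokens (identified by their text,
    e.g. PairDrug1 and PairDrug2) in the undirected dependency graph.
    Raises nx.NetworkXNoPath if the tokens are not connected."""
    doc = nlp(sentence)
    edges = [(token.i, child.i) for token in doc for child in token.children]
    graph = nx.Graph(edges)
    position = {token.text: token.i for token in doc}
    path = nx.shortest_path(graph, position[source], position[target])
    return [doc[i].text for i in path]

print(shortest_dependency_path(
    "PairDrug1 may increase the effect of PairDrug2.",
    "PairDrug1", "PairDrug2"))
      </preformat>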
      <p>Offset features: given a word w in the
sentence, D1 is calculated as the distance (in terms of
words) from the first drug mention, divided by the
length of the sentence. Similarly, D2 is calculated
as the distance from the second drug mention.</p>
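      <p>A small sketch of the computation of D1 and D2 as defined above (token positions stand in for full mention spans, for brevity):</p>
      <preformat>
def offset_features(tokens, idx1, idx2):
    """For each word, the distances (in words) from the two candidate
    drug mentions at positions idx1 and idx2, divided by the sentence
    length, i.e. the features D1 and D2 defined above."""
    m = len(tokens)
    return [(abs(i - idx1) / m, abs(i - idx2) / m) for i in range(m)]

tokens = "PairDrug1 may increase the effect of PairDrug2 .".split()
print(offset_features(tokens, 0, 6))
      </preformat>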
      <sec id="sec-4-2">
        <title>4.1 Negative instance filtering</title>
        <p>
          The DDI-2013 dataset contains many “negative
instances", i.e. instances that belong to the
unrelated class. In an unbalanced dataset, machine
learning algorithms are more likely to classify a
new instance over the majority class, leading to
poor performance for the minority classes
          <xref ref-type="bibr" rid="ref24">(Weiss
and Provost, 2001)</xref>
          . Given that previous
studies
          <xref ref-type="bibr" rid="ref11 ref21 ref28 ref3 ref5">(Chowdhury and Lavelli, 2013; Kumar and
Anand, 2017; Zheng et al., 2017)</xref>
          have
demonstrated a positive effect of reducing the number
of negative instances on this dataset, we have
filtered out some instances from the training set,
relying only on the structure of the sentence, starting
from the pairs of drugs with the same name. In
addition to this case, we can filter out a candidate
pair if the two drug mentions appear in a coordinate
structure, checking the shortest dependency path
between the two drug mentions. If they are not
connected by a path, i.e. there is no grammatical
relation between them, the candidate pair is filtered
out.
        </p>
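        <p>A sketch of these structural filtering rules, combining the same-name check, the connectivity check and the coordinate-structure check on the dependency graph; reading a coordinate structure as a shortest dependency path made only of conj/cc/punct links is our assumption about how that rule can be implemented:</p>
        <preformat>
import networkx as nx

def filter_negative(doc, i1, i2):
    """Return True if the candidate pair at token positions i1, i2 of
    the spaCy doc can be discarded using only sentence structure."""
    # Rule 1: the two mentions have the same name.
    if doc[i1].text.lower() == doc[i2].text.lower():
        return True
    edges = [(t.i, c.i) for t in doc for c in t.children]
    graph = nx.Graph(edges)
    # Rule 2: no grammatical relation between the two mentions.
    if not nx.has_path(graph, i1, i2):
        return True
    # Rule 3: the mentions appear in a coordinate structure, i.e. the
    # shortest dependency path uses only coordination links.
    path = nx.shortest_path(graph, i1, i2)
    deps = {doc[i].dep_ for i in path[1:]}
    return deps.issubset({"conj", "cc", "punct"})
        </preformat>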
        <p>
          While other works like
          <xref ref-type="bibr" rid="ref11">(Kumar and Anand,
2017)</xref>
          and
          <xref ref-type="bibr" rid="ref13">(Liu et al., 2016)</xref>
          apply custom-made
rules for this dataset (such as regular expressions),
our choice is to keep the pre-processing phase as
general as possible, defining an approach that can
be applied to other relation extraction tasks.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Model description</title>
      <p>In this section we present the LSTM-based
model (Figure 1), the self-attention mechanism
and how it is used for relation extraction.</p>
      <sec id="sec-5-1">
        <title>5.1 Embedding</title>
        <p>
          Each word in our corpus is represented with a
vector of length 200. These vectors are obtained with
a Word2Vec
          <xref ref-type="bibr" rid="ref15">(Mikolov et al., 2013)</xref>
          fine-tuning.
We initialized a Word2Vec model with the
vectors obtained by the authors of McDonald et al.
(2018) by applying the same algorithm over PubMed abstracts
and PMC texts, and then trained our Word2Vec model
using the DDI-2013 corpus.
        </p>
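        <p>A possible implementation of this fine-tuning with gensim is sketched below; the file name for the pre-trained vectors of McDonald et al. (2018) and the exact loading calls are assumptions, and the corpus is reduced to a toy example:</p>
        <preformat>
from gensim.models import KeyedVectors, Word2Vec

# Toy stand-in for the tokenized DDI-2013 sentences.
ddi_sentences = [["PairDrug1", "may", "increase",
                  "the", "effect", "of", "PairDrug2"]]

# Placeholder file for the PubMed/PMC vectors of McDonald et al. (2018).
pretrained = KeyedVectors.load_word2vec_format("pubmed_pmc.bin", binary=True)

model = Word2Vec(vector_size=200, min_count=1)
model.build_vocab(ddi_sentences)

# Initialize every word that also appears in the DDI-2013 vocabulary
# with its pre-trained vector, then continue training on the corpus.
for word, idx in model.wv.key_to_index.items():
    if word in pretrained:
        model.wv.vectors[idx] = pretrained[word]

model.train(ddi_sentences, total_examples=model.corpus_count, epochs=10)
        </preformat>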
        <p>PoS tags are represented with vectors of length
4. These are obtained applying the Word2Vec
method to the sequence of PoS tags in our corpus.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Bidirectional LSTM layer</title>
        <p>
          A Recurrent Neural Network is a deep learning
model for processing sequential data, like
natural language sentences. Its issues with vanishing
gradients are avoided using LSTM cells
          <xref ref-type="bibr" rid="ref20 ref8 ref9">(Hochreiter and Schmidhuber, 1997; Gers et al., 2000)</xref>
          ,
which allow processing longer and more complex
sequences. Given $x_1, x_2, \dots, x_m$, $h_{t-1}$ and $c_{t-1}$,
where $m$ is the length of the sentence, $x_i \in \mathbb{R}^d$
is the vector obtained by concatenating the
embedded features, and $h_{t-1}$ and $c_{t-1}$ are the hidden state
and the cell state of the previous LSTM cell ($h_0$
and $c_0$ are initialized as zero vectors), the new hidden
state and cell state values are computed as follows:
$\hat{c}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$
$c_t = i_t \odot \hat{c}_t + f_t \odot c_{t-1}$
$h_t = \tanh(c_t) \odot o_t$
        </p>
        <p>with $\sigma$ being the sigmoid activation function and
$\odot$ denoting the element-wise product. $W_f$, $W_i$, $W_o$,
$W_c \in \mathbb{R}^{(N+d) \times N}$ are weight matrices and $b_f$, $b_i$,
$b_o$, $b_c \in \mathbb{R}^N$ are bias vectors. Weight matrices and
bias vectors are randomly initialized and learned
by the neural network during the training phase. $N$
is the LSTM layer size and $d$ is the dimension of
the feature vector for each input word. The vectors
in square brackets are concatenated.</p>
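        <p>The equations above translate directly into code; the following single LSTM step in NumPy is a didactic sketch (not the Keras implementation used for the experiments), with the weight matrices stored in the transposed N x (N+d) layout so they can left-multiply the concatenated vector:</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations above; W maps each gate
    to an N x (N+d) matrix, b to a bias vector of size N."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}; x_t]
    c_hat = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    i_t = sigmoid(W["i"] @ z + b["i"])    # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])    # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])    # output gate
    c_t = i_t * c_hat + f_t * c_prev      # element-wise products
    h_t = np.tanh(c_t) * o_t
    return h_t, c_t

N, d = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(N, N + d)) for g in "cifo"}
b = {g: np.zeros(N) for g in "cifo"}
h, c = np.zeros(N), np.zeros(N)
for x in rng.normal(size=(5, d)):  # a toy sequence of 5 word vectors
    h, c = lstm_step(x, h, c, W, b)
        </preformat>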
        <p>
          Bidirectional LSTM not only processes the
input sequence in the order of the sentence but also
backwards
          <xref ref-type="bibr" rid="ref20 ref9">(Schuster and Paliwal, 1997)</xref>
          . Hence,
we can compute $h_t^r$ using the same equations
described earlier but reversing the word sequence.
Given $h_t$ computed in the sentence order and $h_t^r$ in
the reversed order, the output of the $t$-th bidirectional
LSTM cell, $h_t^b$, is the result of the concatenation of
$h_t$ and $h_t^r$.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3 Sentence representation and attention mechanisms</title>
        <p>The LSTM layers produce, for each input word
$w_i$, a vector $h_i \in \mathbb{R}^N$ which is the result of
processing every word from the start of the sentence
up to $w_i$. Hence, given a sentence of length $m$, $h_m$
can be considered as the sentence representation
produced by the LSTM layer. So, for a sentence
classification task, $h_m$ can be used as the input to
a fully connected layer that provides the
classification.</p>
        <p>
          Even if they perform better than simple RNNs,
LSTM neural networks have difficulties
preserving dependencies between distant words
          <xref ref-type="bibr" rid="ref1 ref19">(Raffel
and Ellis, 2015)</xref>
          and, especially for long
sentences, $h_m$ may not be influenced by the first
words or may be affected by less relevant words.
The Attention mechanism
          <xref ref-type="bibr" rid="ref1 ref22">(Bahdanau et al., 2014;
Kadlec et al., 2016)</xref>
          deals with these problems by
taking into consideration each $h_i$ and computing a weight
$\alpha_i$ for each word's contribution:
      </p>
      <p>$u_i = \tanh(W_a h_i + b_a)$
$\alpha_i = \mathrm{softmax}(u_i) = \exp(u_i) / \sum_{k=1}^{m} \exp(u_k)$
where $W_a \in \mathbb{R}^{N \times N}$ and $b_a \in \mathbb{R}^N$.</p>
      <p>The attention mechanism outputs the sentence
representation</p>
      <p>$s = \sum_{i=1}^{m} \alpha_i h_i$</p>
        <p>
          The Context Attention mechanism
          <xref ref-type="bibr" rid="ref25">(Yang et
al., 2016)</xref>
          is more complex. In order to enhance
the importance of the words for the meaning of
the sentence, it uses a word-level context vector
$u_w$ of additional weights for the calculation of $\alpha_i$:
$\alpha_i = \mathrm{softmax}(u_w^T u_i)$
        </p>
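        <p>As a concrete reference, the context attention variant can be sketched in NumPy as follows (the standard mechanism differs only in how the weights are obtained from the $u_i$, as described above; sizes and values are illustrative):</p>
        <preformat>
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_attention(H, W_a, b_a, u_w):
    """Context attention (Yang et al., 2016): H is the m x N matrix
    of LSTM outputs and u_w the word-level context vector. Returns
    the sentence representation s and the word weights alpha."""
    U = np.tanh(H @ W_a.T + b_a)  # u_i = tanh(W_a h_i + b_a)
    alpha = softmax(U @ u_w)      # alpha_i = softmax(u_w^T u_i)
    s = alpha @ H                 # s = sum_i alpha_i h_i
    return s, alpha

m, N = 6, 8
rng = np.random.default_rng(1)
H = rng.normal(size=(m, N))
s, alpha = context_attention(H, rng.normal(size=(N, N)),
                             np.zeros(N), rng.normal(size=N))
        </preformat>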
        <p>As proposed by Zheng et al. (2018), the
Self-Interaction Attention mechanism uses multiple
vectors $v_i$, one for each word $w_i$, instead of a single one.
This way, we can extract the influence (called
action) between the action controller $w_i$ and the rest
of the sentence, i.e. each $w_k$ for $k \in \{1, \dots, m\}$. The
action of $w_i$ is calculated as follows:</p>
        <p>$\alpha_{ik} = \exp(v_k^T u_i) / \sum_{j=1}^{m} \exp(v_j^T u_i)$
$s_i = \sum_{k=1,\, k \neq i}^{m} \alpha_{ik}\, u_i$
with $u_i$ defined in the same way as in the standard
attention mechanism.</p>
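        <p>A NumPy sketch of this mechanism is given below; the m x m matrix of pairwise scores is the distinctive ingredient, while the choice of $u_i$ as the aggregated vector follows our reading of the formula above:</p>
        <preformat>
import numpy as np

def self_interaction(H, W_a, b_a, V):
    """Self-interaction attention sketch: V holds one weight vector
    v_i per word (an m x N matrix), so the scores form an m x m
    matrix of pairwise influences between the words."""
    U = np.tanh(H @ W_a.T + b_a)  # u_i, as in standard attention
    scores = U @ V.T              # scores[i, k] = v_k^T u_i
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)
    # s_i = sum over k != i of alpha_ik u_i
    weight = alpha.sum(axis=1) - np.diag(alpha)
    S = weight[:, None] * U
    return S.reshape(-1)          # concatenation of the s_i vectors

m, N = 6, 8
rng = np.random.default_rng(2)
flat = self_interaction(rng.normal(size=(m, N)),
                        rng.normal(size=(N, N)), np.zeros(N),
                        rng.normal(size=(m, N)))
        </preformat>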
      </sec>
      <sec id="sec-5-4">
        <title>5.4 Model architecture</title>
        <p>
          In order to obtain also in this case a context vector
representing the sentence, in Zheng et al. (2018)
each $s_i$ is aggregated into a single vector $s$ by taking its
average or maximum, or even by applying another
standard attention layer. In our model we choose to
avoid any pooling operation and to concatenate
instead each $s_i$, creating a flattened representation
          <xref ref-type="bibr" rid="ref6">(Du et al., 2018)</xref>
          which is passed to the classification
layer.
        </p>
        <p>The model designed (see Figure 1) and tested
for the DDI-2013 Relation Extraction task
includes the following layers: three parallel
embedding layers (one with pre-trained word
vectors, one with pre-trained PoS tag vectors and one
that calculates the embedding of the offset
features); two bidirectional LSTM layers that
process the word sequence; the self-interaction
attention mechanism; a fully connected layer with
5 neurons (one for each class) and softmax
activation function that provides the classification.</p>
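        <p>The following Keras sketch mirrors this architecture; the vocabulary sizes, sequence length and LSTM width are illustrative, and a Flatten layer stands in for the self-interaction attention of Section 5.3, whose output is in any case the flattened concatenation of the $s_i$:</p>
        <preformat>
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, N_WORDS, N_POS = 100, 5000, 20  # illustrative sizes

# Three parallel inputs: word indices, PoS tag indices, offset features.
words = keras.Input(shape=(MAX_LEN,))
pos = keras.Input(shape=(MAX_LEN,))
offsets = keras.Input(shape=(MAX_LEN, 2))

w_emb = layers.Embedding(N_WORDS, 200)(words)  # pre-trained word vectors
p_emb = layers.Embedding(N_POS, 4)(pos)        # pre-trained PoS vectors
o_emb = layers.Dense(4)(offsets)               # embedding of D1 and D2

x = layers.Concatenate()([w_emb, p_emb, o_emb])
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

# Stand-in for the self-interaction attention mechanism.
x = layers.Flatten()(x)
out = layers.Dense(5, activation="softmax")(x)  # one neuron per class

model = keras.Model([words, pos, offsets], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
        </preformat>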
        <p>In our experiments, we compare this model
with similar configurations obtained substituting
the self-interaction attention with the standard
attention layer introduced by Bahdanau et al. (2014)
and the context-attention of Yang et al. (2016).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Results and discussion</title>
      <p>
        Our models are implemented using the Keras library
with a TensorFlow backend. We perform a
simple random hyper-parameter search
        <xref ref-type="bibr" rid="ref2">(Bergstra and
Bengio, 2012)</xref>
        in order to optimize the learning
phase and avoid overfitting, using a subset of
sentences as a validation set.
      </p>
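      <p>A minimal random search loop in this spirit is sketched below; the hyper-parameter space is illustrative, and build_model and evaluate are placeholders for the model construction and the validation-set evaluation:</p>
      <preformat>
import random

space = {
    "lstm_units": [32, 64, 128],
    "dropout": [0.2, 0.3, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

best_score, best_config = 0.0, None
for _ in range(20):  # 20 random configurations
    config = {name: random.choice(values) for name, values in space.items()}
    score = evaluate(build_model(config))  # F-score on the validation set
    if score > best_score:
        best_score, best_config = score, config
      </preformat>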
      <sec id="sec-6-1">
        <title>6.1 Evaluation</title>
        <p>We have tested our models with different
input configurations: using only word vectors, using
word and PoS tag vectors, or also adding offset
features.</p>
        <p>
          In Table 1 we show the recall measure for each
input configuration. The effect of self-interaction
is also verified through the Friedman test
          <xref ref-type="bibr" rid="ref7">(Friedman, 1937)</xref>
          : for all input configurations, the model
with self-interaction attention performs better than
the other configurations with a level of confidence
equal to 99%. Similarly, the simple Attention
Mechanism obtains better performance with
respect to the Context Attention with a confidence of
99% (see Figure 2).
        </p>
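        <p>In practice the test can be run with SciPy; the per-configuration recall values below are purely illustrative placeholders for the measurements of Table 1:</p>
        <preformat>
from scipy.stats import friedmanchisquare

# Illustrative recall values for the three attention variants.
self_interaction = [0.71, 0.69, 0.73, 0.70, 0.72]
standard_attention = [0.66, 0.64, 0.68, 0.65, 0.67]
context_attention = [0.61, 0.60, 0.63, 0.62, 0.61]

stat, p_value = friedmanchisquare(self_interaction,
                                  standard_attention,
                                  context_attention)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")  # p below 0.01
        </preformat>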
        <p>In Table 2 we show the F-Score for each class of
the dataset. The overall performance of the
configuration including word vectors, PoS tags and
offset features as input is also considered in
Table 3.</p>
        <p>In Table 3 we compare our results with other
state-of-the-art methods and compare the overall
performance of the three attention mechanisms.
The Context-Att obtains results similar to those
of most of the approaches based on Convolutional
Neural Networks and worse than those of most
LSTM-based models.</p>
        <p>
          In terms of F-Score, Word Attention LSTM
          <xref ref-type="bibr" rid="ref28">(Zheng et al., 2017)</xref>
          outperforms our approach and
the other LSTM-based models by more than 4%.
As we discussed in
          <xref ref-type="bibr" rid="ref17">(Putelli et al., 2019)</xref>
          , we have
tried to replicate their model but we could not
obtain the same results. Furthermore, their attention
mechanism aimed at creating a
“candidate-drugs-oriented” input did not improve the performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions and future work</title>
      <p>We have compared the self-interaction attention
model to alternative configurations using the
standard attention mechanism introduced by
Bahdanau et al. (2014) and the context-attention
mechanism of Yang et al. (2016).</p>
      <p>Our experiments show that the self-interaction
mechanism improves the performance with
respect to other versions, in particular reducing the
tendency of predicting the majority class, hence
decreasing the number of false negatives. The
standard attention mechanism produces better
results than the context attention.</p>
      <p>
        As future work, our objective is to exploit or
adapt the Transformer architecture
        <xref ref-type="bibr" rid="ref23">(Vaswani et al.,
2017)</xref>
        , which has become quite popular for
machine translation tasks and relies almost exclusively on
attention mechanisms, and apply it to relation
extraction tasks like DDI-2013.
      </p>
      <p>
        Another direction includes the exploitation of a
different pre-trained language model. For
example, BioBERT
        <xref ref-type="bibr" rid="ref12">(Lee et al., 2019)</xref>
        obtains good
results for several NLP tasks like Named Entity
Recognition or Question Answering and we plan
to apply it to our task.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          . arXiv preprint arXiv:1409.0473. Accepted at ICLR 2015 as oral presentation.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          , February.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Suwisa Kaewphan, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>UTurku: Drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge</article-title>
          .
          <source>In Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          , Volume
          <volume>2</volume>
          :
          <source>Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval</source>
          <year>2013</year>
          ), pages
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
          , Atlanta, Georgia, USA, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merriënboer,
          <string-name>
            <surname>Caglar Gulcehre</surname>
            , Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
            <given-names>Yoshua</given-names>
          </string-name>
          <string-name>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Md. Faisal Mahbub</surname>
            Chowdhury and
            <given-names>Alberto</given-names>
          </string-name>
          <string-name>
            <surname>Lavelli</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>FBK-irst : A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information</article-title>
          .
          <source>In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT</source>
          <year>2013</year>
          , Atlanta, Georgia, USA, June 14-15,
          <year>2013</year>
          , pages
          <fpage>351</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jinhua</given-names>
            <surname>Du</surname>
          </string-name>
          , Jingguang Han,
          <string-name>
            <given-names>Andy</given-names>
            <surname>Way</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dadong</given-names>
            <surname>Wan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Multi-level structured self-attentions for distantly supervised relation extraction</article-title>
          .
          <source>CoRR, abs/1809.00699</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Milton</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>1937</year>
          .
          <article-title>The use of ranks to avoid the assumption of normality implicit in the analysis of variance</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>32</volume>
          (
          <issue>200</issue>
          ):
          <fpage>675</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Felix A.</given-names>
            <surname>Gers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , and
          <string-name>
            <surname>Fred</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Cummins</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Learning to forget: Continual prediction with LSTM</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>12</volume>
          :
          <fpage>2451</fpage>
          -
          <lpage>2471</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          :
          <fpage>1735</fpage>
          -
          <lpage>80</lpage>
          ,
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Dichao</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An introductory survey on attention mechanisms in NLP problems</article-title>
          . CoRR, abs/1811.05544.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Sunil</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Anand</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Drugdrug interaction extraction from biomedical text using long short term memory network</article-title>
          .
          <source>CoRR, abs/1701.08303</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jinhyuk</given-names>
            <surname>Lee</surname>
          </string-name>
          , Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and
          <string-name>
            <given-names>Jaewoo</given-names>
            <surname>Kang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BioBERT: pretrained biomedical language representation model for biomedical text mining</article-title>
          . arXiv preprint arXiv:1901.08746.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Shengyu</given-names>
            <surname>Liu</surname>
          </string-name>
          , Buzhou Tang, Qingcai Chen, and
          <string-name>
            <given-names>Xiaolong</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Drug-drug interaction extraction via convolutional neural networks</article-title>
          .
          <source>Computational and mathematical methods in medicine</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Ryan</surname>
            <given-names>McDonald</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Georgios-Ioannis Brokos</surname>
            , and
            <given-names>Ion</given-names>
          </string-name>
          <string-name>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep relevance ranking using enhanced document-query interactions</article-title>
          .
          <source>CoRR, abs/1809.01682</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          . In C. J.
          <string-name>
            <surname>C. Burges</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          and
          K. Q. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Luca</given-names>
            <surname>Putelli</surname>
          </string-name>
          , Alfonso E. Gerevini, Alberto Lavelli, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Serina</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Applying self-interaction attention for extracting drug-drug interactions</article-title>
          .
          <source>In Proceedings of 18th International Conference of the Italian Association for Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Chanqin</given-names>
            <surname>Quan</surname>
          </string-name>
          , Lei Hua, Xiao Sun, and
          <string-name>
            <given-names>Wenjun</given-names>
            <surname>Bai</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Multichannel convolutional neural network for biological relation extraction</article-title>
          . BioMed research international,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel P. W.</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Feed-forward networks with attention can solve some long-term memory problems</article-title>
          . CoRR, abs/1512.08756.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Mike</given-names>
            <surname>Schuster</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kuldip K.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          ,
          <volume>45</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2673</fpage>
          -
          <lpage>2681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Isabel</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paloma</given-names>
            <surname>Martínez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>María</given-names>
            <surname>Herrero-Zazo</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Lessons learnt from the DDIExtraction-2013 shared task</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>51</volume>
          :
          <fpage>152</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Kadlec</surname>
          </string-name>
          , Martin Schmid, Ondrej Bajgar, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Kleindienst</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Text understanding with the attention sum reader network</article-title>
          .
          <source>CoRR, abs/1603.01547</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>CoRR, abs/1706.03762</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Gary</given-names>
            <surname>Weiss</surname>
          </string-name>
          and
          <string-name>
            <given-names>Foster</given-names>
            <surname>Provost</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>The effect of class distribution on classifier learning: An empirical study</article-title>
          .
          <source>Technical report</source>
          , Department of Computer Science, Rutgers University.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Zichao</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diyi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Chris Dyer, Xiaodong He,
          <string-name>
            <surname>Alexander J. Smola</surname>
          </string-name>
          , and
          <string-name>
            <surname>Eduard</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Hierarchical attention networks for document classification</article-title>
          .
          <source>In HLT-NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Zibo</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shasha</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jie</given-names>
            <surname>Yu</surname>
          </string-name>
          , Yusong Tan,
          <string-name>
            <surname>Qingbo Wu</surname>
            , Hong Yuan, and
            <given-names>Ting</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Drug-drug interaction extraction via recurrent neural network with multiple attention layers</article-title>
          .
          <source>In International Conference on Advanced Data Mining and Applications</source>
          , pages
          <fpage>554</fpage>
          -
          <lpage>566</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Yijia</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Wei Zheng, Hongfei Lin, Jian
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Zhihao Yang</surname>
            , and
            <given-names>Michel</given-names>
          </string-name>
          <string-name>
            <surname>Dumontier</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>34</volume>
          (
          <issue>5</issue>
          ):
          <fpage>828</fpage>
          -
          <lpage>835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Wei</surname>
            <given-names>Zheng</given-names>
          </string-name>
          , Hongfei Lin, Ling Luo,
          <string-name>
            <given-names>Zhehuan</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhengguang</given-names>
            <surname>Li</surname>
          </string-name>
          , Yijia Zhang,
          <string-name>
            <given-names>Zhihao</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>An attention-based effective neural model for drug-drug interactions extraction</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>18</volume>
          ,
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Jianming</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Fei Cai, Taihua Shao, and
          <string-name>
            <given-names>Honghui</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Self-interaction attention mechanism-based text representation for document classification</article-title>
          .
          <source>Applied Sciences</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>