<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Parallel-Attention Model for Tumor Named Entity Recognition in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tong Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuanyu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongbin Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Kunming 650091</addr-line>
          ,
          <country country="CN">P.R.China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Medical Information Engineering, Zunyi Medical University</institution>
          ,
          <addr-line>Zunyi 563000</addr-line>
          ,
          <country country="CN">P.R.China</country>
        </aff>
      </contrib-group>
      <fpage>438</fpage>
      <lpage>446</lpage>
      <abstract>
        <p>Named Entity Recognition (NER) is a subtask of Natural Language Processing that aims to locate named entities in text and classify them into pre-defined categories. CANTEMIST 2020 is a shared task on tumor named entity recognition, and we propose a new model for it. We use Recurrent Neural Networks and Convolutional Neural Networks to extract relevant text features, and a dynamic attention mechanism then merges the features extracted by these two structures. In the final evaluation, we achieve an F1-score of 0.746.</p>
      </abstract>
      <kwd-group>
        <kwd>tumor named entity recognition</kwd>
        <kwd>clinical records</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>parallel-attention structure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Cancer is one of the diseases with the highest mortality rate, and the number of people suffering
from cancer is increasing year by year. With its rapid development, natural language
processing (NLP) has become an essential means of early cancer diagnosis, which can
effectively prevent the deterioration of the tumor. Biomedical named entity recognition (BNER) is a prerequisite for
other NLP tasks on cancer, and its performance significantly affects the final diagnosis. Tumor
named entity recognition is a subtask of BNER in the tumor field, and its experimental methods
are similar to those of the BNER task.</p>
      <p>
        The current main methods for BNER are machine learning methods and deep learning
methods. For example, Makino et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] use SVM and Settles et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] use CRF to process the
BNER task. However, the feature engineering of these machine learning methods relies on manual
processing, and the resulting models generalize poorly, so they are effective
only on specific tasks.
      </p>
      <p>
        With the rapid development of neural networks (NN), researchers have gradually abandoned
traditional methods for BNER tasks and replaced them with NN methods. Sahu et al.
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] use an RNN to process the BNER task and achieve a significant improvement.
Luo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use an attention-based BiLSTM-CRF approach to solve the BNER task. These
and many other experiments demonstrate the effectiveness of RNNs for BNER tasks. Later,
Devlin et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] create BERT, which achieves significant success on the NER task. Subsequently,
Lee et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] apply the BERT model to BNER and propose BioBERT, which achieves great
success and demonstrates the effectiveness of BERT for the BNER task. However, BERT has a large number
of parameters and a long training time. In this context, implementing a
lightweight RNN-based model with good performance becomes meaningful work.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Our Model</title>
      <p>Our model consists of four parts: the embedding layer, the parallel layer, the attention structure,
and the classification layer. In the embedding layer, we vectorize the text sequences using
word embedding and character embedding. In the parallel layer, the RNN and CNN modules each extract
context information, and the dynamic attention
structure then merges this contextual information. Finally, a dense layer acts as the classifier and
outputs the final prediction. The overall model structure is shown in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Embedding layer</title>
        <p>
          To alleviate the out-of-vocabulary (OOV) problem, our model combines word embedding and character
embedding in the embedding layer, so that character-level features supply information for OOV tokens. The character embedding is
initialized with random feature vectors, while the word embedding is initialized with the Spanish Medical
Word Embeddings [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which are trained on SciELO and Wikipedia Health with fastText [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
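        <p>The embedding scheme of this subsection (a pretrained word vector, randomly initialized character vectors passed through a width-3 convolution, then concatenation) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names char_cnn and embed_token, the toy dimensions, the zero vector used for out-of-vocabulary words, and the max over character positions that reduces a token to one fixed-size vector are all illustrative choices that the text does not specify.</p>
        <preformat><![CDATA[
```python
import random


def char_cnn(char_vecs, kernel):
    """Width-k 'same' convolution over character vectors, then max over positions."""
    k = len(kernel)
    pad = k // 2
    d_char, d_out = len(char_vecs[0]), len(kernel[0])
    zero = [0.0] * d_char
    padded = [zero] * pad + char_vecs + [zero] * pad
    windows = []
    for j in range(len(char_vecs)):
        out = [0.0] * d_out
        for offset in range(k):
            vec = padded[j + offset]
            for o in range(d_out):
                out[o] += sum(w * v for w, v in zip(kernel[offset][o], vec))
        windows.append(out)
    # Assumption: max over character positions gives one fixed-size vector.
    return [max(column) for column in zip(*windows)]


def embed_token(token, word_table, char_table, kernel, d_word):
    # Assumption: a token missing from the pretrained table gets a zero word vector.
    word_vec = word_table.get(token, [0.0] * d_word)
    char_vec = char_cnn([char_table[ch] for ch in token], kernel)
    return word_vec + char_vec  # list concatenation of the two embeddings


rng = random.Random(0)
D_WORD, D_CHAR, D_OUT, K = 4, 3, 2, 3
word_table = {"tumor": [0.1, 0.2, 0.3, 0.4]}  # toy stand-in for pretrained vectors
char_table = {ch: [rng.uniform(-1, 1) for _ in range(D_CHAR)]
              for ch in "abcdefghijklmnopqrstuvwxyz"}
kernel = [[[rng.uniform(-1, 1) for _ in range(D_CHAR)] for _ in range(D_OUT)]
          for _ in range(K)]

known = embed_token("tumor", word_table, char_table, kernel, D_WORD)
unknown = embed_token("carcinoma", word_table, char_table, kernel, D_WORD)
```
]]></preformat>
        <p>In this toy setting, the token "carcinoma" has no pretrained word vector, yet its character-level part is still informative, which is the motivation for combining the two embeddings.</p>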
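        <p>For the word embedding, we denote the input sentence as <inline-formula><tex-math>S = [w_1, \ldots, w_i, \ldots, w_n]</tex-math></inline-formula>, where <inline-formula><tex-math>w_i</tex-math></inline-formula> is
the i-th token and <inline-formula><tex-math>n</tex-math></inline-formula> is the number of tokens in the sentence. The word embedding <inline-formula><tex-math>x_i^{w}</tex-math></inline-formula> is
expressed as follows:</p>
        <disp-formula><tex-math>x_i^{w} = e^{w}(w_i),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>e^{w}(\cdot)</tex-math></inline-formula> is the word embedding function. Then we use character embedding
to supplement the word embedding and obtain OOV information. The character embedding <inline-formula><tex-math>c_i</tex-math></inline-formula> is</p>
        <disp-formula><tex-math>c_i = [c_{i,1}, \ldots, c_{i,j}, \ldots, c_{i,m}],</tex-math></disp-formula>
        <disp-formula><tex-math>c_{i,j} = e^{\mathrm{char}}(ch_{i,j}),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>e^{\mathrm{char}}(\cdot)</tex-math></inline-formula> is a character embedding function that randomly initializes the token <inline-formula><tex-math>w_i</tex-math></inline-formula>,
<inline-formula><tex-math>m</tex-math></inline-formula> is the number of characters in the token <inline-formula><tex-math>w_i</tex-math></inline-formula>, and <inline-formula><tex-math>ch_{i,j}</tex-math></inline-formula> represents the j-th character in the token <inline-formula><tex-math>w_i</tex-math></inline-formula>.
Because the dimensions of the word embedding and the character embedding are inconsistent, they
cannot be merged directly. Here we use a CNN to fine-tune the character embedding, and the
specific step is as follows:</p>
        <disp-formula><tex-math>x_i^{c} = \mathrm{CNN}([c_{i,j-\lfloor k/2 \rfloor}, \ldots, c_{i,j}, \ldots, c_{i,j+\lfloor k/2 \rfloor}]),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>k</tex-math></inline-formula> represents the size of the convolution kernel, and <inline-formula><tex-math>k</tex-math></inline-formula> is set to 3 in this paper. Finally, we
concatenate the word embedding <inline-formula><tex-math>x_i^{w}</tex-math></inline-formula> and the character embedding <inline-formula><tex-math>x_i^{c}</tex-math></inline-formula> to obtain the final token features <inline-formula><tex-math>x_i = [x_i^{w}; x_i^{c}]</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Parallel layer</title>
        <p>The parallel layer aims to extract the context-dependence information of the text sequence.
It consists of an RNN module that captures long-distance dependence information and a CNN
module that captures local dependence information.</p>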
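        <p>
          RNN module: In both industry and academia, long short-term memory (LSTM) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] has
proven its excellent performance in NER, but LSTM adds a forget gate mechanism, which
introduces more parameters and increases training costs. To make the model more lightweight
while preserving strong performance, we combine the Gated Recurrent Unit (GRU) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and the
Simple Recurrent Unit (SRU) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] in this model to replace the traditional LSTM. First, the
output sequence <inline-formula><tex-math>X = [x_1, \ldots, x_n]</tex-math></inline-formula> of the embedding layer is used as the input of the GRU, and the
output of the GRU can be expressed as follows:
        </p>
        <disp-formula><tex-math>z_t = \sigma(W_z \cdot [g_{t-1}, x_t]),</tex-math></disp-formula>
        <disp-formula><tex-math>r_t = \sigma(W_r \cdot [g_{t-1}, x_t]),</tex-math></disp-formula>
        <disp-formula><tex-math>\tilde{g}_t = \tanh(W_g \cdot [r_t * g_{t-1}, x_t]),</tex-math></disp-formula>
        <disp-formula><tex-math>g_t = (1 - z_t) * g_{t-1} + z_t * \tilde{g}_t,</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>W_z</tex-math></inline-formula>, <inline-formula><tex-math>W_r</tex-math></inline-formula>, and <inline-formula><tex-math>W_g</tex-math></inline-formula> are the parameter matrices for calculating the update gate <inline-formula><tex-math>z_t</tex-math></inline-formula>, the reset gate <inline-formula><tex-math>r_t</tex-math></inline-formula>,
and the new memory <inline-formula><tex-math>\tilde{g}_t</tex-math></inline-formula>, which gives the output <inline-formula><tex-math>G = [g_1, \ldots, g_t, \ldots]</tex-math></inline-formula> of the GRU. To
extract long-distance dependence information more comprehensively, the model stacks the
GRU and the SRU, so that the SRU takes the GRU output as its input. At the same time, considering the overfitting caused by ordinary stacking
and the vanishing gradient of the SRU, the output of the GRU is added as a penalty to the output
of the SRU. After the addition operation, the output of the final RNN module is as follows:</p>
        <disp-formula><tex-math>f_t = \sigma(U_f g_t + v_f \odot c_{t-1} + b_f),</tex-math></disp-formula>
        <disp-formula><tex-math>c_t = f_t \odot c_{t-1} + (1 - f_t) \odot (U g_t),</tex-math></disp-formula>
        <disp-formula><tex-math>q_t = \sigma(U_q g_t + v_q \odot c_{t-1} + b_q),</tex-math></disp-formula>
        <disp-formula><tex-math>s_t = q_t \odot c_t + (1 - q_t) \odot g_t,</tex-math></disp-formula>
        <disp-formula><tex-math>h_t = g_t + s_t,</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>U</tex-math></inline-formula>, <inline-formula><tex-math>U_f</tex-math></inline-formula>, and <inline-formula><tex-math>U_q</tex-math></inline-formula> are parameter matrices, and <inline-formula><tex-math>v_f</tex-math></inline-formula>, <inline-formula><tex-math>v_q</tex-math></inline-formula>, <inline-formula><tex-math>b_f</tex-math></inline-formula>, and <inline-formula><tex-math>b_q</tex-math></inline-formula> are parameter vectors in the SRU.
<inline-formula><tex-math>H = [h_1, \ldots, h_t, \ldots]</tex-math></inline-formula> is the final output of the entire RNN module.</p>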
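        <p>The stacked recurrence of the RNN module can be sketched in plain Python: a GRU step, an SRU step that takes the GRU state as its input, and the element-wise addition of the two outputs. This is a minimal sketch rather than the authors' implementation. The function names, the parameter dictionary, the toy dimensions, the random initialization, and the standard GRU state update (an interpolation between the previous state and the new memory, as in Cho et al.) are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x, g_prev, p):
    concat = g_prev + x                                   # [g_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(p["Wz"], concat)]     # update gate
    r = [sigmoid(a) for a in matvec(p["Wr"], concat)]     # reset gate
    gated = [ri * gi for ri, gi in zip(r, g_prev)] + x    # [r_t * g_{t-1}, x_t]
    g_new = [math.tanh(a) for a in matvec(p["Wg"], gated)]
    return [(1 - zi) * gp + zi * gn for zi, gp, gn in zip(z, g_prev, g_new)]


def sru_step(g, c_prev, p):
    uf, uq, u = matvec(p["Uf"], g), matvec(p["Uq"], g), matvec(p["U"], g)
    f = [sigmoid(a + v * cp + b) for a, v, cp, b in zip(uf, p["vf"], c_prev, p["bf"])]
    c = [fi * cp + (1 - fi) * ui for fi, cp, ui in zip(f, c_prev, u)]
    q = [sigmoid(a + v * cp + b) for a, v, cp, b in zip(uq, p["vq"], c_prev, p["bq"])]
    s = [qi * ci + (1 - qi) * gi for qi, ci, gi in zip(q, c, g)]
    return s, c


def rnn_module(X, p, d):
    g, c, H = [0.0] * d, [0.0] * d, []
    for x in X:
        g = gru_step(x, g, p)
        s, c = sru_step(g, c, p)
        H.append([gi + si for gi, si in zip(g, s)])  # GRU output added to SRU output
    return H


def init_params(d_in, d, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    def vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(d)]

    p = {name: mat(d, d + d_in) for name in ("Wz", "Wr", "Wg")}
    p.update({name: mat(d, d) for name in ("U", "Uf", "Uq")})
    p.update({name: vec() for name in ("vf", "vq", "bf", "bq")})
    return p


rng = random.Random(0)
D_IN, D = 6, 4                     # toy sizes, not the paper's hyperparameters
params = init_params(D_IN, D, rng)
X = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(5)]
H = rnn_module(X, params, D)       # one D-dimensional vector per token
```
]]></preformat>
        <p>The SRU step uses only element-wise operations on the previous cell state, which is what makes it cheaper than an LSTM layer of the same width; in this sketch the SRU input and hidden sizes are kept equal so that the highway term can reuse the GRU state directly.</p>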
        <p>CNN module: Because of the nature of convolution, a CNN cannot capture the
long-distance dependence information in the text sequence. However, thanks to the sliding-window
mechanism of the convolution operation, we can obtain clear local features by controlling the
size of the convolution kernel. The CNN module extracts local information over different
distances by using three convolutional layers with different kernel sizes. After extensive
tuning experiments, we set the sizes of the three convolution kernels to 3, 5 and 7 respectively.
We then compress the data with an average pooling operation, which also reduces data
redundancy. Finally, the results of the three convolution layers are added to obtain the output <inline-formula><tex-math>C</tex-math></inline-formula> of the
CNN module.</p>
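        <p>The three-branch convolution can be illustrated with a short plain-Python sketch: each branch applies a length-preserving 1D convolution with kernel size 3, 5 or 7, followed by average pooling, and the branch outputs are summed element-wise. This is a sketch under stated assumptions, not the authors' implementation. The zero padding, the pooling window of 3 with stride 1 (averaging only over valid neighbours so that the sequence length is preserved), the function names, and the toy dimensions are illustrative choices that the text does not specify.</p>
        <preformat><![CDATA[
```python
import random


def conv1d_same(seq, kernel):
    """Length-preserving 1D convolution; kernel is k matrices of shape d_out x d_in."""
    k = len(kernel)
    pad = k // 2
    d_in, d_out = len(seq[0]), len(kernel[0])
    zero = [0.0] * d_in
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        y = [0.0] * d_out
        for offset in range(k):
            x = padded[t + offset]
            for o in range(d_out):
                y[o] += sum(w * v for w, v in zip(kernel[offset][o], x))
        out.append(y)
    return out


def avg_pool_same(seq, window=3):
    """Stride-1 average pooling over valid neighbours (assumed pooling setup)."""
    half, n = window // 2, len(seq)
    out = []
    for t in range(n):
        span = seq[max(0, t - half):min(n, t + half + 1)]
        out.append([sum(column) / len(span) for column in zip(*span)])
    return out


def cnn_module(seq, kernels):
    """Sum the pooled outputs of one convolution branch per kernel."""
    branches = [avg_pool_same(conv1d_same(seq, kernel)) for kernel in kernels]
    return [[sum(values) for values in zip(*step)] for step in zip(*branches)]


def random_kernel(k, d_in, d_out, rng):
    return [[[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
            for _ in range(k)]


rng = random.Random(0)
D_IN, D_OUT = 4, 3                 # toy sizes, not the paper's hyperparameters
kernels = [random_kernel(k, D_IN, D_OUT, rng) for k in (3, 5, 7)]
X = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(8)]
C = cnn_module(X, kernels)         # one D_OUT-dimensional vector per token
```
]]></preformat>
        <p>Keeping every branch length-preserving is what allows the CNN output to be combined position by position with the RNN output later on.</p>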
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Attention structure</title>
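        <p>
          Similar to the self-attention mechanism [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], weighted attention can dynamically assign different
weights to long-distance dependence information and local dependence information in the
current sequence, compressing the data and enhancing the effective data. First, we apply
the activation function <inline-formula><tex-math>\tanh</tex-math></inline-formula> to the output <inline-formula><tex-math>H</tex-math></inline-formula> of the RNN module and the output <inline-formula><tex-math>C</tex-math></inline-formula> of the CNN
module, and then add the two parts to form the matrix <inline-formula><tex-math>Z</tex-math></inline-formula>. The specific steps are as follows:
        </p>
        <disp-formula><tex-math>H' = \tanh(W_h H + b_h),</tex-math></disp-formula>
        <disp-formula><tex-math>C' = \tanh(W_c C + b_c),</tex-math></disp-formula>
        <disp-formula><tex-math>Z = H' + C',</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>W_h</tex-math></inline-formula> and <inline-formula><tex-math>W_c</tex-math></inline-formula> are weights and <inline-formula><tex-math>b_h</tex-math></inline-formula> and <inline-formula><tex-math>b_c</tex-math></inline-formula> are bias terms. Through the weight control of the activation function, the
attention matrix of the current sequence can be obtained after the addition operation. Taking into
account the new noise brought by the addition operation, we apply the activation function
<inline-formula><tex-math>\sigma</tex-math></inline-formula> to <inline-formula><tex-math>Z</tex-math></inline-formula>, which can enhance the weight of the effective values:</p>
        <disp-formula><tex-math>A = \sigma(W_a Z + b_a),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>W_a</tex-math></inline-formula> is a weight and <inline-formula><tex-math>b_a</tex-math></inline-formula> is a bias term. We thus obtain the attention matrix <inline-formula><tex-math>A</tex-math></inline-formula> and use it
to weight the outputs of the RNN and CNN modules. The specific method is as follows:</p>
        <disp-formula><tex-math>O = A \cdot H + (1 - A) \cdot C.</tex-math></disp-formula>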
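        <p>The weighting step can be made concrete with a small plain-Python sketch that builds a gate from the two module outputs and uses it to mix them position by position. This is a hedged illustration, not the authors' code: it assumes that the second activation function is the logistic sigmoid, that the gate is applied element-wise, and that the RNN and CNN outputs have the same dimension; the names gate and fuse, the parameter dictionary, and the toy sizes are likewise assumptions.</p>
        <preformat><![CDATA[
```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gate(h, c, p):
    """Attention weights for one position from the RNN vector h and CNN vector c."""
    z = [math.tanh(u) + math.tanh(v)
         for u, v in zip(affine(p["Wh"], h, p["bh"]), affine(p["Wc"], c, p["bc"]))]
    # Assumption: the second activation is the logistic sigmoid, so weights lie in (0, 1).
    return [sigmoid(u) for u in affine(p["Wa"], z, p["ba"])]


def fuse(H, C, p):
    """Element-wise mix: a * h + (1 - a) * c at every position."""
    O = []
    for h, c in zip(H, C):
        a = gate(h, c, p)
        O.append([ai * hi + (1 - ai) * ci for ai, hi, ci in zip(a, h, c)])
    return O


def init_params(d, rng):
    def mat():
        return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

    def vec():
        return [rng.uniform(-1, 1) for _ in range(d)]

    return {"Wh": mat(), "bh": vec(), "Wc": mat(), "bc": vec(), "Wa": mat(), "ba": vec()}


rng = random.Random(0)
D, N = 3, 4                        # toy sizes, not the paper's hyperparameters
params = init_params(D, rng)
H = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]  # stands in for RNN output
C = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]  # stands in for CNN output
O = fuse(H, C, params)
```
]]></preformat>
        <p>Because each weight lies strictly between 0 and 1 under the sigmoid assumption, every fused value lies between the corresponding RNN and CNN values, so the gate can only interpolate between the two modules rather than amplify either one.</p>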
        <p>Through the attention structure, our model can assign different weights to the outputs of the
RNN and CNN modules and filter out the useless information from both modules at the same
time, thereby effectively reducing the impact of redundant context-dependence information on
the final result.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Classification layer</title>
          <p>After encoding by the parallel layer and the attention structure, the model obtains a sequence
of semantic features that contain context dependence. To limit over-fitting
during training, only a single dense layer is used as the final classifier. Finally, the
activation function <inline-formula><tex-math>\mathrm{softmax}</tex-math></inline-formula> computes the probability of each label for each token, and the final
prediction is obtained.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Result</title>
      <sec id="sec-4-1">
        <title>4.1. Corpus</title>
        <p>The CANTEMIST corpus consists of 3000 clinical cases written in Spanish, which professional clinical coding
experts annotated with eCIE-O-3.1 codes using the BRAT annotation
tool. The cases are distributed as plain text in UTF-8 encoding, and each clinical case is stored
in its own file. The corpus is randomly divided into a training set, development sets and a test set. There
are two development sets in this task, and we use them to tune the hyperparameters of our
model. The test set has 300 clinical records with standard annotations, which are used to
test the performance of the model. The task also includes a background set without standard
annotations to prevent participating teams from making manual corrections. Finally, we
count the number of records and sentences to help understand the distribution of the data sets.
The results are shown in Table 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Preprocessing and hyperparameter settings</title>
        <p>In the preprocessing for this task, we perform basic sentence and word segmentation
on the original text. Sentences in clinical texts are split on specific punctuation (such
as line breaks or periods), and words are split on spaces. Because the model uses
character embedding, we do not remove stop words from the original text during
preprocessing; we do remove punctuation marks, special symbols, and single-word
sentences.</p>
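        <p>The preprocessing steps above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions rather than the authors' script: the function name preprocess, the exact regular expressions, and the sample text are illustrative, and splitting on every period is a simplification that would also split decimal numbers.</p>
        <preformat><![CDATA[
```python
import re


def preprocess(text):
    """Split text into sentences and tokens, dropping punctuation and one-word sentences."""
    sentences = []
    # Sentence boundaries: periods and line breaks.
    for raw in re.split(r"[.\n]+", text):
        tokens = []
        # Word boundaries: whitespace. Stop words are deliberately kept.
        for word in raw.split():
            # Strip punctuation and special symbols; accented letters are preserved.
            cleaned = re.sub(r"[\W_]", "", word)
            if cleaned:
                tokens.append(cleaned)
        if len(tokens) > 1:  # discard empty and single-word sentences
            sentences.append(tokens)
    return sentences


sample = "Paciente con carcinoma ductal infiltrante.\nAlta.\nSe realizó biopsia (mama izquierda)."
result = preprocess(sample)
```
]]></preformat>
        <p>On the sample text, the one-word sentence "Alta" is dropped and the parentheses are removed, while stop words such as "con" and "Se" are retained.</p>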
        <p>Finally, we implement the model in Keras with the TensorFlow backend, and
train and predict on an NVIDIA GeForce RTX 2080 Ti. The detailed hyperparameter settings of the
experiment are shown in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparative experiment and experiment result</title>
        <p>To verify the stability of the model and the credibility of the result, we compare our model with
BERT, which is the most popular model for NER. The results are shown in Table 3. They show
that our model performs better than BERT on this data set, with an F1-score more than
2% higher. Compared with BERT, our model has fewer parameters and a lower training cost.
The comparative experiment shows that our model performs well on tumor named entity
recognition in Spanish.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Ablation Study</title>
      <p>The ablation study illustrates the effectiveness of the RNN and CNN modules in this model.</p>
      <p>For the RNN module, we mainly compare the performance of LSTM, GRU and SRU without
stacking, and the impact of different stacking methods on model performance. First, we use
LSTM, GRU, and SRU as encoders to encode the embedding layer features without stacking.
From these experiments, we find that GRU and LSTM perform similarly on this task, both with
F1-scores of 0.736. Considering the cost of training, GRU effectively reduces the time of a single training
run and requires fewer parameters than LSTM. Therefore, we use the GRU as the first
encoder of the RNN module. For the second encoder of the model, SRU and LSTM
perform similarly when stacked on the GRU, with F1-scores of 0.746 and 0.745 respectively.
Because SRU greatly reduces the number of parameters and the training cost, the combination of SRU and GRU
is the more suitable choice for this task. The results of the RNN module comparative experiment
are shown in Table 4.</p>
      <p>For the CNN module, we want to determine whether the CNN module is necessary.
First, we stack the RNN module directly on the embedding layer, and the F1-score of
the model reaches only 0.727. Then we replace the embedding
layer with the CNN module and find that the F1-score of the model reaches 0.746. This shows that the CNN module
helps to improve model performance. The experiment results are shown in Table 5.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we verified the effectiveness of the parallel-attention model and the strong
performance of RNNs on tumor named entity recognition through the CANTEMIST-NER task.
Compared with BERT, which has higher training costs, relies heavily on the corpus, and risks
overfitting, our model is more lightweight. This also shows that an RNN is a more suitable choice
for NER tasks.</p>
      <p>In our model, we use the local dependency extraction ability of the CNN to complement the
feature extraction ability of the RNN. At the same time, the long-distance dependence extraction
ability of the RNN is further strengthened by stacking the GRU and SRU. Finally, by adding a
dynamic attention mechanism, our model can filter the redundant features extracted by the
CNN and RNN. To alleviate the OOV problem, we combine character embedding and word embedding.
The comparative experiments and the ablation study show
the strong performance of the model and the rationality of each module in this
model.</p>
      <p>For future work, our team hopes to compare the model with more advanced models and
to verify its effectiveness on more data sets in order to improve its generalization capability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <series>CEUR Workshop Proceedings</series>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Named entity recognition from biomedical texts using a fusion attention-based BiLSTM-CRF</article-title>
          ,
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>73627</fpage>
          -
          <lpage>73636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Recurrent neural network regularization</article-title>
          ,
          <source>arXiv preprint arXiv:1409.2329</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          ,
          <article-title>Gradient-based learning applied to document recognition</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <year>1998</year>
          )
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Makino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          , et al.,
          <article-title>Tuning support vector machines for biomedical named entity recognition</article-title>
          ,
          <source>in: Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <article-title>Abner: an open source tool for automatically tagging genes, proteins and other entity names in text</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>21</volume>
          (
          <year>2005</year>
          )
          <fpage>3191</fpage>
          -
          <lpage>3192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <article-title>Recurrent neural network models for disease name recognition using domain invariant features</article-title>
          ,
          <source>arXiv preprint arXiv:1606.09371</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>An attention-based bilstm-crf approach to document-level chemical named entity recognition</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>34</volume>
          (
          <year>2018</year>
          )
          <fpage>1381</fpage>
          -
          <lpage>1388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Armengol-Estapé</surname>
          </string-name>
          ,
          <article-title>Medical word embeddings for spanish: Development and evaluation</article-title>
          ,
          <source>in: Proceedings of the 2nd Clinical Natural Language Processing Workshop</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
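          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Bag of tricks for efficient text classification</article-title>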
          ,
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Training RNNs as fast as CNNs</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
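          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>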
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          ,
          <publisher-name>Curran Associates, Inc.</publisher-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>