End-to-End Neural Coder for Tumor Named Entity Recognition

Mohammed Jabreel
Department of Computer Science, Hodeidah University, Hodeidah 1821, Yemen

Abstract
This paper describes E2ENC, the system that we developed to participate in CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition). E2ENC is a data-driven, end-to-end neural network-based system. It does not rely on external resources such as part-of-speech taggers. It solves two problems jointly: the first is to automatically extract the tumor morphology mentions that can be found in medical documents written in Spanish; the second is to find the corresponding eCIE-O-3.1 code for each extracted entity. E2ENC shows promising results compared to the baseline system. The reported results show that the proposed system achieves a micro-F1 of 84.9% and 77.7% on the test set for the first and second sub-tasks, respectively, and a MAP of 73.7% on the third sub-task.

Keywords
Information Extraction, Deep Learning, Text Mining, Medical Documents

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: mhjabreel@gmail.com (M. Jabreel)
url: https://mhjabreel.com (M. Jabreel)
orcid: 0000-0001-9774-9066 (M. Jabreel)

1. Introduction
Recently, cancer has become one of the leading causes of death worldwide, surpassing cardiovascular diseases. To perform research, improve standards of healthcare and evaluate cancer treatment outcomes easily (and ideally in an automated way), access to a variety of data sources is required. The knowledge embedded in unstructured textual documents (e.g., pathology reports, clinical notes) is crucial to achieve all these goals. Hence, there is an imperative need to take advantage of natural language processing (NLP) and text mining technologies to develop Information Extraction (IE) systems that can automatically process unstructured medical resources, i.e., pathology reports and clinical notes, and extract critical information that leads to better clinical decision-making.

There are multiple approaches to building such IE systems. In general, these systems leverage NLP tools, such as tokenizers, part-of-speech taggers and parsers, to pre-process the documents. After that, modules, which may be rule-based or machine learning-based, are designed to solve the IE-related tasks. Starting from a seed collection of entities, the idea of rule-based systems is to manually engineer rules based on regular expressions or on syntactic or dependency structures to expand the collection iteratively [1, 2, 3].

The feature-engineering-based systems aim to train a sequence tagger with rich, hand-crafted features based on linguistic or syntactic information from an annotated corpus to predict a label (e.g., O, B-<entity> or I-<entity>) for each token in a sentence [4]. Rule-based and feature-engineering-based approaches are labor-intensive, as rules and features must be constructed from linguistic and syntactic information. Despite some promising results, there are two main issues with these approaches. First, the engineering of rules and features is a time-consuming task; moreover, rules always need to be updated.
Second, the systems of these two categories depend on external requirements, such as a parser that analyzes the syntactic and dependency structure of sentences. Therefore, the performance of these systems relies on the quality of the parsing results [5, 2]. To avoid these issues, deep learning is used to develop systems that learn high-level representations for each token, on which a classifier or sequence tagger can be trained [6, 7, 8].

CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) [9] is the first shared task specifically focusing on named entity recognition of a critical type of concept related to cancer, namely tumor morphology, in the Spanish language. There are three structured sub-tasks: "NER", "NORM" and "CODING". The first sub-task aims to automatically identify tumor morphology mentions. All tumor morphology mentions are defined by their corresponding character offsets in UTF-8 plain-text medical documents. The second sub-task aims at returning all tumor morphology entity mentions together with their corresponding eCIE-O-3.1 codes, i.e., finding and normalizing tumor morphology mentions. The goal of the third sub-task is to return, for each document, a ranked list of its corresponding ICD-O-3 codes (Spanish version: eCIE-O-3.1).

We participated in the CANTEMIST challenge by developing E2ENC, an end-to-end neural coder system for the three sub-tasks. The proposed system provides an end-to-end solution and does not require any parsers or other linguistic resources. Specifically, the proposed system is a multilayer neural network in which the first seven layers learn a high-level representation of a sequence of tokens; the output of these layers is then passed, in parallel, to four Conditional Random Field (CRF) models that are learned jointly. One is for extracting the tumor morphology mentions, while the others are for identifying their ICD-O-3 codes.

The rest of the paper is structured as follows: Section 2 presents the methodology; Section 3 explains the experimental settings and discusses the results; finally, Section 4 concludes the paper.

2. System Description
In this section, we introduce our system, i.e., E2ENC, and its implementation steps in detail. We first describe the problem and the corresponding sub-tasks. Given a clinical document written in Spanish, our goal is to build a system that can i) automatically extract tumor morphology mentions, the output of this step being the start and end positions of the text spans, i.e., the character offsets of the tumor morphology mentions; ii) normalize the extracted entities by returning all tumor morphology entity mentions together with their corresponding eCIE-O-3.1 codes; and iii) finally, given the list of eCIE-O-3.1 codes extracted from the document, produce a ranked list of ICD-O-3 codes (Spanish version: eCIE-O-3.1).

Figure 1: The overall architecture of the E2ENC system (sentence splitter, tokenizer, NER module, normalizer and coder).

We propose a deep learning-based system to solve the tasks defined above. Figure 1 depicts the overall architecture of the system. First, we use a sentence splitter to find all the sentences in the document. After that, we tokenize each sentence to obtain the list of tokens. As soon as we get the tokens, we pass them into the Named Entity Recognizer (NER) model to extract the tumor morphology entity mentions. Then, we use a Normalizer (NORM) model to find the code of each extracted tumor morphology mention. Finally, we take the set of the codes, i.e., removing the duplicates, and use that set as the output for the third task. A sketch of this pipeline is given below.
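For illustration, the following is a minimal sketch of this pipeline. The sentence splitter and tokenizer shown here are deliberately simplistic placeholders, and the `ner_model.extract` / `norm_model.normalize` interfaces are assumptions for the sake of the example rather than the exact interfaces of our implementation.

```python
import re
from typing import List, Tuple

def split_sentences(text: str) -> List[Tuple[str, int]]:
    """Very simple sentence splitter: returns (sentence, start_offset) pairs."""
    return [(m.group(), m.start()) for m in re.finditer(r"[^.\n]+[.\n]?", text)]

def tokenize(sentence: str) -> List[Tuple[str, int, int]]:
    """Crude tokenizer: returns (token, start, end) offsets relative to the sentence."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", sentence)]

def process_document(text: str, ner_model, norm_model):
    """End-to-end flow: NER mentions, their eCIE-O-3.1 codes, and the distinct codes (CODING)."""
    entities = []
    for sentence, offset in split_sentences(text):
        tokens = tokenize(sentence)
        # ner_model.extract is assumed to return (start, end) character spans within the sentence
        for start, end in ner_model.extract(tokens):
            code = norm_model.normalize(tokens, (start, end))  # assumed NORM interface
            entities.append({"start": offset + start, "end": offset + end,
                             "text": sentence[start:end], "code": code})
    coding = list(dict.fromkeys(e["code"] for e in entities))  # deduplicated, order-preserving
    return entities, coding
```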
Our system is based on a deep learning model that solves the first two tasks, i.e., NER and NORM, jointly. The idea is to regard the problem as multi-label, multi-tag token classification. Given a sequence of tokens $W = \{w_1, w_2, ..., w_n\}$, the model must produce for each token $w_i$, $1 \leq i \leq n$, a vector of labels $L_i = \{l_i^1, l_i^2, ..., l_i^m\}$, where $n$ refers to the sequence length, i.e., the number of tokens, and $m$ is the number of labelling tasks. Each symbol $l_i^t$ is one of the tags of the BIOES tagging scheme, i.e., $l_i^t \in \{B\text{-}label, I\text{-}label, O, E\text{-}label, S\text{-}label\}$, where $label$ is one of the set of labels of task $t$. In our case, the number of labelling tasks, i.e., $m$, is four. The first task is the NER task and has only one label, "MORFOLOGIA_NEOPLASIA". The remaining three tasks are related to the NORM task: considering that the format of eCIE-O-3.1 codes is \d{4,}/\d{1,}[/H], and by splitting the code on the "/" symbol, we decompose the NORM task into three tagging tasks, each corresponding to the production of one part of the code (a small worked example is given below).

We designed a deep learning model, shown in Figure 2, that is composed of eight consecutive layers to solve the four tagging tasks jointly. The following subsections explain the model and the training procedure in detail. First, we start by explaining the embedding layer in Subsection 2.1. After that, in Subsection 2.2, we describe the structure of the encoder layer. Subsection 2.3 explains the output layer. Finally, in Subsection 2.4, we define the objective function and detail the training routine.

Figure 2: The architecture of the E2ENC system (embedding layer, six-layer LSTM encoder, and the four output heads NER, Norm #1, Norm #2 and Norm #3).
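To make the decomposition concrete, the sketch below uses the real morphology code 8041/3 (small cell carcinoma) and a two-token Spanish mention as an example. The helper names, and the "NONE" placeholder used when the optional third code part is absent, are illustrative choices rather than the exact label inventory of our implementation.

```python
from typing import List, Optional, Tuple

def split_code(code: str) -> Tuple[str, str, Optional[str]]:
    """Split an eCIE-O-3.1 code of the form \\d{4,}/\\d{1,}[/H] into its three parts."""
    parts = code.split("/")
    return parts[0], parts[1], parts[2] if len(parts) > 2 else None

def bioes_tags(n_tokens: int, label: str) -> List[str]:
    """BIOES tags for a mention spanning n_tokens tokens, all carrying the same label."""
    if n_tokens == 1:
        return [f"S-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"E-{label}"]

# Example: the two-token mention "carcinoma microcítico" annotated with code 8041/3.
mention_tokens = ["carcinoma", "microcítico"]
part1, part2, part3 = split_code("8041/3")          # -> ("8041", "3", None)

tags = {
    "ner":    bioes_tags(len(mention_tokens), "MORFOLOGIA_NEOPLASIA"),
    "norm_1": bioes_tags(len(mention_tokens), part1),
    "norm_2": bioes_tags(len(mention_tokens), part2),
    "norm_3": bioes_tags(len(mention_tokens), part3 or "NONE"),  # placeholder when the /H part is absent
}
# tags["ner"]    == ["B-MORFOLOGIA_NEOPLASIA", "E-MORFOLOGIA_NEOPLASIA"]
# tags["norm_1"] == ["B-8041", "E-8041"]
```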
2.1. Embedding Layer
The goal of the embedding layer is to represent each word $w_i \in W$ by a low-dimensional vector $v_i \in \mathbb{R}^d$. Here, $d$, which is 400 in our case, is the size of the embedding layer. As shown in Figure 3, we use two levels of embedding: word-level and character-level. For the word-level embedding, we replace $w_i$ with its pre-trained word embedding vector $v_i^w$ [10]. We use a single-layer one-dimensional convolutional neural network (Conv1D) with max-over-time pooling to represent the word at the character level, as follows. Suppose that $w_i$ is made up of a sequence of characters $[c_1, c_2, ..., c_N]$, where $N$ is the length of $w_i$. First, we pass the sequence of characters of the word $w_i$ to a randomly initialized character embedding layer to get the matrix $C_i \in \mathbb{R}^{r \times N}$, the character-level representation of $w_i$, where the $j$-th column corresponds to the character embedding of $c_j$.

Figure 3: The embedding layer (300-dimensional pre-trained word embedding lookup table, character embedding model with convolution and max-pooling over time, and the resulting 400-dimensional vector $v_i$). The solid gray part is non-trainable.

After that, we apply a narrow convolution between $C_i$ and a filter (or kernel) $K \in \mathbb{R}^{r \times k}$ of width $k$, after which we add a bias and apply a nonlinearity to obtain a feature map $f^i \in \mathbb{R}^{N-k+1}$. Specifically, the $m$-th element of $f^i$ is given by:

$f^i[m] = \tanh(\langle C_i[*, m:m+k-1], K \rangle + b)$   (1)

where $C_i[*, m:m+k-1]$ is the $m$-th to $(m+k-1)$-th columns of $C_i$ and $\langle A, B \rangle$ is the Frobenius inner product. Finally, we take the max over time

$v_i^c = \max_m f^i[m]$   (2)

as the feature corresponding to the filter $K$ (when applied to the word $w_i$). A filter is essentially picking out a character $n$-gram, where the size of the $n$-gram corresponds to the filter width. The final representation of the word $w_i$ is given by concatenating the word-level vector and the character-level vector:

$v_i = [v_i^w ; v_i^c]$   (3)

The embedding layer is the first layer of our model. It takes as input a word $w_i$ and produces a vector $v_i$ that is regarded as the representation of that word.

2.2. Encoder Layer
The main objective of the encoder is to transform the sequence of vectors $v_1, v_2, ..., v_n$ produced by the embedding layer into a sequence of recurrent states $h_1, h_2, ..., h_n$ at a higher level of representation. The encoder of our system is inspired by the encoder of Google's neural machine translation system (GNMT) [11]. It is composed of six Long Short-Term Memory (LSTM) [12] recurrent layers. The first encoder layer is bi-directional: it processes the information both from left to right and from right to left. The other layers of the encoder are uni-directional. Residual connections start from the third layer from the bottom of the encoder. A sketch of the embedding and encoder layers is given below.
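The following is a minimal PyTorch sketch of the embedding layer (Eqs. 1-3) and the GNMT-style encoder described above. Details not fixed in the text, such as the convolution width $k$, the character embedding size and the LSTM hidden size, are illustrative assumptions, so this should be read as a sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class Embedding(nn.Module):
    """v_i = [v_i^w ; v_i^c]: frozen pre-trained word vectors (300-d) concatenated with a
    character-level Conv1D + max-over-time vector (100-d), following Eqs. (1)-(3)."""
    def __init__(self, word_vectors, n_chars, char_dim=30, char_channels=100, k=3):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)  # non-trainable
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, char_channels, kernel_size=k)             # narrow convolution

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n); char_ids: (batch, n, N) with characters padded so that N >= k
        b, n, N = char_ids.shape
        v_w = self.word_emb(word_ids)                                  # (b, n, 300)
        c = self.char_emb(char_ids.view(b * n, N)).transpose(1, 2)     # (b*n, char_dim, N)
        f = torch.tanh(self.conv(c))                                   # (b*n, channels, N-k+1), Eq. (1)
        v_c = f.max(dim=2).values.view(b, n, -1)                       # max-over-time pooling, Eq. (2)
        return torch.cat([v_w, v_c], dim=-1)                           # (b, n, 400), Eq. (3)

class Encoder(nn.Module):
    """Six LSTM layers: the first bi-directional, the rest uni-directional,
    with residual connections starting from the third layer from the bottom."""
    def __init__(self, input_dim=400, hidden_dim=256):
        super().__init__()
        self.bi = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.uni = nn.ModuleList(
            [nn.LSTM(2 * hidden_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
             for i in range(5)])

    def forward(self, x):
        h, _ = self.bi(x)                    # layer 1: (b, n, 2*hidden_dim)
        for i, layer in enumerate(self.uni):
            out, _ = layer(h)
            # from the third layer onward the input and output sizes match, so a residual is added
            h = out + h if i >= 1 else out
        return h                             # (b, n, hidden_dim)
```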
2.3. Output Layer
The output layer receives as input the sequence of states $h_1, h_2, ..., h_n$ obtained from the encoder and passes them in parallel to four fully-connected layers, each responsible for solving a single tagging problem, as mentioned above. It has been shown that in such sequence labelling tasks it is beneficial to consider the correlation between labels in neighbourhoods, specifically when there are strong dependencies across the output labels; for example, the tag B is more likely to be followed by the tag I. Thus, instead of modeling tagging decisions independently, we model them jointly using a Conditional Random Field (CRF) [13].

Formally, let $H = \{h_1, h_2, ..., h_n\}$ be the sequence of vectors to be labeled, which is produced by the encoder layer, and $Y_t = \{y_1, y_2, ..., y_n\}$ the corresponding tag sequence for the task $t$. Each element $y_i$ of $Y_t$ is one of the $B\text{-}label$, $I\text{-}label$, $O$, $E\text{-}label$ or $S\text{-}label$ tags. Both $H$ and $Y_t$ are assumed to be random variables and they are jointly modeled. The entire model can be represented as an undirected graph $G = (V, E)$ with cliques $C$. In this work we employed a linear-chain CRF, where $G$ is a simple chain or line: $G = (V = \{1, 2, ..., n\}, E = \{(i, i+1)\})$. It has two different cliques (i.e., $C = \{P, M\}$): a unary clique ($P$) representing the input-output connection, and a pairwise clique ($M$) representing the adjacent output connection.

We consider $P$ to be the matrix of output scores, where $P_{i,j}$ corresponds to the score of the $j$-th tag of the $i$-th word in a sentence and is computed as follows:

$P_{i,j} = W_j h_i + b_j$   (4)

In this equation, the parameters are $W_j \in \mathbb{R}^{2 d_h}$ and $b_j \in \mathbb{R}$, where $d_h$ is the dimensionality of the hidden state. The clique $M$ is considered to be the matrix of transition scores, such that $M_{i,j}$ represents the score of a transition from tag $i$ to tag $j$. Given that, we define the score of a sequence of predictions as follows:

$s(H, Y_t) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} M_{y_i, y_{i+1}}$   (5)

In this expression, $y_0$ and $y_{n+1}$ denote the start and end tags of the sentence, which we add to the set of possible tags. A softmax over all possible tag sequences $Y_t^*$ for a sequence $H$ yields a probability for the sequence $Y_t$:

$p(Y_t | H) = \dfrac{e^{s(H, Y_t)}}{\sum_{\tilde{y} \in Y_t^*} e^{s(H, \tilde{y})}}$   (6)

During training, we minimize the negative log-probability of the correct tag sequence:

$J_t = -\log p(Y_t | H) = -s(H, Y_t) + \log \sum_{\tilde{y} \in Y_t^*} e^{s(H, \tilde{y})}$   (7)

During inference, we search for the output sequence $\hat{Y}_t$ that obtains the highest probability:

$\hat{Y}_t = \arg\max_{\tilde{Y}_t \in Y_t^*} p(\tilde{Y}_t | H)$   (8)

In this model, Eq. 7 and Eq. 8 can be computed efficiently using dynamic programming.

2.4. Training Procedure
Let $t$ refer to one of the tagging tasks in $T$ and $J_t$ be the objective function of that task, computed by Eq. 7. We define the joint objective function as follows:

$J = \dfrac{1}{|T|} \sum_{t \in T} J_t$   (9)

In this expression, $|T|$ denotes the number of tasks. The derivative of the objective function $J$, Eq. 9, is taken through back-propagation with respect to the whole set of parameters of the model, i.e., the transition matrices $M$, the parameters of the embedding and encoder layers, and the parameters of the output score matrices $P$ defined in Eq. 4. The parameters are optimized using the Adam optimizer with a learning rate of 0.0001. To reduce the effect of exploding gradients, we set the gradient clipping threshold to 5. We apply dropout [14] between the embedding layer and the recurrent layers with a probability of 0.5 to prevent over-fitting. A sketch of the CRF output layer and the joint objective is given below.
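As an illustration of Sections 2.3 and 2.4, the following PyTorch sketch implements a linear-chain CRF head (Eqs. 4-8) and the averaged joint objective (Eq. 9) for a single sentence. Explicit start/end tags, batching and masking are omitted for brevity, and the hidden size and tag inventories are placeholders, so this is a simplified sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CRFHead(nn.Module):
    """One tagging head: emission scores P (Eq. 4), transition matrix M, NLL (Eq. 7), Viterbi (Eq. 8)."""
    def __init__(self, hidden_dim, n_tags):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_tags)                # P_{i,j} = W_j h_i + b_j
        self.trans = nn.Parameter(torch.zeros(n_tags, n_tags))   # M[i, j]: score of moving from tag i to tag j

    def _score(self, P, tags):          # s(H, Y_t), Eq. (5), without explicit start/end tags
        emit = P[torch.arange(len(tags)), tags].sum()
        move = self.trans[tags[:-1], tags[1:]].sum()
        return emit + move

    def _log_partition(self, P):        # log-sum over all tag sequences (forward algorithm)
        alpha = P[0]                                              # (n_tags,)
        for t in range(1, P.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans + P[t].unsqueeze(0), dim=0)
        return torch.logsumexp(alpha, dim=0)

    def nll(self, h, tags):             # Eq. (7) for one sentence; tags is a LongTensor of gold tag ids
        P = self.proj(h)                                          # (n, n_tags)
        return self._log_partition(P) - self._score(P, tags)

    def decode(self, h):                # Eq. (8), Viterbi decoding
        P = self.proj(h)
        score, back = P[0], []
        for t in range(1, P.size(0)):
            best, idx = (score.unsqueeze(1) + self.trans).max(dim=0)
            back.append(idx)
            score = best + P[t]
        tags = [int(score.argmax())]
        for idx in reversed(back):                                # backtrack the best path
            tags.append(int(idx[tags[-1]]))
        return list(reversed(tags))

def joint_loss(heads, h, tag_seqs):
    """Eq. (9): average of the per-task CRF negative log-likelihoods (NER, Norm #1-#3)."""
    return sum(head.nll(h, tags) for head, tags in zip(heads, tag_seqs)) / len(heads)
```

In E2ENC, the four heads share the same encoder output $H$, and the averaged loss is the quantity minimized with Adam as described in Section 2.4.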
3. Experiments and Results
In this section, we discuss the dataset used and the experimental settings devised to evaluate our system.

3.1. Datasets
We trained and fine-tuned our system on the training and development sets provided by the organizers of the CANTEMIST challenge, respectively. The statistical description of the datasets is shown in Table 1. After that, we submitted the predictions produced by our system on the test set for evaluation; the organizers withheld the gold labels of the test set for the three tasks. We used the Train split to train the model, the Dev1 split to fine-tune the system and find the best parameters, and the Dev2 split to test the system performance locally.

Table 1
The number of documents in each split of the dataset.

#   Split   Num. of Documents
1   Train   501
2   Dev1    250
3   Dev2    250
4   Test    300

3.2. Evaluation Metrics
The official evaluation metrics used to validate the performance of the system are Precision, Recall and F1-score for Cantemist-NER and Cantemist-NORM. For Cantemist-CODING, Mean Average Precision (mAP) was used.

3.3. Results
Table 2 shows the results of our submitted system on the Test and Dev2 sets. It also shows the results of the baseline system on the Test set, which is simply a dictionary lookup based on the Levenshtein distance: it looks for training and development annotations in the test set.

Table 2
The performance of our system on the Test set compared with its performance on the Dev2 set and with the baseline system.

                       Sub-Task 1 (NER)     Sub-Task 2 (NORM)    Sub-Task 3 (CODING)
System                 P     R     F1       P     R     F1       mAP
Dev2 (Our System)      84.7  85.2  84.9     77.7  78.2  77.9     75.0
Test (Our System)      83.7  84.0  83.9     77.5  77.9  77.7     73.7
Test (Baseline)        18.1  73.3  29.1     18.0  73.0  28.8     58.4

From the reported results, we can note that, in general, our system outperforms the baseline system and gives comparable performance on the Dev2 and Test sets. The largest difference between the performance of the system on the Dev2 set and on the Test set is recorded for the Recall metric (1% for the NER task and 0.3% for the NORM task); for the CODING task, the difference is 1.3%. Another remarkable observation is that our system gives similar performance across all the evaluation metrics, which shows its consistency. It is also clear that the performance of the system on the NORM task affects the outcome of the CODING task, as we only take the distinct codes as the final ranking result for the CODING task. To overcome this issue, we plan to develop a ranking model that takes as input the extracted entities and their codes as a candidate set and produces a ranked list of codes. We leave this extension for future work.

4. Conclusion
In this paper, we described in detail the implementation of the End-to-End Neural Coder (E2ENC) system, including all the techniques that are crucial to its accuracy and robustness. The system is used to solve the tasks defined in the CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) shared task. E2ENC contains four sub-models that are trained jointly. The first one aims to automatically extract tumor morphology mentions in health reports written in Spanish. The other three sub-models perform the normalization task and return all tumor morphology entity mentions with their corresponding eCIE-O-3.1 codes. E2ENC provides an end-to-end solution and does not require any external tools or other linguistic resources. The effectiveness of the proposed system was evaluated by participating in the CANTEMIST shared task. The reported results show that the proposed system is stable and consistent. In our future work, we plan to perform an extensive error analysis to inspect the performance of the system and improve it. For example, we plan to use a transformer-based model such as BERT [15] as a pre-trained embedding model instead of the current embedding layer.

References
[1] L. Sweeney, Replacing personally-identifying information in medical records, the scrub system, in: Proceedings of the AMIA Annual Fall Symposium, American Medical Informatics Association, 1996, p. 333.
[2] I. Neamatullah, M. M. Douglass, H. L. Li-wei, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, G. D. Clifford, Automated de-identification of free-text medical records, BMC Medical Informatics and Decision Making 8 (2008) 32.
[3] A. Coden, G. Savova, I. Sominsky, M. Tanenblatt, J. Masanz, K. Schuler, J. Cooper, W. Guan, P. C. De Groen, Automatically extracting cancer disease characteristics from pathology reports into a disease knowledge representation model, Journal of Biomedical Informatics 42 (2009) 937–949.
[4] Z. Liu, Y. Chen, B. Tang, X. Wang, Q. Chen, H. Li, J. Wang, Q. Deng, S. Zhu, Automatic de-identification of electronic medical records using token-level and character-level conditional random fields, Journal of Biomedical Informatics 58 (2015) S47–S52.
[5] Ö. Uzuner, Y. Luo, P. Szolovits, Evaluating the state-of-the-art in automatic de-identification, Journal of the American Medical Informatics Association 14 (2007) 550–563.
[6] Z. Liu, B. Tang, X. Wang, Q. Chen, De-identification of clinical notes via recurrent neural network and conditional random field, Journal of Biomedical Informatics 75 (2017) S34–S42.
[7] M. Jabreel, F. Hassan, A. Moreno, Target-dependent sentiment analysis of tweets using bidirectional gated recurrent neural networks, in: Advances in Hybridization of Intelligent Methods, Springer, 2018, pp. 39–55.
[8] M. Jabreel, F. Hassan, D. Sánchez, J. Domingo-Ferrer, A. Moreno, E2EJ: Anonymization of Spanish medical records using end-to-end joint neural networks, 2019.
[9] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[10] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[11] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[12] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[13] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
[14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580 (2012).
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).