<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-task Learning Applied to Biomedical Named Entity Recognition Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tahir Mehmood</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfonso Gerevini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Lavelli</string-name>
<email>lavelli@fbk.eu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
<email>ivan.serina@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Brescia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive, 18 - 38123 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recent deep learning techniques have shown significant improvements in the biomedical named entity recognition task. However, such techniques still face challenges; one of them is the limited availability of annotated text data. In this perspective, a multi-task approach, by simultaneously training different related tasks, enables multi-task models to learn common features among the tasks through shared layers. It is desirable to use stacked long short-term memory networks (LSTMs) in such models to deal with large amounts of training data and to learn the underlying hidden structure of the data. However, stacking LSTMs also leads to the vanishing gradient problem. To alleviate this limitation, we propose a multi-task model based on convolutional neural networks, stacked LSTMs, and conditional random fields, which uses embedding information at different layers. The proposed model shows results comparable to state-of-the-art approaches. Moreover, we performed an empirical analysis of different variations of the proposed model to assess their impact.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Named entity recognition (NER) consists in
recognizing chunks of text and labelling them with
predefined categories (e.g., person name,
organization, location, etc.). NER is an information
extraction task and has many applications, for
instance in co-reference resolution, question
answering systems, machine translation, and
information retrieval
        <xref ref-type="bibr" rid="ref2">(Chieu and Ng, 2002)</xref>
        . NER is
also performed on biomedical data, where it
involves recognizing biomedical concepts (e.g., cell,
chemical, drug, disease, etc.) and classifying them
into predetermined categories. This is referred to as
biomedical named entity recognition (BioNER).
Large amounts of medical data are available as
free, unstructured text, and the quantity of
annually generated biomedical data such as books,
scientific papers, and other publications makes it
challenging for physicians to stay up to date.
      </p>
      <p>Copyright 2019 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        Moreover, biomedical documents are more
complex than general texts, and the names of
the entities show peculiar characteristics. Long
multi-word expressions
(10-ethyl-5-methyl-5,10-dideazaaminopterin), ambiguous words (TNF
alpha can be used for both DNA and Protein)
        <xref ref-type="bibr" rid="ref9">(Gridach, 2017)</xref>
        , and spelling variations (e.g.,
10Ethyl-5-methyl-5,10-dideazaaminopterin vs.
10EMDDA) make the BioNER task even more
challenging
        <xref ref-type="bibr" rid="ref12 ref18 ref19 ref8">(Giorgi and Bader, 2018)</xref>
        . BioNER is also
an important preliminary task for other tasks such as
the extraction of relations between entities (e.g.,
chemical-induced disease relations, drug-drug
interactions, etc.).
      </p>
      <p>Recent applications of deep learning to BioNER
minimize the manual feature engineering process and
at the same time produce promising results. Deep
learning is now the state-of-the-art technique but,
due to the complex structure of biomedical text
data, deep learning models have difficulties in
performing efficiently. Moreover, these systems
require large amounts of input data, while the
available annotated biomedical data are not enough to
train them effectively. Manually
generating annotated biomedical text data is an expensive
and time-consuming job. One solution to address this
limitation is to take advantage of a
multi-task learning approach. Multi-task learning
(MTL) involves training different
but related tasks simultaneously. Such an approach has
shown significant improvements in different fields.</p>
      <p>
        In this paper, we propose a multi-task model
(MTM-CW) using convolutional neural networks
(CNNs)
        <xref ref-type="bibr" rid="ref10 ref11 ref5 ref6">(dos Santos and Guimarães, 2015)</xref>
        , stacked
layers of bidirectional long short-term memory
(BiLSTM) networks, and conditional random fields (CRFs).
Furthermore, we have conducted an empirical
analysis of the impact of different word input
representations on our model.
      </p>
      <p>The rest of the paper is organized as follows:
Section 2 gives a brief background on
multi-task learning, followed by Section 3, where our
multi-task model (MTM-CW) is discussed.
The experimental setup is presented in Section 4, which
is followed by the results and discussion (Section
5). Section 6 concludes and presents possible
future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Multi-task Learning</title>
      <p>In general, the performance of deep learning models
depends strongly on the amount of annotated data
available: they perform better when large amounts
of data are available. Unfortunately, in different
biomedical tasks only a limited quantity of
annotated text data is available, and in this case
deep learning models have difficulty
generalizing well. Moreover, manually annotating new data
is a time-consuming job, and this issue can be
mitigated by using two methods: transfer learning and
multi-task learning.</p>
      <p>
        In transfer learning, the model is partially
trained on an auxiliary task and is then reused on
the main task. This enables the model to fine-tune
the weights of the layers that were learned during
the training on the auxiliary task, and helps the
model to generalize well on the main task, which
implies learning features shared between the
auxiliary and the main task. This method learns
and transfers shallow features from one domain to
another
        <xref ref-type="bibr" rid="ref14">(Luong et al., 2016)</xref>
        .
      </p>
      <p>
        On the other hand, multi-task learning (MTL)
is an approach where different related tasks are
trained simultaneously. Unlike transfer
learning, multi-task learning optimizes the model
for all tasks concurrently. In the MTL approach,
some of the layers in the model are shared among
the different tasks, while others are kept
task-specific. Training jointly on related tasks helps the
multi-task model to learn common features among
different tasks by using the shared layers
        <xref ref-type="bibr" rid="ref1">(Bansal et
al., 2016)</xref>
        . The task-specific layers, usually the
lower layers, learn features that are more related to
the current task. MTL lowers the chances of
overfitting, as the model has to learn a common
representation across all tasks. MTL has been widely
adopted in many different domains
        <xref ref-type="bibr" rid="ref14">(Luong et al.,
2016)</xref>
        .
      </p>
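      <p>The hard parameter sharing described above can be sketched as follows. This is a minimal illustrative example, not the architecture of any of the cited systems: the layer sizes and the two toy NER tasks are our own assumptions, and a plain feed-forward layer stands in for the shared network.</p>
      <preformat>
```python
# Minimal sketch of hard parameter sharing in multi-task learning:
# one shared layer updated by every task, plus per-task output heads.
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden layer: its weights are updated by every task.
W_shared = rng.normal(size=(8, 16))

# Task-specific output layers: each is updated only by its own task.
W_task = {"ner_chemical": rng.normal(size=(16, 3)),
          "ner_disease": rng.normal(size=(16, 5))}

def forward(x, task):
    """Run input x through the shared layer, then the task's own head."""
    h = np.tanh(x @ W_shared)   # shared representation, common to all tasks
    return h @ W_task[task]     # task-specific logits

x = rng.normal(size=(4, 8))     # a toy batch of 4 examples
print(forward(x, "ner_chemical").shape)  # (4, 3)
print(forward(x, "ner_disease").shape)   # (4, 5)
```
      </preformat>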
      <p>Crichton et al. (2017) proposed a multi-task
model (MTM) based on CNNs to perform BioNER.
However, they focused only on word-level
features, ignoring character-level ones. Although
word-level features give much information about
the entities, character-level features help to extract
common sub-word structures among the same
entities. Moreover, depending solely on word-level
features can lead to out-of-vocabulary
problems when a specific word is not found in the
pre-trained word embedding. Wang et al. (2019)
also performed BioNER using different multi-task
models. They found that an MTM with word-level
features and extraction of character-level
features using a BiLSTM enhances the performance of
the model, and concluded that character-level
features should be considered for the BioNER task.
A similar model was proposed by Mehmood et al.
(2019) where, apart from a single shared BiLSTM,
they also introduced a task-specific BiLSTM
to learn the features that are more specific to each
task. The introduction of the task-specific BiLSTM and
the use of a CNN instead of a BiLSTM at the character
level improved performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Our Proposal</title>
      <p>
        Neural networks work on a concept of hierarchical
feature learning
        <xref ref-type="bibr" rid="ref18">(Xiao et al., 2018)</xref>
        . Hierarchical
feature learning is done as sequences propagate
through the network
        <xref ref-type="bibr" rid="ref11">(LeCun et al., 2015)</xref>
        . Deep
learning can learn the complex hierarchical
structure of a sequence with multiple layers.
Moreover, it is desirable to stack LSTMs when
a large amount of training data is available
        <xref ref-type="bibr" rid="ref12">(Li
et al., 2018)</xref>
        . This intuition can be noticed in the
model proposed by Mehmood et al. (2019), where
increasing the number of BiLSTM layers leads to a
performance enhancement. However, moving towards a
deep LSTM network can also cause the vanishing
gradient problem
        <xref ref-type="bibr" rid="ref12">(Li et al., 2018)</xref>
        .
      </p>
      <p>To tackle this issue we propose a model
which injects the input information at different
layers. Our proposed multi-task model with
character and word input representations (MTM-CW)
propagates the input embedding information along
different shared layers, as shown in Figure 1. This
not only helps lower layers to learn the complex
structure from the encoded representation of the
previous layer, but also lets them consider the
input embeddings, so as to overcome the vanishing
gradient problem in stacked LSTMs.</p>
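      <p>The embedding skip connections described above can be sketched as follows. This is only an illustrative shape-level sketch under our own assumptions (a plain feed-forward transformation stands in for each BiLSTM layer, and the dimensions and depth are arbitrary): each stacked layer receives the previous layer's output concatenated with the original embeddings.</p>
      <preformat>
```python
# Sketch of injecting the input embeddings at every stacked layer.
import numpy as np

rng = np.random.default_rng(1)
emb_dim, hid_dim, depth, seq_len = 10, 6, 3, 5

embeddings = rng.normal(size=(seq_len, emb_dim))

# Layer l takes [h_{l-1}; embeddings], so its input is hid_dim + emb_dim
# wide; the first layer sees the embeddings alone.
weights = [rng.normal(size=(emb_dim, hid_dim))]
weights += [rng.normal(size=(hid_dim + emb_dim, hid_dim))
            for _ in range(depth - 1)]

h = np.tanh(embeddings @ weights[0])
for W in weights[1:]:
    # skip connection: re-attach the raw embeddings at each layer
    h = np.tanh(np.concatenate([h, embeddings], axis=-1) @ W)

print(h.shape)  # (5, 6): one hidden vector per token after 3 layers
```
      </preformat>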
      <p>
        Furthermore, using stacked BiLSTMs helps the
hidden states of the BiLSTMs to learn the hidden
structure of the data presented at different levels. This
helps the BiLSTMs to learn features at a more
abstract level. Apart from the shared stacked
BiLSTMs, our model also uses a task-specific BiLSTM
to extract task-specific features.
Furthermore, we use a CNN to extract features at the
character level. Many previous approaches have used a
CNN at the character level
        <xref ref-type="bibr" rid="ref5 ref6">(dos Santos et al., 2015;
Collobert et al., 2011)</xref>
        due to its ability to extract
fine-grained features. A CNN learns global-level
features from local-level features, which enables it
to extract more hidden features. More specifically, the
lower layers in our proposed MTM-CW model are
task-specific. So, for a specific task, both the shared
layers and the layers belonging to that specific task are
activated.
      </p>
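      <p>The character-level CNN described above can be sketched as follows: filters are convolved over the character embeddings of a word and max-pooled over positions to yield a fixed-size word representation. The embedding dimension and convolution window here are our own illustrative assumptions; only the number of filters (30) echoes the setting reported in Section 4.</p>
      <preformat>
```python
# Sketch of character-level feature extraction with a CNN.
import numpy as np

rng = np.random.default_rng(2)
char_emb_dim, n_filters, window = 4, 30, 3

def char_cnn(word, char_emb, filters):
    # (word_len, char_emb_dim) matrix of character embeddings
    C = np.stack([char_emb[c] for c in word])
    # pad so every window position is valid
    pad = np.zeros((window - 1, char_emb_dim))
    C = np.concatenate([pad, C, pad])
    # convolution: each filter responds to every character window ...
    conv = np.stack([
        np.tensordot(C[i:i + window], filters, axes=([0, 1], [1, 2]))
        for i in range(C.shape[0] - window + 1)
    ])
    # ... and max pooling keeps the strongest response per filter
    return conv.max(axis=0)

char_emb = {c: rng.normal(size=char_emb_dim)
            for c in "abcdefghijklmnopqrstuvwxyz-0123456789"}
filters = rng.normal(size=(n_filters, window, char_emb_dim))
print(char_cnn("tnf-alpha", char_emb, filters).shape)  # (30,)
```
      </preformat>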
      <p>[Figure 1: architecture of the proposed MTM-CW model, with word embedding, character embedding, and CNN inputs feeding the shared layers.]</p>
      <p>
        Finally, we use CRFs for output labeling. CRFs
have the ability to tag the current token by
considering neighboring tags at sentence level
        <xref ref-type="bibr" rid="ref10">(Huang et
al., 2015)</xref>
        . Yang et al. (2018) performed
experiments comparing CRF and Softmax and found
that the CRF produces better results than
Softmax.
      </p>
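      <p>The advantage of a CRF at the output layer can be illustrated with Viterbi decoding: unlike Softmax, which labels each token independently, the CRF picks the globally best tag sequence using learned transition scores. The following sketch uses made-up emission and transition scores for a toy tag set; it is not the paper's trained model.</p>
      <preformat>
```python
# Viterbi decoding over emission + transition scores, as a CRF uses.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (seq_len, n_tags) scores; transitions: (n_tags, n_tags)."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # best previous tag for each current tag
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy tags: 0=O, 1=B-Chem, 2=I-Chem. The transition forbids O -> I-Chem.
trans = np.array([[0.0, 0.0, -100.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
emis = np.array([[0.1, 2.0, 0.0],    # looks like B-Chem
                 [0.0, 0.1, 1.5],    # looks like I-Chem
                 [2.0, 0.0, 0.5]])   # looks like O
print(viterbi(emis, trans))  # [1, 2, 0]
```
      </preformat>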
      <p>An alternating approach was adopted
for the training phase. Let us suppose we have
training sets D1, D2, ..., Dt, related to tasks T1, T2,
..., Tt respectively. During training, a
training set Di is selected randomly, and both the shared
layers and the layers specific to the corresponding task
Ti are activated. Every task has its own optimizer,
so during training only the optimizer specific to
task Ti is activated, and the loss function
related to that optimizer is optimized. This means that
the parameters of the shared layers and of the
task-specific layers are changed during the training of
that specific task. Optimizing the parameters of the
shared layers for all the tasks helps the model to
find the common features among the different tasks.</p>
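      <p>The alternating scheme above can be sketched as a training loop. The task names reuse dataset names from Section 4, but the loop body is a placeholder: no actual forward/backward pass or optimizer is implemented here.</p>
      <preformat>
```python
# Sketch of alternating multi-task training: at each step one task's
# training set is sampled and only that task's head (plus the shared
# parameters) would be updated.
import random

random.seed(0)

tasks = ["BC2GM", "NCBI-disease", "linnaeus"]
updates = {t: 0 for t in tasks}  # how often each task-specific head trained
shared_updates = 0

for _ in range(300):
    task = random.choice(tasks)  # pick a training set D_i at random
    # a forward/backward pass would go here; only the optimizer of `task`
    # runs, touching the shared layers plus that task's own layers
    updates[task] += 1
    shared_updates += 1          # shared layers are updated every step

print(shared_updates)            # 300
print(sum(updates.values()))     # 300
```
      </preformat>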
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>
        We performed experiments on the 15 datasets
that were also used by Crichton et al. (2017),
Wang et al. (2019), and Mehmood et al. (2019).
The bio-entities in these datasets are Chemical,
Species, Cell, Gene/Protein, Cell Component, and
Disease; descriptions of the datasets can be
found in Crichton et al. (2017), and the datasets themselves at
https://github.com/cambridgeltl/MTL-Bioinformatics-2016. Moreover, to
represent words, we use domain-specific pre-trained
word embeddings, since generic word embeddings
can cause a high rate of out-of-vocabulary words.
In particular, we use the WikiPubMed-PMC word
embedding, which is trained on a large set of
PubMed Central (PMC) articles and PubMed
abstracts as well as on English Wikipedia articles
        <xref ref-type="bibr" rid="ref12 ref18 ref19 ref8">(Giorgi and Bader, 2018)</xref>
        . On the other hand, the
character embedding is initialized randomly, while the
orthographic (case) embedding is represented by
an identity matrix in which each diagonal 1
represents the presence of a word’s orthographic
feature. Moreover, we analyse the effect of the different
input representations (word level, character level,
and case level) of a word on the performance of
our proposed architecture. Furthermore, this
paper reports average F1-scores, where each
experiment is run 10 times. We use the Nadam
optimizer and a CNN with a
filter size of 30, while each LSTM in the model
consists of 275 units; each experiment is run for 50
epochs with early stopping set to 10 epochs.
      </p>
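      <p>The identity-matrix case embedding described above can be sketched as a one-hot lookup over word shapes. The specific shape classes below are our own illustrative assumptions, not necessarily the exact categories used in the experiments.</p>
      <preformat>
```python
# Sketch of the orthographic (case) embedding: each word shape maps to
# a row of an identity matrix, i.e. a one-hot vector.
import numpy as np

SHAPES = ["all_lower", "init_capital", "all_capital", "has_digit", "other"]
CASE_EMB = np.eye(len(SHAPES))  # row i is the one-hot vector of shape i

def word_shape(word):
    if any(c.isdigit() for c in word):
        return "has_digit"
    if word.isupper():
        return "all_capital"
    if word[:1].isupper() and word[1:].islower():
        return "init_capital"
    if word.islower():
        return "all_lower"
    return "other"

def case_feature(word):
    return CASE_EMB[SHAPES.index(word_shape(word))]

print(word_shape("TNF"))        # all_capital
print(case_feature("Brescia"))  # [0. 1. 0. 0. 0.]
```
      </preformat>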
    </sec>
    <sec id="sec-5">
      <title>5 Results and Discussion</title>
      <p>
        In Table 1 we compare the results produced by
our model with state-of-the-art models
        <xref ref-type="bibr" rid="ref15 ref17">(Wang et
al., 2019; Mehmood et al., 2019)</xref>
        . We can see a
substantial improvement in the F1-score of
MTM-CW compared to these models. However, to
observe whether connecting the embedding layers to the
middle layers has truly contributed to the
performance of the model, we made a variation of the
model and dropped the skip connections coming
from the embedding layers (refer to Figure 1).
Dropping these skip connections makes our model
similar to the model by Mehmood et al. (2019), to which
we have added another layer of shared
BiLSTM. The effect of this variation is reported in
Table 2, where it can be noted that a few datasets
show a moderate performance increase while for
most of them performance degrades. This
supports our intuition that passing the embedding-layer
information to the lower layers has a positive
impact on the model. Moreover, it is interesting that,
even after dropping those skip connections, our
model is still able to perform better than
state-of-the-art models. This suggests that, with an
increasing number of training examples, more layers
of LSTM should be considered
        <xref ref-type="bibr" rid="ref12">(Li et al., 2018)</xref>
        .
For this reason, the model proposed by Mehmood
et al. (2019) performed better than the model
proposed by Wang et al. (2019), which used a single
layer of LSTM.
      </p>
      <p>
        We then extended our experiments by
introducing an orthographic-level representation of a word in
our model. Dugas and Nichols (2016),
Segura-Bedmar et al. (2015), and Huang et al. (2015) have
shown that orthographic-level information can
improve a model's performance. In addition,
statistical models (e.g., a CRF at the output layer) are
also highly dependent on hand-crafted features
        <xref ref-type="bibr" rid="ref1 ref13 ref7">(Limsopatham and Collier, 2016)</xref>
        . In this work,
the orthographic-level feature includes
information on the structure of the word, i.e., whether the
word starts with a capital letter followed by
lowercase letters, consists entirely of
capital letters, contains digits, etc. Table 2 reports
the comparison between MTM-CW and its
variant with orthographic-level features (we name
this variant MTM-CW-Case). We observe that, for some
datasets, orthographic-level features moderately
improved the results. Thus, we can conclude that
orthographic-level features might help the model
to implicitly learn hidden features at the
orthographic level, which could be helpful for some
entities. However, for simplicity we limit our
work to explicitly representing the word-level
features; thus we stick to the character-level
representation and the word itself. We also replaced the CRF
with Softmax at the output layer to see the impact
of both methods on predicting the output labels of
the entities. Table 2 also reports the comparison
of our proposed model with Softmax
(MTM-CW-Softmax) and with CRF (the proposed MTM-CW) at the
output layer; the model with CRF produces better
results than the model with Softmax.
      </p>
      <p>
        To statistically evaluate the results obtained by
different variants of our model we perform the
Friedman test
        <xref ref-type="bibr" rid="ref20">(Zimmerman and Zumbo, 1993)</xref>
        .
We also analyse the pairwise comparisons between the
different models to see which model is statistically
better than the others. The graphical representation of
the pairwise comparisons is shown in Figure 2, where it
can be seen that the variant of the model with
softmax (MTM-CW-Softmax, represented as just
Softmax) is statistically worse than
the other variants of the model.
Figure 3 shows the post-hoc Conover-Friedman test,
where it can be seen that the differences between the
results produced by all the models are significant,
with different p-values.
      </p>
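      <p>The Friedman test used above ranks the compared models within each dataset and tests whether their average ranks differ. The following sketch implements the chi-squared form of the statistic over made-up F1-scores (15 datasets, 3 model variants); the scores are illustrative, not the paper's results, and ties are not handled in this simplified ranking.</p>
      <preformat>
```python
# Sketch of the Friedman test over per-dataset model scores.
import numpy as np

def friedman_statistic(scores):
    """scores: (n_datasets, k_models) matrix of F1-scores."""
    n, k = scores.shape
    # rank models within each dataset (rank 1 = worst score here;
    # any consistent orientation gives the same statistic)
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    mean_ranks = ranks.mean(axis=0)
    # chi-squared approximation of the Friedman statistic
    return 12 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)

rng = np.random.default_rng(3)
base = rng.uniform(70, 90, size=(15, 1))       # 15 datasets
scores = base + np.array([[0.0, 3.0, 6.0]])    # 3 model variants
scores += rng.normal(scale=0.1, size=scores.shape)
# one variant dominates on every dataset, so ranks are identical across
# datasets and the statistic reaches its maximum for n=15, k=3
print(friedman_statistic(scores))  # 30.0
```
      </preformat>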
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions</title>
      <p>In this paper we showed that BioNER
performance can be drastically improved by using
a multi-task approach. We showed that using
stacked LSTMs in such models is effective to
learn the hidden structure of the data. Moreover, the
vanishing gradient problem arising from the use of
stacked LSTMs is addressed by passing the
embedding information to the different layers. We showed
that our model outperforms the
state-of-the-art models in F1-score.</p>
      <p>For future work, we will extend the multi-task
approach to the relation extraction task. In such an
approach, BioNER can be used as an auxiliary task
while keeping relation extraction as the main
task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Trapit</given-names>
            <surname>Bansal</surname>
          </string-name>
          , David Belanger, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ask the GRU: Multi-task learning for deep text recommendations</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems</source>
          , pages
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Hai Leong</given-names>
            <surname>Chieu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hwee Tou</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Named entity recognition: a maximum entropy approach using global information</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
          <year>2011</year>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (Aug):
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Gamal</given-names>
            <surname>Crichton</surname>
          </string-name>
          , Sampo Pyysalo, Billy Chiu, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A neural network multi-task learning approach to biomedical named entity recognition</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>368</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cícero Nogueira</given-names>
            <surname>dos Santos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Guimarães</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Boosting named entity recognition with neural character embeddings</article-title>
          .
          <source>In Proceedings of the Fifth Named Entity Workshop, NEWS@ACL 2015</source>
          , Beijing, China, July 31,
          <year>2015</year>
          , pages
          <fpage>25</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Cícero</given-names>
            <surname>dos Santos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Guimarães</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Boosting named entity recognition with neural character embeddings</article-title>
          .
          <source>In Proceedings of NEWS 2015, The Fifth Named Entities Workshop</source>
          , page 25.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Fabrice</given-names>
            <surname>Dugas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Nichols</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>DeepNNNER: Applying BLSTM-CNNs and extended lexicons to named entity recognition in tweets</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source>
          , pages
          <fpage>178</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>John M.</given-names>
            <surname>Giorgi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gary D.</given-names>
            <surname>Bader</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Transfer learning for biomedical named entity recognition with neural networks</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>34</volume>
          (
          <issue>23</issue>
          ):
          <fpage>4087</fpage>
          -
          <lpage>4094</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Mourad</given-names>
            <surname>Gridach</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Character-level neural network for biomedical named entity recognition</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>70</volume>
          :
          <fpage>85</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          , Wei Xu, and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>CoRR, abs/1508.01991</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>521</volume>
          (
          <issue>7553</issue>
          ):
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jinyu</given-names>
            <surname>Li</surname>
          </string-name>
          , Changliang Liu, and
          <string-name>
            <given-names>Yifan</given-names>
            <surname>Gong</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Layer trajectory LSTM</article-title>
          .
          <source>In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association</source>
          , Hyderabad, India, 2-6
          <year>September 2018</year>
          , pages
          <fpage>1768</fpage>
          -
          <lpage>1772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Nut</given-names>
            <surname>Limsopatham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Collier</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning orthographic features in bi-directional LSTM for biomedical named entity recognition</article-title>
          .
          <source>In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining</source>
          ,
          <source>BioTxtM@COLING</source>
          <year>2016</year>
          , Osaka, Japan, December
          <volume>12</volume>
          ,
          <year>2016</year>
          , pages
          <fpage>10</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Ilya Sutskever, Oriol Vinyals, and
          <string-name>
            <given-names>Lukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Multi-task sequence to sequence learning</article-title>
          .
          <source>In 4th International Conference on Learning Representations, ICLR</source>
          <year>2016</year>
          , San Juan, Puerto Rico, May 2-4,
          <year>2016</year>
          , Conference Track Proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Tahir</given-names>
            <surname>Mehmood</surname>
          </string-name>
          , Alfonso Gerevini, Alberto Lavelli, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Serina</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Leveraging multi-task learning for biomedical named entity recognition</article-title>
          .
          <source>In International Conference of the Italian Association for Artificial Intelligence</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Isabel</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Víctor</given-names>
            <surname>Suárez-Paniagua</surname>
          </string-name>
          , and Paloma Martínez.
          <year>2015</year>
          .
          <article-title>Exploring word embedding for drug name recognition</article-title>
          .
          <source>In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, Louhi@EMNLP 2015</source>
          , Lisbon, Portugal,
          <year>September 17</year>
          ,
          <year>2015</year>
          , pages
          <fpage>64</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han.
          <year>2019</year>
          .
          <article-title>Cross-type biomedical named entity recognition with deep multi-task learning</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>35</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1745</fpage>
          -
          <lpage>1752</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Cao</given-names>
            <surname>Xiao</surname>
          </string-name>
          , Edward Choi, and
          <string-name>
            <given-names>Jimeng</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review</article-title>
          .
          <source>JAMIA</source>
          ,
          <volume>25</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1419</fpage>
          -
          <lpage>1428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Jie</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shuailong</given-names>
            <surname>Liang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Design challenges and misconceptions in neural sequence labeling</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics, COLING</source>
          <year>2018</year>
          , Santa Fe, New Mexico, USA,
          <year>August</year>
          20-26,
          <year>2018</year>
          , pages
          <fpage>3879</fpage>
          -
          <lpage>3889</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Donald W.</given-names>
            <surname>Zimmerman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bruno D.</given-names>
            <surname>Zumbo</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks</article-title>
          .
          <source>The Journal of Experimental Education</source>
          ,
          <volume>62</volume>
          (
          <issue>1</issue>
          ):
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>