<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Taking a Dive: Experiments in Deep Learning for Automatic Ontology-based Annotation of Scientific Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prashanti Manda</string-name>
          <email>manda@uncg.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas Beasley</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Somya D. Mohanty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of North Carolina at Greensboro</institution>
          ,
          <addr-line>NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Equal contributions</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>10</lpage>
      <abstract>
<p>Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning to ontology-based curation is a relatively new area, and prior work has focused on a limited set of models. Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model's performance. We also compare our model to seven models from prior work. We use four metrics - Precision, Recall, F1 score, and a semantic similarity metric (Jaccard similarity) - to compare each model's output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, an 83% F1 score, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models as compared to word-only inputs. These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>II. INTRODUCTION</title>
      <p>
        Ontology-based data representation has been widely adopted
in data intensive fields such as biology and biomedicine due
to the need for large scale computationally amenable data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
However, the majority of ontology-based data generation relies
on manual literature curation - a slow and tedious process
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Natural language and text mining methods have been
developed as the solution for scalable ontology-based data
curation [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        One of the most important tasks for annotating scientific
literature with ontology concepts is Named Entity Recognition
(NER). In the context of ontology-based annotation, NER can
be described as recognizing ontology concepts from text [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Outside the scope of ontology-based annotation, NER has been
applied to biomedical and biological literature for recognizing
genes, proteins, diseases, etc. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The large majority of ontology driven NER techniques rely
on lexical and syntactic analysis of text in addition to machine
learning for recognizing and tagging ontology concepts [
        <xref ref-type="bibr" rid="ref3 ref4 ref6">3, 4,
6</xref>
        ]. In recent years, deep learning has been introduced for NER
of biological entities from literature [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
        ]. However,
the majority of prior work has focused on a limited set of
models, particularly, the Long Short-Term Memory (LSTM)
model (e.g. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>Here, we present a new deep learning architecture that
utilizes Gated Recurrent Units (GRU) while taking advantage
of word and character encodings from the annotation training
data to recognize ontology concepts from text. We evaluate
our model in comparison to 7 deep learning models used in
prior work to show that our model outperforms the state of the art
at the task of ontology-based NER.</p>
      <p>
        We use the Colorado Richly Annotated Full-Text (CRAFT)
corpus [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as a Gold Standard reference to develop and
evaluate the deep learning models. The CRAFT corpus contains
67 open access, full length biomedical articles annotated with
concepts from several ontologies (such as Gene Ontology,
Protein Ontology, Sequence Ontology, etc.). We use four
metrics - 1) Precision, 2) Recall, 3) F-1 Score and 4) Jaccard
semantic similarity to compare each model’s performance to
the Gold Standard.
      </p>
      <p>
        Precision and Recall are traditionally used to assess the
performance of information retrieval systems. However, these
metrics do not take into account the notion of partial
information retrieval which is important for ontology-based annotation
retrieval. Sometimes, an NLP system might not retrieve the
same ontology concept as the gold standard but a related
concept (sub-class or super-class). To assess the performance
of the NLP system accurately, we need semantic
similarity metrics that can measure different degrees of semantic
relatedness between ontology concepts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Here, we use
Jaccard similarity to compare annotations from each deep
learning model to the gold standard. Jaccard similarity assesses
similarity between two ontology terms based on the ontological
distance between them - the closer two terms are, the more
similar they are considered to be [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>III. RELATED WORK</p>
      <p>
        The application of deep learning for ontology-based Named
Entity Recognition is a nascent area with relatively little prior
work. Habibi et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] studied entity recognition on biomedical
literature using long short-term memory network-conditional
random field (LSTM-CRF) and showed that the method
outperformed other NER tools that do not use deep learning or
use deep learning methods without word embeddings. Lyu et
al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also explored LSTM based models enhanced with
word and character embeddings. They do not evaluate other
deep learning models but present results only based on LSTM
with word embeddings. Wang et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also propose a
LSTM based method for recognizing biomedical entities from
literature. Similar to the above studies, Wang et al. show that
a bidirectional LSTM method used with Conditional Random
Field (CRF) and word embeddings outperforms other methods.
      </p>
      <p>The striking difference between these prior studies and our
work here is that the majority of prior literature focuses on
LSTM based methods along with CRF and word embeddings.
The potential of other deep learning models such as Recurrent
Neural Networks, Gated Recurrent Units, etc., at the task of
ontology-based NER remains unexplored presenting a unique
need and opportunity. Our study aims to fill this knowledge
gap. In addition, all the above studies focus on non-ontology
based NER for entities such as genes, disease names, etc. In
contrast, our study’s focus is on recognizing ontology concepts
within text.</p>
      <p>IV. METHODS</p>
    </sec>
    <sec id="sec-2">
      <title>A. Data Preprocessing</title>
      <p>Annotation files for the 67 papers in CRAFT were cleaned
to remove punctuation symbols (except for period at the end
of sentences), special symbols, and non-ASCII characters.
Annotations for GO, CHEBI, Cell, Protein, and Sequence
ontologies were converted from the cleaned files to separate
ontology-specific text files that represent the presence or
absence of ontology terms. For each ontology, every sentence
containing at least one annotation from that ontology was
represented using two lines in the ontology-specific text file.
The first of these two lines contained an array with each word
in the sentence. The second contained an ordered encoding
corresponding to words in the first line. These encodings could
be an ontology concept ID if the corresponding word was
annotated in CRAFT, or an 'O' if the corresponding word was
not annotated.</p>
      <p>For example, the sentence “Rod and cone photoreceptors
subserve vision under dim and bright light conditions
respectively” where the word “vision” was annotated to GO
ID “GO:0007601 (perception of sight)” would be represented
using the two lines below:
[’Rod’, ’and’, ’cone’, ’photoreceptors’, ’subserve’,
’vision’, ’under’, ’dim’, ’and’, ’bright’, ’light’,
’conditions’, ’respectively’]
[’O’, ’O’, ’O’, ’O’, ’O’, ’GO:0007601’, ’O’, ’O’, ’O’,
’O’, ’O’, ’O’, ’O’]</p>
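      <p>As an illustration (not the authors' preprocessing code), the two-line representation above can be produced with a short Python sketch, given a mapping from annotated words to ontology concept IDs:</p>
      <preformat>
```python
# Illustrative sketch: build the word/tag pair of lines for one sentence.
# "annotations" maps annotated words to their ontology concept IDs.
def encode_sentence(sentence, annotations):
    """Return (words, tags) where tags[i] is the concept ID for
    words[i] if that word was annotated, else 'O'."""
    words = sentence.split()
    tags = [annotations.get(w, "O") for w in words]
    return words, tags
```
      </preformat>
      <p>For the example sentence, calling this function with the mapping {"vision": "GO:0007601"} yields the two arrays shown above.</p>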
      <p>Only annotations to single words (unigrams) were included
in these preprocessed files. Thus, if an annotation was made in
CRAFT to a phrase containing more than one word, it was
omitted from the preprocessed data.</p>
    </sec>
    <sec id="sec-3">
      <title>B. Performance evaluation metrics</title>
      <p>
        Precision, Recall, F1-score, and Jaccard similarity were used
to evaluate the performance of the models. The Jaccard
similarity (J ) of two ontology concepts (in this case, annotations)
(A, B) in an ontology is defined as the ratio of the number of
classes in the intersection of their subsumers over the number
of classes in their union of their subsumers [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>J(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|</p>
      <p>where S(A) is the set of classes that subsume A. Jaccard
similarity ranges from 0 (no similarity) to 1 (exact match).</p>
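      <p>The ratio above can be sketched directly in Python, given the subsumer sets of the two concepts (an illustration only; obtaining the subsumer sets from an ontology is not shown):</p>
      <preformat>
```python
# Jaccard similarity of two ontology concepts, given the sets of
# classes that subsume each concept (S(A) and S(B)).
def jaccard(subsumers_a, subsumers_b):
    a, b = set(subsumers_a), set(subsumers_b)
    return len(a.intersection(b)) / len(a.union(b))
```
      </preformat>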
    </sec>
    <sec id="sec-4">
      <title>C. Deep learning models</title>
      <p>Below, we describe four deep learning models -
Multi-layer Perceptrons, Recurrent Neural Networks, Long
Short-Term Memory, and Gated Recurrent Units. Next, we describe
three architectures - window based, word based, and
character+word based - that can be used in conjunction with the above
models. Finally, we describe our new model, which combines the
character+word based architecture with Gated Recurrent Units,
and six models used in prior work.</p>
      <p>
        1) Multi-Layer Perceptron (MLP): A Multi-Layer
Perceptron (MLP) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a feed-forward deep-neural network model
which consists of an input, single/multiple hidden, and an
output layer, each consisting of a number of perceptrons. A single
perceptron computes its output as y = φ( Σ_{i=1}^{n} w_i x_i + b ),
where w is the weight vector, x is the provided input, b is
the bias, and φ is the activation function. The weights and
biases of each perceptron in the layers are adjusted using
backpropagation to minimize prediction error.
      </p>
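      <p>As an illustration of the single-perceptron computation above (a minimal sketch, here assuming a logistic-sigmoid activation, which the paper does not specify):</p>
      <preformat>
```python
import math

# One perceptron: y = phi(sum_i w_i * x_i + b), with phi chosen
# here as the logistic sigmoid for illustration.
def perceptron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```
      </preformat>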
      <p>
        2) Recurrent Neural Network: A Recurrent Neural
Network (RNN) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is an adaption of feed-forward neural
networks, where history of the input sequence is taken into
consideration for future prediction. Given an input sequence
&lt; x_0, x_1, x_2, ..., x_i &gt;, the hidden state (h_t) of an RNN is
updated as follows:
h_t = 0, for t = 0
h_t = σ(W_hh·h_{t-1} + W_hx·x_t), for t &gt; 0
y_t = softmax(W_s·h_t)
(1)
      </p>
      <p>where x_t is the input provided to the hidden state h_t at
time t, which is updated using a sigmoid function σ. h_t is
calculated over the previous time state of the network, given
by h_{t-1}, and the current input x_t. W_hh, W_hx, and W_s are
the weights computed over training. The network can then produce
an output prediction &lt; y_0, y_1, y_2, ..., y_j &gt; using a softmax
function on the hidden state h_t.</p>
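      <p>A minimal pure-Python sketch of one RNN time step and the softmax output, following the update equations above (illustrative only, with arbitrary small dimensions; not the authors' implementation):</p>
      <preformat>
```python
import math

# One RNN step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t),
# with weights given as nested lists for illustration.
def rnn_step(x_t, h_prev, W_hh, W_hx):
    h_new = []
    for i in range(len(h_prev)):
        z = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_hx[i][j] * x_t[j] for j in range(len(x_t)))
        h_new.append(1.0 / (1.0 + math.exp(-z)))
    return h_new

# y_t = softmax(W_s h_t): normalized exponentials of the scores.
def softmax(v):
    m = max(v)
    e = [math.exp(vi - m) for vi in v]
    s = sum(e)
    return [ei / s for ei in e]
```
      </preformat>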
      <p>
        A bidirectional Recurrent Neural Network (BiRNN) is an
RNN where the input data is fed to the neural network two
times - once in forward and again in reverse order.
</p>
      <p>
        3) Long-Short Term Memory: While RNNs are effective
in learning temporal patterns, they suffer from a vanishing
gradient problem where long term dependencies are lost. A
solution to the problem was proposed by Hochreiter et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
by using a variation of RNNs called Long-Short Term Memory
(LSTM). LSTMs use a memory cell (c_t) to keep track of
long-term relationships in text. Using a gated architecture
(input, output, and forget), LSTMs are able to modulate the
exposure of a memory cell by regulating the gates. LSTMs
can be defined as:
i_t = σ(W_ix·x_t + W_ih·h_{t-1})
f_t = σ(W_fx·x_t + W_fh·h_{t-1})
o_t = σ(W_ox·x_t + W_oh·h_{t-1})
g_t = tanh(W_gx·x_t + W_gh·h_{t-1})
c_t = c_{t-1} ⊙ f_t + g_t ⊙ i_t
h_t = tanh(c_t) ⊙ o_t
(2)
      </p>
      <p>where i_t, f_t, and o_t are the input, forget, and output gates
respectively. Each gate uses a sigmoid (σ) function applied
over the sum of the input x_t and the previous hidden state h_{t-1}
(multiplied with their weight matrices W). g_t denotes the
candidate state, computed with a tanh function over the input
and previous hidden state. W_ix, W_fx, W_ox, and W_gx are weight
matrices used with the input x_t, while W_ih, W_fh, W_oh, and W_gh
are used with the hidden state for each gate and the candidate state.
The memory cell c_t multiplies (⊙, element-wise) the forget gate f_t
with the old memory cell c_{t-1} and adds the candidate state g_t
multiplied with the input gate i_t. The hidden state is given by a
tanh function applied to the memory cell c_t, multiplied with the
output gate o_t.</p>
      <p>
        4) Gated Recurrent Unit: A variation on the LSTM was
introduced by Cho et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as the Gated Recurrent Unit (GRU).
Using update and reset gates, GRUs are able to control the amount
of information within a unit (without a separate memory cell
as in the LSTM). GRUs can formally be defined as:
z_t = σ(W_zx·x_t + W_zh·h_{t-1})
r_t = σ(W_rx·x_t + W_rh·h_{t-1})
h~_t = tanh(W_x·x_t + r_t ⊙ W_h·h_{t-1})
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h~_t
(3)
      </p>
      <p>where z_t and r_t are the update and reset gates respectively, and
h~_t is the candidate activation/hidden state.</p>
      <p>Similar to the LSTM architecture, GRUs benefit from the
additive properties of their network to remember long term
dependencies and solve the vanishing gradient problem. Since
GRUs do not utilize an output gate, they are able to write
the entire contents of their memory cell to the network. The
lack of a separate memory cell also makes GRUs more efficient in
comparison to LSTMs.</p>
    </sec>
    <sec id="sec-5">
      <title>D. Deep learning Architectures</title>
      <p>Below, we describe three architectures - window-based,
word-based, and word+character based - to be used in
conjunction with the different models described above.</p>
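      <p>As an illustrative aside, the gated-unit updates described above can be traced with a minimal scalar sketch of a single GRU step (our own sketch from the equations, not the authors' implementation; scalar weights are used purely for readability):</p>
      <preformat>
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One scalar GRU step: update gate z_t, reset gate r_t,
# candidate state h_tilde, and the interpolated new hidden state.
def gru_step(x_t, h_prev, W_zx, W_zh, W_rx, W_rh, W_x, W_h):
    z = sigmoid(W_zx * x_t + W_zh * h_prev)
    r = sigmoid(W_rx * x_t + W_rh * h_prev)
    h_tilde = math.tanh(W_x * x_t + r * (W_h * h_prev))
    return z * h_prev + (1.0 - z) * h_tilde
```
      </preformat>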
      <p>
        1) Window-based: In this architecture, the window-based
input (i_v) consists of feature vectors (f_v) for each word/term (t)
within an encoded sentence. Each f_v consisted of the following
attributes:
f_v = &lt; t, n_t, t_1, t_-1, t_C, t_aC, t_la, t_p0, t_p1, t_p2,
t_s0, t_s1, t_s2, t_N, t_P &gt;
(4)
where,
t is the term,
n_t is the number of terms in the sentence,
t_1 is a boolean value indicating if the term is the first term
in the sentence,
t_-1 is 1 if the term is the last term in the sentence, 0 otherwise,
t_C is 1 if the first letter in t is uppercase,
t_aC is 1 if all letters in t are uppercase,
t_la is 1 if all letters in t are lowercase,
t_p0, t_p1, t_p2 record character prefixes of t at various
window sizes,
t_s0, t_s1, t_s2 record character suffixes of t at various
window sizes, and
t_N and t_P are the next and previous terms respectively.</p>
      <p>
        2) Word-based: Each word and its corresponding
annotation labels (tags) are encoded with integer values derived
from the unique words and annotations present in the corpus.
The dataset was based on unigram annotations, i.e., only
ontology annotations where a single word in text maps to an
ontology concept were used.</p>
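      <p>The integer encoding of words and tags described above can be sketched as follows (a minimal illustration; the index assignments are arbitrary, not the authors' actual vocabulary):</p>
      <preformat>
```python
# Build an integer index over the unique items in a corpus,
# then encode a sentence and its tags as integer sequences.
def build_index(items):
    return {item: i for i, item in enumerate(sorted(set(items)))}

words = ["Rod", "and", "cone", "vision"]
tags = ["O", "O", "O", "GO:0007601"]
word_idx = build_index(words)
tag_idx = build_index(tags)
encoded_words = [word_idx[w] for w in words]
encoded_tags = [tag_idx[t] for t in tags]
```
      </preformat>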
      <p>In word-based architectures (Figure 1), the input (XtWr ) is
provided to an Embedding layer which converts the input into
dense vectors of 100 dimensions. The output vectors are then
fed to a bidirectional model (RNN/GRU/LSTM) consisting of
150 hidden units. The output from the model goes to a dense
perceptron layer using ReLU activation which also employs a
0.6 Dropout. The output is further fed into a CRF layer which
looks for correlations between annotations in close sequences
to generate the predictions (ypr).</p>
      <p>3) Character+Word Based: A Character+Word based
architecture is similar to the word based architecture described
above. In addition to word-based inputs (XtWr), it also takes
advantage of characters (XtCr) within words to make
predictions.</p>
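      <p>A minimal sketch of how a character-level input can be derived for one word alongside its word-level index (illustrative only; the fixed maximum length, padding value, and index scheme are our assumptions, not the paper's):</p>
      <preformat>
```python
# Encode a word as a fixed-length sequence of character indices,
# padding with 0 for unseen characters and short words.
def char_encode(word, char_idx, max_len=10):
    ids = [char_idx.get(c, 0) for c in word[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Hypothetical character index over lowercase letters (0 = padding/unknown).
char_idx = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
x_char = char_encode("vision", char_idx)
```
      </preformat>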
    </sec>
    <sec id="sec-6">
      <title>E. Model development</title>
      <p>We developed a new deep learning architecture that uses a
Character+word based architecture coupled with two
bidirectional Gated Recurrent Units. Our architecture (Figure 2)
consists of character level input (XtCr) provided to an Embedding
layer (E1) which compresses the dimensions of characters to
the number of unique annotations in the corpus (NTags).</p>
      <p>The output of Embedding layer E1 is fed to a bidirectional
GRU (BiGRU1) layer with 150 units followed by a 60%
output drop in a Dropout layer (D1). Simultaneously, the
word-level input (XtWr ) was provided to a second Embedding
layer (E2) with 30 dimensions. The output from E2 was
concatenated with the output from the first Dropout layer D1
and fed through a second Dropout layer (D2) with a 30% drop.
Output from D2 was fed into a second bidirectional GRU layer
(BiGRU2) consisting of 150 units.</p>
      <p>The above model was tested with and without a final CRF
layer, leading to two new configurations - CW-BiGRU-CRF
and CW-BiGRU. The models were run for 15 epochs
with a batch size of 32 instances.</p>
    </sec>
    <sec id="sec-7">
      <title>F. Model Comparison</title>
      <p>We compared the performance of our new Character+word
based GRU architecture and the two models developed therein
(CW-BiGRU-CRF, CW-BiGRU) (Section IV-E) to
six state of the art models that have been used in prior work.
Below, we specify the component details of each of the six
prior models that have been evaluated.</p>
      <p>1) MLP: Multi-layer perceptrons were used with a window
based architecture to create a three layered (input, hidden,
output) MLP model. The input and the hidden layer consisted
of 512 perceptrons with a Rectified Linear Unit (ReLU)
activation function, while the output layer consisted of
perceptrons equal to the number of unique annotations in the
corpus (NTags). 20% Dropout was used for the hidden and
output layers to prevent overfitting of the data. Categorical
cross-entropy was used for calculating the loss function and
NAdam (Adam RMSprop with Nesterov momentum) was used
as the optimizer function. Each of the feature vectors from
the training data was fed into the MLP architecture for 15
epochs with a batch size of 256.</p>
      <p>2) BiRNN-CRF: The BiRNN-CRF model uses a
word-based input coupled with a BiRNN model and ending with a
CRF model. Similar to the BiRNN architecture (Figure 1), the
BiRNN-CRF model consists of a 100 dimension Embedding
layer, followed by a BiRNN with 150 units, followed by a 0.6
Dropout layer. The output of the Dropout layer is fed to a
CRF which generates the predicted output.</p>
      <p>3) BiLSTM-CRF: The BiLSTM-CRF model is identical to
the BiRNN-CRF except that it uses a LSTM in place of the
RNN.</p>
      <p>4) BiGRU-CRF: The BiGRU-CRF model is identical to
BiRNN-CRF and BiLSTM-CRF except that it uses a Gated
Recurrent Unit in place of the RNN or LSTM.</p>
      <p>5) CW-BiLSTM: The CW-BiLSTM model is similar to the
CW-BiGRU model described above (see Section IV-E) except
that the BiGRU is replaced with a BiLSTM.</p>
      <p>6) CW-BiLSTM-CRF: The CW-BiLSTM-CRF model is
developed by adding a CRF layer at the end of the
CW-BiLSTM model pipeline, meaning that the output of the
CW-BiLSTM model is fed to a CRF layer to generate the
final predictions.</p>
    </sec>
    <sec id="sec-8">
      <title>G. Parameter Tuning</title>
      <p>The GO annotation data was split into training and test
sets using a 70:30 ratio. The training set was used to tune
the following parameters for all models. Multiple
architecture parameters such as - 1) Number of layers in MLP
(along with number of perceptrons), 2) Number of units in
RNN/GRU/LSTM, 3) Embedding Dimensions for Characters
and Words, and 4) Optimization functions, were evaluated for
model performance. A grid search was performed, where
each architecture was evaluated for different combinations of
the parameters. In each case, model performance metrics were
recorded in the form of Precision, Recall, F1-score, and Jaccard
similarity.</p>
      <p>H. Experiments to predict ontology annotations</p>
      <p>The largest number of annotations in the CRAFT corpus
came from the Gene Ontology. So, we first used the GO
annotations to train and test the suite of 8 models described
above. Subsequently, we applied the best model from these
experiments to annotate the CRAFT corpus with the other four
ontologies (Chebi, Cell, Protein, and Sequence corpora).</p>
      <p>Root-Mean-Square propagation (RMSProp) optimizer was
used to test the performance of the different models. A batch
size of 32 along with 15 epochs was used for model training.
Performance characteristics in terms of train-test loss
(calculated using the CRF function), prediction Precision, Recall,
F1-score, and mean semantic similarity were recorded
for each model.</p>
      <p>V.</p>
      <p>RESULTS AND DISCUSSION</p>
      <p>The CRAFT corpus contains 67 full length papers with
annotations from five ontologies (GO, CHEBI, Cell, Protein,
and Sequence). For each of these ontologies, we extracted all
sentences across the 67 papers with at least one annotation for
the ontology. The largest number of annotations came from
the GO (Table I) while the Cell ontology accounted for the
lowest number of annotations.</p>
      <p>Figure 3 shows the loss and accuracy trends for each model
on the GO annotation data. The goal of the models is to
minimize loss while increasing accuracy as the number of
epochs increases.</p>
      <p>First, we see that our CW-BiGRU model shows
improvement in both training and validation accuracy as the number
of epochs increase. Correspondingly, we observe a decrease in
training and validation loss indicating that the model is able
to self-improve with each subsequent epoch.</p>
      <p>The CW-BiGRU-CRF model initially shows the same
accuracy improvement as the CW-BiGRU model, but further
increases in epochs result in a divergence between the training and
validation accuracy, indicating that the model might be prone
to overfitting. While there is a substantial decrease in training
loss, a similar decrease is not observed in validation loss.</p>
      <p>CW-BiLSTM shows similar trends to CW-BiGRU.
CW-BiLSTM-CRF training and validation accuracy increase
similarly until a certain point, after which the validation accuracy
drops and diverges sharply from the training curve, indicating
a case of overfitting.</p>
      <p>BiGRU-CRF and BiRNN-CRF models show substantial
improvement in accuracy with increasing epochs. However,
BiRNN-CRF shows divergence in the loss patterns. Similar to
CW-BiLSTM-CRF, BiLSTM-CRF also shows signs of
overfitting in the accuracy patterns. MLP is the worst performing
model, with very minor improvements in validation accuracy
as the number of epochs increases, indicating that the model is
unable to improve itself with each subsequent epoch.</p>
      <p>It is clear that the CW-BiGRU models are able to outperform
the other models by improving accuracy and reducing loss with
each epoch without overfitting.</p>
      <p>A large proportion of the input data is not annotated to GO
terms but is tagged 'O', indicating the absence of an annotation.
In addition to accurately predicting GO annotations, the model
also needs to accurately predict the absence of an annotation.
However, given the disproportionate amount of data pertaining
to the absence of annotations, the models were observed to
predict the absence of annotations remarkably accurately in
comparison to predicting presence.</p>
      <p>To provide a more conservative view of the models’
performance, we report Precision, Recall, F-1 Score, and Jaccard
similarity (Table II) only on data indicating presence of
ontology terms, i.e. text annotated with an ontology term. Unlike
the accuracy measurements above, the metrics below do not
take into account the models’ performance at identifying the
absence of annotations, but rather focus on ability to identify
annotations when they’re present in the Gold Standard.</p>
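      <p>The presence-only evaluation described above can be sketched as follows (our own illustration, with hypothetical tag sequences; this is not the authors' scoring code):</p>
      <preformat>
```python
# Precision/Recall/F1 computed only where the gold standard carries
# an ontology annotation (tag different from 'O'), plus predicted
# annotations at unannotated positions counted as false positives.
def presence_metrics(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and p == g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and p != g)
    fp = sum(1 for g, p in zip(gold, pred) if g == "O" and p != "O")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
      </preformat>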
      <p>These results (Table II and Figure 3) show that our model
(CW-BiGRU) outperforms the other 7 models in all four
metrics. Our model outperforms the best among the other 7
models (CW-BiLSTM) by 4% (Precision), 2% (Recall), 3%
(F1 score), and 1% (Jaccard similarity).</p>
      <p>Additionally, we observe that character+word based models
(CW-BiGRU, CW-BiLSTM, CW-BiLSTM-CRF, CW-BiGRU-CRF)
outperform models that use only word embeddings.</p>
      <p>Among the character+word based models, surprisingly, the
addition of an extra CRF layer (CW-BiLSTM-CRF,
CW-BiGRU-CRF) either fails to improve performance (e.g.,
CW-BiLSTM vs. CW-BiLSTM-CRF) or leads to a decline in
performance (e.g., CW-BiGRU vs. CW-BiGRU-CRF) as compared
to not using a CRF end layer (CW-BiLSTM, CW-BiGRU).
The MLP model shows substantially lower performance as
compared to the other models across all four metrics. The
Accuracy and Loss plots (Figure 3) suggest that the decline
in performance when adding a CRF layer is due to potential
overfitting.</p>
      <p>We explored how predictions from our best model,
CW-BiGRU, diverge from the Gold Standard. We found that the
majority of predictions (89.25%) are an exact match for the
CRAFT annotations. Surprisingly, only a small proportion of
predictions are partial matches (2.45%). 8.26% of the model’s
predictions are false negatives while 6.38% are false positives.
We hypothesize that one of the primary reasons for false
negatives might be lack of enough training instances for those
particular GO annotations.</p>
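      <p>One way to sketch the categorization of predictions into exact matches, partial matches, false negatives, and false positives (illustrative only; the paper does not publish its scoring code, and here a "partial" match is approximated by shared subsumers between the two concepts):</p>
      <preformat>
```python
# Categorize one prediction against the gold standard tag.
# "subsumers" maps concept IDs to their subsumer sets.
def categorize(gold_tag, pred_tag, subsumers):
    if gold_tag == pred_tag:
        return "exact" if gold_tag != "O" else "true-negative"
    if gold_tag == "O":
        return "false-positive"
    if pred_tag == "O":
        return "false-negative"
    shared = set(subsumers.get(gold_tag, [])).intersection(subsumers.get(pred_tag, []))
    return "partial" if shared else "mismatch"
```
      </preformat>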
      <p>Finally, we applied the best performing model from the
above evaluation (CW-BiGRU) and tested it on data from
four other ontologies. Interestingly, the model shows better
prediction performance on the other ontologies as compared
to GO despite the substantially smaller training datasets (Table
III).</p>
      <p>The data used in this study was limited to single words
annotated to ontology concepts (unigrams). Next, we will
explore more robust models including n-grams to account for
sequences of words tagged with an annotation. Future work
will also include models that can be trained to weight the
prediction of some target classes higher than others. These
models would be able to prioritize presence prediction of
annotations as compared to the absence of an annotation.</p>
      <p>This study demonstrates the utility of deep learning
approaches for automated ontology-based curation of scientific
literature. Specifically, we show that models based on Gated
Recurrent Units are more powerful and accurate at annotation
prediction as compared to the LSTM based models in prior
work. Our findings indicate that deep learning is a promising
new direction for ontology-based text mining, and can be used
for more sophisticated annotation tasks (such as phenotype
curation) that build upon Named Entity Recognition.
</p>
      <p>(Figure 3: training and validation accuracy and loss curves over epochs for the CW-BiGRU, CW-BiGRU-CRF, CW-BiLSTM, CW-BiLSTM-CRF, BiGRU-CRF, BiRNN-CRF, and BiLSTM-CRF models.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ashburner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Botstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Butler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Cherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dolinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Dwight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Eppig</surname>
          </string-name>
          et al., “
          <article-title>Gene ontology: tool for the unification of biology,” Nature genetics</article-title>
          , vol.
          <volume>25</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>25</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dahdul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Dececchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lapp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mabee</surname>
          </string-name>
          , “
          <article-title>Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy</article-title>
          ,”
          <source>Database</source>
          , vol.
          <volume>2015</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Youn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callendar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Storey</surname>
          </string-name>
          , “
          <article-title>NCBO Annotator: semantic annotation of biomedical data</article-title>
          ,” in International Semantic Web Conference, Poster and Demo session,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mungall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>McMurry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Köhler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Balhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Borromeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carbon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Engelstad</surname>
          </string-name>
          et al., “
          <article-title>The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species</article-title>
          ,”
          <source>Nucleic Acids Research</source>
          , vol.
          <volume>45</volume>
          , no.
          <issue>D1</issue>
          , pp.
          <fpage>D712</fpage>
          -
          <lpage>D722</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Spasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McNaught</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , “
          <article-title>Text mining and ontologies in biomedicine: making sense of raw text</article-title>
          ,”
          <source>Briefings in Bioinformatics</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>251</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dahdul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Dececchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mabee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Balhoff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Gopalakrishnan</surname>
          </string-name>
          , “
          <article-title>CharaParser+EQ: performance evaluation without gold standard</article-title>
          ,”
          <source>Proceedings of the Association for Information Science and Technology</source>
          , vol.
          <volume>52</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawakami</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          , “
          <article-title>Neural architectures for named entity recognition</article-title>
          ,”
          <source>arXiv preprint arXiv:1603.01360</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , “
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          ,” in ICML,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Habibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Wiegandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          , “
          <article-title>Deep learning with word embeddings improves biomedical named entity recognition</article-title>
          ,”
          <source>Bioinformatics</source>
          , vol.
          <volume>33</volume>
          , no.
          <issue>14</issue>
          , pp.
          <fpage>i37</fpage>
          -
          <lpage>i48</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ji</surname>
          </string-name>
          , “
          <article-title>Long short-term memory RNN for biomedical named entity recognition</article-title>
          ,”
          <source>BMC Bioinformatics</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>462</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zitnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          , “
          <article-title>Cross-type biomedical named entity recognition with deep multi-task learning</article-title>
          ,”
          <source>arXiv preprint arXiv:1801.09851</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eckert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shipley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sitnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verspoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          et al., “
          <article-title>Concept annotation in the CRAFT corpus</article-title>
          ,”
          <source>BMC Bioinformatics</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>161</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Falcao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lord</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , “
          <article-title>Semantic similarity in biomedical ontologies</article-title>
          ,”
          <source>PLoS Computational Biology</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>7</issue>
          , p.
          <fpage>e1000443</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          , “
          <article-title>Learning internal representations by error propagation</article-title>
          ,”
          <source>California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep.</source>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fausett</surname>
          </string-name>
          , “
          <article-title>Fundamentals of neural networks: architectures, algorithms, and applications</article-title>
          ,”
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , “
          <article-title>Long short-term memory</article-title>
          ,”
          <source>Neural Computation</source>
          , vol.
          <volume>9</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          ,”
          <source>arXiv preprint arXiv:1406.1078</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>