Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA                       1


Taking a Dive: Experiments in Deep Learning for Automatic Ontology-based Annotation of Scientific Literature

Prashanti Manda¹,*, Lucas Beasley¹ and Somya D. Mohanty¹,*
¹ Department of Computer Science, University of North Carolina at Greensboro, NC, USA
* Equal contributions


p manda@uncg.edu
sdmohant@uncg.edu

I. ABSTRACT

Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning to ontology-based curation is a relatively new area, and prior work has focused on a limited set of models.

Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model's performance. We also compare our model to seven models from prior work. We use four metrics (Precision, Recall, F1 score, and a semantic similarity metric, Jaccard similarity) to compare our model's output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, 83% F1, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models than word-only inputs.

These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.

II. INTRODUCTION

Ontology-based data representation has been widely adopted in data intensive fields such as biology and biomedicine due to the need for large scale computationally amenable data [1]. However, the majority of ontology-based data generation relies on manual literature curation, a slow and tedious process [2]. Natural language and text mining methods have been developed as the solution for scalable ontology-based data curation [3, 4].

One of the most important tasks for annotating scientific literature with ontology concepts is Named Entity Recognition (NER). In the context of ontology-based annotation, NER can be described as recognizing ontology concepts from text [5]. Outside the scope of ontology-based annotation, NER has been applied to biomedical and biological literature for recognizing genes, proteins, diseases, etc. [5].

The large majority of ontology driven NER techniques rely on lexical and syntactic analysis of text in addition to machine learning for recognizing and tagging ontology concepts [3, 4, 6]. In recent years, deep learning has been introduced for NER of biological entities from literature [7, 8, 9, 10, 11]. However, the majority of prior work has focused on a limited set of models, particularly the Long Short-Term Memory (LSTM) model (e.g. [7]).

Here, we present a new deep learning architecture that utilizes Gated Recurrent Units (GRU) while taking advantage of word and character encodings from the annotation training data to recognize ontology concepts from text. We evaluate our model in comparison to 7 deep learning models used in prior work to show that our model outperforms the state of the art at the task of ontology-based NER.

We use the Colorado Richly Annotated Full-Text (CRAFT) corpus [12] as a Gold Standard reference to develop and evaluate the deep learning models. The CRAFT corpus contains 67 open access, full length biomedical articles annotated with concepts from several ontologies (such as Gene Ontology, Protein Ontology, Sequence Ontology, etc.). We use four metrics, 1) Precision, 2) Recall, 3) F1 score and 4) Jaccard semantic similarity, to compare each model's performance to the Gold Standard.

Precision and Recall are traditionally used to assess the performance of information retrieval systems. However, these metrics do not take into account the notion of partial information retrieval, which is important for ontology-based annotation retrieval. Sometimes, an NLP system might not retrieve the same ontology concept as the gold standard but a related concept (sub-class or super-class). To assess the performance of the NLP system accurately, we need semantic similarity metrics that can measure different degrees of semantic relatedness between ontology concepts [13]. Here, we use Jaccard similarity to compare annotations from each deep learning model to the gold standard. Jaccard similarity assesses similarity between two ontology terms based on the ontological distance between them: the closer two terms are, the more


     ICBO 2018                                                   August 7-10, 2018                                                    1


similar they are considered to be [13].

III. RELATED WORK

The application of deep learning for ontology-based Named Entity Recognition is a nascent area with relatively little prior work. Habibi et al. [9] studied entity recognition on biomedical literature using a long short-term memory network-conditional random field (LSTM-CRF) and showed that the method outperformed other NER tools that do not use deep learning or that use deep learning methods without word embeddings. Lyu et al. [10] also explored LSTM based models enhanced with word and character embeddings. They do not evaluate other deep learning models but present results only based on LSTM with word embeddings. Wang et al. [11] also propose an LSTM based method for recognizing biomedical entities from literature. Similar to the above studies, Wang et al. show that a bidirectional LSTM method used with Conditional Random Field (CRF) and word embeddings outperforms other methods.

The striking difference between these prior studies and our work here is that the majority of prior literature focuses on LSTM based methods along with CRF and word embeddings. The potential of other deep learning models such as Recurrent Neural Networks, Gated Recurrent Units, etc., at the task of ontology-based NER remains unexplored, presenting a unique need and opportunity. Our study aims to fill this knowledge gap. In addition, all the above studies focus on non-ontology based NER for entities such as genes, disease names, etc. In contrast, our study's focus is on recognizing ontology concepts within text.

IV. METHODS

A. Data Preprocessing

Annotation files for the 67 papers in CRAFT were cleaned to remove punctuation symbols (except for the period at the end of sentences), special symbols, and non-ASCII characters. Annotations for GO, CHEBI, Cell, Protein, and Sequence ontologies were converted from the cleaned files to separate ontology-specific text files that represent the presence or absence of ontology terms. For each ontology, every sentence containing at least one annotation from that ontology was represented using two lines in the ontology-specific text file. The first of these two lines contained an array with each word in the sentence. The second contained an ordered encoding corresponding to words in the first line. These encodings could be an ontology concept ID if the corresponding word was annotated in CRAFT or an 'O' if the corresponding word was not annotated.

For example, the sentence "Rod and cone photoreceptors subserve vision under dim and bright light conditions respectively" where the word "vision" was annotated to GO ID "GO:0007601 (perception of sight)" would be represented using the two lines below:
   • ['Rod', 'and', 'cone', 'photoreceptors', 'subserve', 'vision', 'under', 'dim', 'and', 'bright', 'light', 'conditions', 'respectively']
   • ['O', 'O', 'O', 'O', 'O', 'GO:0007601', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Only annotations to single words (unigrams) were included in these preprocessed files. So, if an annotation was made in CRAFT to a phrase containing more than one word, it was ignored in the preprocessed data.

B. Performance evaluation metrics

Precision, Recall, F1-score, and Jaccard similarity were used to evaluate the performance of the models. The Jaccard similarity (J) of two ontology concepts (in this case, annotations) (A, B) in an ontology is defined as the ratio of the number of classes in the intersection of their subsumers over the number of classes in the union of their subsumers [13].

\[
J(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}
\]

where $S(A)$ is the set of classes that subsume $A$. Jaccard similarity ranges from 0 (no similarity) to 1 (exact match).

C. Deep learning models

Below, we describe four deep learning models: Multi-layer perceptrons, Recurrent Neural Networks, Long Short-Term Memory, and Gated Recurrent Units. Next, we describe three architectures (window based, word based, and character-word based) that can be used in conjunction with the above models. Finally, we describe our new model that combines a character-word based architecture with Gated Recurrent Units, and six models used in prior work.

1) Multi-Layer Perceptron (MLP): A Multi-Layer Perceptron (MLP) [14] is a feed-forward deep neural network model which consists of an input layer, one or more hidden layers, and an output layer, each consisting of a number of perceptrons. A single perceptron computes the output as $\gamma = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right)$, where $w$ is the weight vector, $x$ is the provided input, $b$ is the bias, and $\phi$ is the activation function. The weights and biases of each perceptron in the layers are adjusted using back-propagation to minimize prediction error.

2) Recurrent Neural Network: A Recurrent Neural Network (RNN) [15] is an adaptation of feed-forward neural networks, where the history of the input sequence is taken into consideration for future prediction. Given an input sequence $\langle x_0, x_1, x_2, \cdots, x_i \rangle$, the hidden state ($h_t$) of an RNN is updated as follows:

\[
h_t = \begin{cases} 0, & t = 0 \\ \sigma(W^{hh} h_{t-1} + W^{hx} x_t), & t > 0 \end{cases}
\qquad
y_t = \mathrm{softmax}(W^{s} h_t) \tag{1}
\]

where $x_t$ is the input provided to the hidden state $h_t$ at time $t$, which is updated using a sigmoid function $\sigma$. $\sigma$ is calculated over the previous time state of the network given by $h_{t-1}$ and the current input $x_t$. $W^{hh}$, $W^{hx}$, and $W^{s}$ are the weights computed over training. The network can then produce an output prediction $\langle y_0, y_1, y_2, \cdots, y_j \rangle$ using a softmax function on the hidden state $h_t$.

A bidirectional Recurrent Neural Network (BiRNN) is an RNN where the input data is fed to the neural network two times: once in forward and again in reverse order.
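
The two-line representation described in Section IV-A can be illustrated with a short sketch. The function name `encode_sentence` and the index-to-ID mapping used as input are assumptions made for this example; the paper does not describe the actual preprocessing code.

```python
def encode_sentence(tokens, annotations):
    """Build the two aligned lines: the words and their tags.

    `annotations` maps a token index to an ontology concept ID
    (unigram annotations only). Unannotated tokens receive the tag 'O'.
    This input format is an assumption made for illustration.
    """
    tags = [annotations.get(i, "O") for i in range(len(tokens))]
    return tokens, tags


sentence = ("Rod and cone photoreceptors subserve vision under dim and "
            "bright light conditions respectively")
tokens = sentence.split()

# 'vision' is the token at index 5; the paper's example annotates it
# with GO:0007601.
tokens, tags = encode_sentence(tokens, {tokens.index("vision"): "GO:0007601"})

print(tokens)
print(tags)
```

Both lists have the same length, so position `k` in the tag line always refers to position `k` in the word line.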


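
The Jaccard similarity of Section IV-B can be sketched as follows. The toy hierarchy and its class names are hypothetical, and the sketch assumes that a class counts among its own subsumers (so that an exact match scores 1 even at the root); the paper does not state which convention it uses.

```python
def subsumers(term, parents):
    """Return all classes that subsume `term`, including `term` itself.

    `parents` maps each class to its direct superclasses. Treating a class
    as its own subsumer is an assumption made for this sketch.
    """
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def jaccard(a, b, parents):
    """J(A, B) = |S(A) & S(B)| / |S(A) | S(B)|."""
    s_a, s_b = subsumers(a, parents), subsumers(b, parents)
    return len(s_a & s_b) / len(s_a | s_b)


# Hypothetical hierarchy: root <- A <- B, and root <- C.
toy_parents = {"B": ["A"], "A": ["root"], "C": ["root"]}

print(jaccard("B", "B", toy_parents))  # exact match
print(jaccard("B", "A", toy_parents))  # prediction is the direct superclass
print(jaccard("B", "C", toy_parents))  # only the root is shared
```

A prediction of a superclass receives partial credit here, which is the behaviour the metric is meant to capture for related but non-identical concepts.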
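
As a minimal sketch of the RNN update in Eq. (1), the code below runs the recurrence over a short sequence using plain Python lists. The helper names and the small weight values are assumptions chosen for illustration; they are not the trained weights or the implementation used in the paper.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(xs, W_hh, W_hx, W_s):
    """Apply Eq. (1): h_0 = 0, h_t = sigmoid(W_hh h_{t-1} + W_hx x_t) for t > 0."""
    h = [0.0] * len(W_hh)
    outputs = []
    for t, x in enumerate(xs):
        if t > 0:
            pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_hx, x))]
            h = [sigmoid(p) for p in pre]
        outputs.append(softmax(matvec(W_s, h)))  # y_t = softmax(W_s h_t)
    return outputs


# Illustrative sizes: 2 input features, 2 hidden units, 2 output tags.
W_hh = [[0.1, 0.0], [0.0, 0.1]]
W_hx = [[0.5, -0.5], [0.3, 0.2]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ys = rnn_forward(xs, W_hh, W_hx, W_s)
for y in ys:
    print(y)
```

A bidirectional variant would run the same function a second time on `list(reversed(xs))` and combine the two sets of hidden states.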


3) Long Short-Term Memory: While RNNs are effective in learning temporal patterns, they suffer from a vanishing gradient problem where long term dependencies are lost. A solution to the problem was proposed by Hochreiter et al. [16] using a variation of RNNs called Long Short-Term Memory (LSTM). LSTMs use a memory cell ($c_t$) to keep track of long-term relationships within text. Using a gated architecture (input, output, and forget), LSTMs are able to modulate the exposure of a memory cell by regulating the gates. LSTMs can be defined as:

\[
\begin{aligned}
i_t &= \sigma(W^{ix} x_t + W^{ih} h_{t-1}) \\
f_t &= \sigma(W^{fx} x_t + W^{fh} h_{t-1}) \\
o_t &= \sigma(W^{ox} x_t + W^{oh} h_{t-1}) \\
g_t &= \tanh(W^{gx} x_t + W^{gh} h_{t-1}) \\
c_t &= c_{t-1} \odot f_t + g_t \odot i_t \\
h_t &= \tanh(c_t) \odot o_t
\end{aligned} \tag{2}
\]

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates respectively. Each gate uses a sigmoid ($\sigma$) function applied over the sum of the input $x_t$ and the previous hidden state $h_{t-1}$ (multiplied with their weight matrices $W$). $g_t$ denotes the candidate state computed over a tanh function on the input and previous hidden state. $W^{ix}$, $W^{fx}$, $W^{ox}$, $W^{gx}$ are weight matrices used with the input $x_t$, while $W^{ih}$, $W^{fh}$, $W^{oh}$, and $W^{gh}$ are used with hidden states for each gate and candidate state. The memory cell $c_t$ multiplies ($\odot$, element-wise) the old memory cell $c_{t-1}$ by the forget gate ($f_t$) and adds the candidate state ($g_t$) multiplied by the input gate ($i_t$). The hidden state is given by a tanh function applied to the memory cell $c_t$, multiplied by the output gate ($o_t$).

4) Gated Recurrent Unit: A variation on the LSTM was introduced by Cho et al. [17] as the Gated Recurrent Unit (GRU). Using update and reset gates, GRUs are able to control the amount of information within a unit (without a separate memory cell as with LSTM). GRUs can formally be defined as

\[
\begin{aligned}
z_t &= \sigma(W^{zx} x_t + W^{zh} h_{t-1}) \\
r_t &= \sigma(W^{rx} x_t + W^{rh} h_{t-1}) \\
\tilde{h}_t &= \tanh(W^{x} x_t + r_t \odot W^{h} h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned} \tag{3}
\]

where $z_t$ and $r_t$ are the update and reset gates respectively, and $\tilde{h}_t$ is the candidate activation/hidden state.

Similar to the LSTM architecture, GRUs benefit from the additive properties in their network to remember long term dependencies and solve the vanishing gradient problem. Since GRUs do not utilize an output gate, they are able to write the entire contents of their memory to the network. The lack of a separate memory cell also makes GRUs more efficient in comparison to LSTMs.

D. Deep learning Architectures

Below, we describe three architectures (window-based, word-based, and word+character based) to be used in conjunction with the different models described above.

1) Window-based: In this architecture, the window-based input ($i_v$) consists of feature vectors ($f_v$) for each word/term ($t$) within an encoded sentence. Each $f_v$ consisted of the following attributes:

\[
f_v = \langle t, n_t, t_1, t_{-1}, t_C, t_{aC}, t_{al}, t_{p0}, t_{p0-1}, t_{p0-2}, t_{s0}, t_{s0-1}, t_{s0-2}, t_N, t_P \rangle \tag{4}
\]

where,
   $t$ is the term,
   $n_t$ is the number of terms in the sentence,
   $t_1$ is a boolean value indicating if the term is the first term in the sentence,
   $t_{-1}$ is 1 if the term is the last term in the sentence, 0 otherwise,
   $t_C$ is 1 if the first letter in $t$ is uppercase,
   $t_{aC}$ is 1 if all letters in $t$ are uppercase,
   $t_{al}$ is 1 if all letters in $t$ are lowercase,
   $t_{p0}, t_{p0-1}, t_{p0-2}$ record character prefixes of $t$ at various window sizes,
   $t_{s0}, t_{s0-1}, t_{s0-2}$ record character suffixes of $t$ at various window sizes,
   $t_N$ and $t_P$ are the next and previous terms respectively.

2) Word-based: Each word and its corresponding annotation labels (tags) are encoded with integer values, derived from the unique words and annotations present in the corpus. The dataset was based on unigram annotations, that is, only ontology annotations where a single word in text maps to an ontology concept.

In word-based architectures (Figure 1), the input ($X_{tr}^{W}$) is provided to an Embedding layer which converts the input into dense vectors of 100 dimensions. The output vectors are then fed to a bidirectional model (RNN/GRU/LSTM) consisting of 150 hidden units. The output from the model goes to a dense perceptron layer using ReLU activation which also employs a 0.6 Dropout. The output is further fed into a CRF layer which looks for correlations between annotations in close sequences to generate the predictions ($y_{pr}$).

3) Character+Word Based: A Character+Word based architecture is similar to the word based architecture described above. In addition to word-based inputs ($X_{tr}^{W}$), it also takes advantage of characters ($X_{tr}^{C}$) within words to make predictions.

E. Model development

We developed a new deep learning architecture that uses a Character+word based architecture coupled with two bidirectional Gated Recurrent Units. Our architecture (Figure 2) consists of character level input ($X_{tr}^{C}$) provided to an Embedding layer ($E_1$) which compresses the dimensions of characters to the number of unique annotations in the corpus ($NTags$).

The output of Embedding layer $E_1$ is fed to a bidirectional GRU ($BiGRU_1$) layer with 150 units followed by a 60% output drop in a Dropout layer ($D_1$). Simultaneously, the word-level input ($X_{tr}^{W}$) was provided to a second Embedding layer ($E_2$) with 30 dimensions. The output from $E_2$ was concatenated with the output from the first Dropout layer $D_1$ and fed through a second Dropout layer ($D_2$) with a 30% drop.
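
To make the LSTM update of Section IV-C.3 concrete, here is a minimal sketch of a single time step following Eq. (2), written with plain Python lists. The helper names, the dictionary layout of the weights, and the all-zero weights used in the usage example are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def affine(W_x, x, W_h, h):
    """Compute W_x x + W_h h, one entry per hidden unit."""
    return [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]


def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step per Eq. (2). `W` maps names such as 'ix' to matrices."""
    i = [sigmoid(v) for v in affine(W["ix"], x, W["ih"], h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(W["fx"], x, W["fh"], h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(W["ox"], x, W["oh"], h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(W["gx"], x, W["gh"], h_prev)]  # candidate
    # c_t = c_{t-1} * f_t + g_t * i_t (element-wise)
    c = [cp * ft + gt * it for cp, ft, gt, it in zip(c_prev, f, g, i)]
    # h_t = tanh(c_t) * o_t (element-wise)
    h = [math.tanh(ct) * ot for ct, ot in zip(c, o)]
    return h, c


# Usage with 3 input features and 2 hidden units. All weights are zero,
# so every gate evaluates to 0.5 and the candidate state to 0.
n_in, n_hid = 3, 2
zeros_x = [[0.0] * n_in for _ in range(n_hid)]
zeros_h = [[0.0] * n_hid for _ in range(n_hid)]
W = {k + "x": zeros_x for k in "ifog"}
W.update({k + "h": zeros_h for k in "ifog"})

h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -2.0], W)
print(h, c)
```

With zero weights the forget gate halves the old memory cell, which makes the role of each term in Eq. (2) easy to check by hand.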


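
The GRU update of Section IV-C.4 can be sketched in the same style. The code follows Eq. (3) with the reset gate applied element-wise to $W^{h} h_{t-1}$; the helper names, the weight dictionary, and the zero weights in the usage example are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_prev, W):
    """One GRU step per Eq. (3). `W` maps names such as 'zx' to matrices."""
    z = [sigmoid(a + b) for a, b in
         zip(matvec(W["zx"], x), matvec(W["zh"], h_prev))]   # update gate
    r = [sigmoid(a + b) for a, b in
         zip(matvec(W["rx"], x), matvec(W["rh"], h_prev))]   # reset gate
    # Candidate state: tanh(W_x x + r * (W_h h_{t-1})), element-wise product.
    cand = [math.tanh(a + rt * b) for a, rt, b in
            zip(matvec(W["x"], x), r, matvec(W["h"], h_prev))]
    # h_t = z * h_{t-1} + (1 - z) * candidate
    return [zt * hp + (1.0 - zt) * ct for zt, hp, ct in zip(z, h_prev, cand)]


# Usage with 3 input features and 2 hidden units. With all-zero weights
# both gates are 0.5 and the candidate is 0, so the state is halved.
n_in, n_hid = 3, 2
zeros_x = [[0.0] * n_in for _ in range(n_hid)]
zeros_h = [[0.0] * n_hid for _ in range(n_hid)]
W = {"zx": zeros_x, "rx": zeros_x, "x": zeros_x,
     "zh": zeros_h, "rh": zeros_h, "h": zeros_h}

h_new = gru_step([1.0, 2.0, 3.0], [2.0, -4.0], W)
print(h_new)
```

Compared with the LSTM step, there is no separate memory cell and no output gate: the single state vector is both the memory and the output.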
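
The window-based feature vector of Eq. (4) might be assembled as in the sketch below. The dictionary keys mirror the symbols in Eq. (4); the prefix and suffix lengths (1, 2, and 3 characters) and the empty string used for a missing neighbour are assumptions, since the paper only says the affixes are taken "at various window sizes".

```python
def window_features(tokens, i):
    """Build the feature vector f_v of Eq. (4) for the term at position i."""
    t = tokens[i]
    return {
        "t": t,                                 # the term itself
        "n_t": len(tokens),                     # number of terms in the sentence
        "t_1": int(i == 0),                     # first term in the sentence?
        "t_-1": int(i == len(tokens) - 1),      # last term in the sentence?
        "t_C": int(t[:1].isupper()),            # first letter uppercase?
        "t_aC": int(t.isupper()),               # all letters uppercase?
        "t_al": int(t.islower()),               # all letters lowercase?
        "t_p0": t[:1],                          # prefixes (lengths assumed)
        "t_p0-1": t[:2],
        "t_p0-2": t[:3],
        "t_s0": t[-1:],                         # suffixes (lengths assumed)
        "t_s0-1": t[-2:],
        "t_s0-2": t[-3:],
        "t_N": tokens[i + 1] if i + 1 < len(tokens) else "",  # next term
        "t_P": tokens[i - 1] if i > 0 else "",                # previous term
    }


sentence = ["Rod", "and", "cone", "photoreceptors", "subserve", "vision"]
features = [window_features(sentence, i) for i in range(len(sentence))]
print(features[0])
```

Each dictionary has 15 entries, one per attribute in Eq. (4); before being fed to an MLP the string-valued entries would still need a numeric encoding, which is not shown here.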


Fig. 1.    Word-based architecture using bidirectional RNN/GRU/LSTM models


Output from $D_2$ was fed into a second bidirectional GRU layer ($BiGRU_2$) consisting of 150 units.

The above model was tested with and without a final CRF layer, leading to two new configurations: CW-BiGRU-CRF and CW-BiGRU. The models were run for 15 epochs with a batch size of 32 instances.

F. Model Comparison

We compared the performance of our new Character+word based GRU architecture and the two models developed therein (CW-BiGRU-CRF, CW-BiGRU) (Section IV-E) to six state of the art models that have been used in prior work. Below, we specify the component details of each of the six prior models that have been evaluated.

1) MLP: Multi-layer perceptrons were used with a window based architecture to create a three layered (input, hidden, output) MLP model. The input and the hidden layer consisted of 512 perceptrons with a Rectified Linear Unit (ReLU) activation function while the output layer consisted of perceptrons equal to the number of unique annotations in the corpus ($NTags$). 20% Dropout was used for the hidden and output layers to prevent overfitting of the data. Categorical cross-entropy was used for calculating the loss function and NAdam (Adam RMSprop with Nesterov momentum) was used as the optimizer function. Each of the feature vectors (from the training data) was fed into the MLP architecture for 15 epochs with a batch size of 256.

2) BiRNN-CRF: The BiRNN-CRF model uses a word-based input coupled with a BiRNN model and ending with a CRF model. Similar to the BiRNN architecture (Figure 1), the BiRNN-CRF model consists of a 100 dimension Embedding layer followed by a BiRNN with 150 units followed by a 0.6 Dropout layer. The output of the 0.6 Dropout layer is fed to a CRF which generates the predicted output.

3) BiLSTM-CRF: The BiLSTM-CRF model is identical to the BiRNN-CRF except that it uses an LSTM in place of the RNN.

4) BiGRU-CRF: The BiGRU-CRF model is identical to BiRNN-CRF and BiLSTM-CRF except that it uses a Gated Recurrent Unit in place of the RNN or LSTM.

5) CW-BiLSTM: The CW-BiLSTM model is similar to the CW-BiGRU model described above (see Section IV-E) except that the BiGRU is replaced with a BiLSTM.

6) CW-BiLSTM-CRF: The CW-BiLSTM-CRF model is developed by adding a CRF layer at the end of the CW-BiLSTM model pipeline, so that the output of the CW-BiLSTM model is fed to a CRF layer to generate the final predictions.

G. Parameter Tuning

The GO annotation data was split into training and test sets using a 70:30 ratio. The training set was used to tune the following parameters for all models. Multiple architecture parameters, such as 1) Number of layers in MLP (along with number of perceptrons), 2) Number of units in RNN/GRU/LSTM, 3) Embedding Dimensions for Characters and Words, and 4) Optimization functions, were evaluated for model performance. A grid search was explored, where each architecture was evaluated for different combinations of the parameters. In each case, model performance metrics were recorded in the form of Precision, Recall, F1-score, and Jaccard similarity.
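
The 70:30 split and the grid search of Section IV-G can be sketched as below. The candidate parameter values, the shuffling before the split, and the placeholder scoring function are all assumptions: the paper does not list the values that were searched, and in practice the score would come from training and evaluating a model for each configuration.

```python
import itertools
import random


def split_70_30(sentences, seed=0):
    """Shuffle (an assumption) and split the data into 70% train, 30% test."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def parameter_grid(options):
    """Yield every combination of the candidate parameter values."""
    names = sorted(options)
    for values in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(options, score):
    """Return the (configuration, score) pair with the highest score."""
    scored = ((cfg, score(cfg)) for cfg in parameter_grid(options))
    return max(scored, key=lambda pair: pair[1])


# Hypothetical search space; these are not the values used in the paper.
options = {
    "units": [50, 150],
    "embedding_dim": [30, 100],
    "optimizer": ["rmsprop", "nadam"],
}


def placeholder_score(cfg):
    """Stand-in for 'train the model with cfg and return its F1 score'."""
    return cfg["units"] + cfg["embedding_dim"]


train, test = split_70_30(range(10))
best_cfg, best_score = grid_search(options, placeholder_score)
print(len(train), len(test), best_cfg, best_score)
```

The number of configurations grows multiplicatively with each added parameter (here 2 x 2 x 2 = 8), which is the main cost of an exhaustive grid search.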




Fig. 2.    Character+word based architecture using two bidirectional GRU models.


                                                                                    TABLE I.  C HARACTERISTICS OF THE CRAFT CORPUS - N UMBER OF
H. Experiments to predict ontology annotations                                       SENTENCES WITH AT LEAST ONE ANNOTATION , NUMBER OF UNIQUE
   The largest number of annotations in the CRAFT corpus                            ANNOTATIONS ( UNIGRAMS ONLY ), AND NUMBER OF UNIQUE WORDS IN
                                                                                                            THE CORPUS .
came from the Gene Ontology. So, we first used the GO
annotations to train and test the suite of 8 models described                        Dataset
                                                                                                 Number of        Number of               Number of
                                                                                                 Sentences    Unique Annotations   Unique Words in the Corpus
above. Subsequently, we applied the best model from these                            GO           17,921             359                    9,571
experiments to annotate the CRAFT corpus with the other four                         Sequence     15,606             156                    7,262
ontologies (Chebi, Cell, Protein, and Sequence corpora).                             Protein      12,621             546                    5,153
                                                                                     Chebi        11,109             309                    3,127
   Root-Mean-Square propagation (RMSProp) optimizer was                              Cell          9,088             68                     3,042
used to test the performance of the different models. A batch
size of 32 along with 15 epochs was used for model training.
Performance characteristics in terms of train-test loss (calcu-                    training and validation loss indicating that the model is able
lated using the CRF function), prediction precision, recall,
F1-score, and mean semantic similarity score were recorded
for each model.

                    V.    RESULTS AND DISCUSSION

   The CRAFT corpus contains 67 full-length papers with
annotations from five ontologies (GO, CHEBI, Cell, Protein,
and Sequence). For each of these ontologies, we extracted all
sentences across the 67 papers with at least one annotation for
the ontology. The largest number of annotations came from
the GO (Table I) while the Cell ontology accounted for the
lowest number of annotations.
   Figure 3 shows the loss and accuracy trends for each model
on the GO annotation data. The goal of the models is to
minimize loss while increasing accuracy as the number of
epochs increases.
   First, we see that our CW-BiGRU model shows improve-
ment in both training and validation accuracy as the number
of epochs increases. Correspondingly, we observe a decrease in

to self-improve with each subsequent epoch.
   The CW-BiGRU-CRF model initially shows the same
accuracy improvement as the CW-BiGRU model, but later
epochs result in a divergence between the training and
validation accuracy, indicating that the model might be prone
to overfitting. While there is a substantial decrease in training
loss, a similar decrease is not observed in validation loss.
   CW-BiLSTM shows similar trends to CW-BiGRU. CW-
BiLSTM-CRF training and validation accuracy increase
similarly until a certain point, after which the validation
accuracy drops and diverges sharply from the training curve,
indicating a case of overfitting.
   BiGRU-CRF and BiRNN-CRF models show substantial
improvement in accuracy with increasing epochs. However,
BiRNN-CRF shows divergence in the loss patterns. Similar to
CW-BiLSTM-CRF, BiLSTM-CRF also shows signs of
overfitting in the accuracy patterns. MLP is the worst-performing
model, with very minor improvements in validation accuracy
as the number of epochs increases, indicating that the model is
unable to improve itself with each subsequent epoch.
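   The overfitting pattern described above, where training and
validation accuracy move apart after some epoch, can be checked
mechanically. The following is a minimal sketch and not the
procedure used in this study: the function name divergence_epoch,
the gap threshold max_gap, and the patience window are all
illustrative assumptions, and the curves are made-up numbers rather
than values read from Figure 3.

```python
from typing import List, Optional


def divergence_epoch(train_acc: List[float],
                     val_acc: List[float],
                     max_gap: float = 0.01,
                     patience: int = 2) -> Optional[int]:
    """Return the first epoch (0-based) that starts a run of `patience`
    consecutive epochs in which training accuracy exceeds validation
    accuracy by more than `max_gap`; return None if no such run exists.

    Both `max_gap` and `patience` are assumed, illustrative settings.
    """
    if len(train_acc) != len(val_acc):
        raise ValueError("curves must have the same number of epochs")
    run = 0  # length of the current run of over-threshold gaps
    for epoch, (tr, va) in enumerate(zip(train_acc, val_acc)):
        run = run + 1 if tr - va > max_gap else 0
        if run == patience:
            return epoch - patience + 1
    return None


# Made-up curves for illustration only; not values from Figure 3.
steady_train = [0.95, 0.96, 0.97, 0.975, 0.98]
steady_val = [0.95, 0.958, 0.966, 0.971, 0.975]
diverging_train = [0.95, 0.96, 0.97, 0.98, 0.99]
diverging_val = [0.95, 0.958, 0.955, 0.950, 0.945]

print(divergence_epoch(steady_train, steady_val))        # None
print(divergence_epoch(diverging_train, diverging_val))  # 2
```

   A check of this kind only flags a persistent gap between the two
curves; whether that gap reflects overfitting still has to be judged
alongside the validation loss, as is done in the discussion above.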




   It is clear that the CW-BiGRU models are able to outperform
the other models by improving accuracy and reducing loss with
each epoch without overfitting.
   A large proportion of input data is not annotated to GO
terms but to a tag ‘O’ indicating the absence of an annotation.
In addition to accurately predicting GO annotations, the model
also needs to accurately predict the absence of an annotation.
However, given the disproportionate amount of data pertaining
to the absence of annotations, the models were observed to
predict the absence of annotations remarkably accurately in
comparison to predicting presence.
   To provide a more conservative view of the models’ per-
formance, we report Precision, Recall, F1 score, and Jaccard
similarity (Table II) only on data indicating presence of ontol-
ogy terms, i.e., text annotated with an ontology term. Unlike
the accuracy measurements above, the metrics below do not
take into account the models’ performance at identifying the
absence of annotations, but rather focus on the ability to identify
annotations when they are present in the Gold Standard.

  TABLE II.    PRECISION, RECALL, F1, AND JACCARD SIMILARITY
    SCORES FOR THE EIGHT MODELS ON CRAFT GENE ONTOLOGY
                        ANNOTATION DATA.

       Model            Precision   Recall   F1      Jaccard Similarity
       CW-BiGRU           0.84      0.84     0.83      0.84
       CW-BiLSTM          0.80      0.82     0.80      0.83
       CW-BiLSTM-CRF      0.80      0.82     0.80      0.82
       CW-BiGRU-CRF       0.77      0.80     0.78      0.82
       BiGRU-CRF          0.75      0.77     0.75      0.78
       BiRNN-CRF          0.72      0.74     0.72      0.75
       BiLSTM-CRF         0.70      0.70     0.70      0.71
       MLP                0.65      0.60     0.61      0.61

   These results (Table II and Figure 3) show that our model
(CW-BiGRU) outperforms the other 7 models in all four
metrics. Our model outperforms the best among the other 7
models (CW-BiLSTM) by 4 percentage points in Precision,
2 in Recall, 3 in F1 score, and 1 in Jaccard similarity.
   Additionally, we observe that character-word based models
(CW-BiGRU, CW-BiLSTM, CW-BiLSTM-CRF, CW-BiGRU-
CRF) outperform models that use only word embeddings.
   Among the character-word based models, surprisingly, the
addition of an extra CRF layer (CW-BiLSTM-CRF, CW-
BiGRU-CRF) either fails to improve performance (e.g., CW-
BiLSTM vs. CW-BiLSTM-CRF) or leads to a decline in per-
formance (e.g., CW-BiGRU vs. CW-BiGRU-CRF) as compared
to not using a CRF end layer (CW-BiLSTM, CW-BiGRU).
The MLP model shows substantially lower performance as
compared to the other models across all four metrics. The
Accuracy and Loss plots (Figure 3) suggest that the decline
in performance when adding a CRF layer is due to potential
overfitting.
   We explored how predictions from our best model, CW-
BiGRU, diverge from the Gold Standard. We found that the
majority of predictions (89.25%) are an exact match for the
CRAFT annotations. Surprisingly, only a small proportion of
predictions are partial matches (2.45%). 8.26% of the model’s
predictions are false negatives while 6.38% are false positives.
We hypothesize that one of the primary reasons for false
negatives might be the lack of sufficient training instances for
those particular GO annotations.
   Finally, we applied the best-performing model from the
above evaluation (CW-BiGRU) and tested it on data from
four other ontologies. Interestingly, the model shows better
prediction performance on the other ontologies as compared
to GO despite the substantially smaller training datasets (Table
III).

  TABLE III.    PRECISION, RECALL, F1, AND JACCARD SIMILARITY
    SCORES FOR THE CW-BIGRU MODEL ON ANNOTATIONS FROM FIVE
                        ONTOLOGIES IN CRAFT.

       Model        Ontology    Precision   Recall   F1      Jaccard Similarity
       CW-BiGRU     Cell          0.92      0.92     0.92      0.925
       CW-BiGRU     Protein       0.91      0.90     0.90      0.917
       CW-BiGRU     CHEBI         0.86      0.87     0.86      0.882
       CW-BiGRU     GO            0.84      0.84     0.83      0.843
       CW-BiGRU     Sequence      0.83      0.86     0.84      0.864

                VI.    CONCLUSIONS AND FUTURE WORK

   The data used in this study was limited to single words
annotated to ontology concepts (unigrams). Next, we will
explore more robust models including n-grams to account for
sequences of words tagged with an annotation. Future work
will also include models that can be trained to weight the
prediction of some target classes higher than others. These
models would be able to prioritize presence prediction of
annotations as compared to the absence of an annotation.
   This study demonstrates the utility of deep learning ap-
proaches for automated ontology-based curation of scientific
literature. Specifically, we show that models based on Gated
Recurrent Units are more powerful and accurate at annotation
prediction as compared to the LSTM-based models in prior
work. Our findings indicate that deep learning is a promising
new direction for ontology-based text mining, and can be used
for more sophisticated annotation tasks (such as phenotype
curation) that build upon Named Entity Recognition.

                             REFERENCES
 [1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein,
     H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S.
     Dwight, J. T. Eppig et al., “Gene ontology: tool for the
     unification of biology,” Nature genetics, vol. 25, no. 1,
     p. 25, 2000.
 [2] W. Dahdul, T. A. Dececchi, N. Ibrahim, H. Lapp, and
     P. Mabee, “Moving the mountain: analysis of the effort
     required to transform comparative anatomy into com-
     putable anatomy,” Database, vol. 2015, 2015.
 [3] J. Clement, S. Nigam, Y. Cherie, M. Musen, C. Callendar,
     and M. Storey, “Ncbo annotator: semantic annotation of
     biomedical data,” in International Semantic Web Confer-
     ence, Poster and Demo session, 2009.
 [4] C. J. Mungall, J. A. McMurry, S. Köhler, J. P. Balhoff,
     C. Borromeo, M. Brush, S. Carbon, T. Conlin, N. Dunn,
     M. Engelstad et al., “The monarch initiative: an inte-
     grative data and analytic platform connecting phenotypes




[Figure 3: eight panels (CW-BiGRU, CW-BiGRU-CRF, CW-BiLSTM,
CW-BiLSTM-CRF, BiGRU-CRF, BiRNN-CRF, BiLSTM-CRF, MLP), each
showing Accuracy and Loss curves for Training and Validation data
against Number of Epochs (0 to 14).]

Fig. 3.        Comparison of model loss and accuracy on training and validation data using Gene Ontology annotations


     to genotypes across species,” Nucleic acids research,
     vol. 45, no. D1, pp. D712–D722, 2016.
 [5] I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar,
     “Text mining and ontologies in biomedicine: making
     sense of raw text,” Briefings in bioinformatics, vol. 6,
     no. 3, pp. 239–251, 2005.
 [6] H. Cui, W. Dahdul, A. T. Dececchi, N. Ibrahim, P. Mabee,
     J. P. Balhoff, and H. Gopalakrishnan, “Charaparser+
     eq: Performance evaluation without gold standard,” Pro-
     ceedings of the Association for Information Science and
     Technology, vol. 52, no. 1, pp. 1–10, 2015.
 [7] G. Lample, M. Ballesteros, S. Subramanian,
     K. Kawakami, and C. Dyer, “Neural architectures
     for named entity recognition,” arXiv preprint
     arXiv:1603.01360, 2016.
 [8] J. Lafferty, “Conditional random fields: Probabilistic
     models for segmenting and labelling sequence data,” in
     ICML, 2001.
 [9] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, and
     U. Leser, “Deep learning with word embeddings im-
     proves biomedical named entity recognition,” Bioinfor-
     matics, vol. 33, no. 14, pp. i37–i48, 2017.
[10] C. Lyu, B. Chen, Y. Ren, and D. Ji, “Long short-term
     memory rnn for biomedical named entity recognition,”
     BMC bioinformatics, vol. 18, no. 1, p. 462, 2017.
[11] X. Wang, Y. Zhang, X. Ren, Y. Zhang, M. Zitnik,
     J. Shang, C. Langlotz, and J. Han, “Cross-type biomedi-
     cal named entity recognition with deep multi-task learn-
     ing,” arXiv preprint arXiv:1801.09851, 2018.
[12] M. Bada, M. Eckert, D. Evans, K. Garcia, K. Shipley,
     D. Sitnikov, W. A. Baumgartner, K. B. Cohen, K. Ver-
     spoor, J. A. Blake et al., “Concept annotation in the craft
     corpus,” BMC bioinformatics, vol. 13, no. 1, p. 161, 2012.
[13] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M.
     Couto, “Semantic similarity in biomedical ontologies,”
     PLoS computational biology, vol. 5, no. 7, p. e1000443,
     2009.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams,
     “Learning internal representations by error propagation,”
     California Univ San Diego La Jolla Inst for Cognitive
     Science, Tech. Rep., 1985.
[15] L. Faucett, “Fundamentals of neural networks,” Architec-
     ture, Algorithms, 1994.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term
     memory,” Neural computation, vol. 9, no. 8, pp. 1735–
     1780, 1997.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bah-
     danau, F. Bougares, H. Schwenk, and Y. Bengio, “Learn-
     ing phrase representations using rnn encoder-decoder
     for statistical machine translation,” arXiv preprint
     arXiv:1406.1078, 2014.

