=Paper=
{{Paper
|id=Vol-2285/ICBO_2018_paper_18
|storemode=property
|title=Taking a Dive: Experiments in Deep Learning for Automatic Ontology-Based Annotation of Scientific Literature
|pdfUrl=https://ceur-ws.org/Vol-2285/ICBO_2018_paper_18.pdf
|volume=Vol-2285
|authors=Prashanti Manda,Lucas Beasley,Somya D. Mohanty
|dblpUrl=https://dblp.org/rec/conf/icbo/MandaBM18
}}
==Taking a Dive: Experiments in Deep Learning for Automatic Ontology-Based Annotation of Scientific Literature==
Prashanti Manda*, Lucas Beasley, and Somya D. Mohanty*

Department of Computer Science, University of North Carolina at Greensboro, NC, USA

*Equal contributions. Contact: p_manda@uncg.edu, sdmohant@uncg.edu
I. ABSTRACT

Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning for ontology-based curation is a relatively new area, and prior work has focused on a limited set of models.

Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model's performance. We also compare our model to seven models from prior work. We use four metrics - Precision, Recall, F1 score, and a semantic similarity metric (Jaccard similarity) - to compare each model's output to the Gold Standard. Our model resulted in 84% Precision, 84% Recall, 83% F1, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models as compared to word-only inputs.

These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.
II. INTRODUCTION

Ontology-based data representation has been widely adopted in data intensive fields such as biology and biomedicine due to the need for large scale, computationally amenable data [1]. However, the majority of ontology-based data generation relies on manual literature curation - a slow and tedious process [2]. Natural language and text mining methods have been developed as a solution for scalable ontology-based data curation [3, 4].

One of the most important tasks for annotating scientific literature with ontology concepts is Named Entity Recognition (NER). In the context of ontology-based annotation, NER can be described as recognizing ontology concepts from text [5]. Outside the scope of ontology-based annotation, NER has been applied to biomedical and biological literature for recognizing genes, proteins, diseases, etc. [5].

The large majority of ontology-driven NER techniques rely on lexical and syntactic analysis of text in addition to machine learning for recognizing and tagging ontology concepts [3, 4, 6]. In recent years, deep learning has been introduced for NER of biological entities from literature [7, 8, 9, 10, 11]. However, the majority of prior work has focused on a limited set of models, particularly the Long Short-Term Memory (LSTM) model (e.g. [7]).

Here, we present a new deep learning architecture that utilizes Gated Recurrent Units (GRU) while taking advantage of word and character encodings from the annotation training data to recognize ontology concepts from text. We evaluate our model in comparison to seven deep learning models used in prior work and show that our model outperforms the state of the art at the task of ontology-based NER.

We use the Colorado Richly Annotated Full-Text (CRAFT) corpus [12] as a Gold Standard reference to develop and evaluate the deep learning models. The CRAFT corpus contains 67 open access, full length biomedical articles annotated with concepts from several ontologies (such as the Gene Ontology, Protein Ontology, and Sequence Ontology). We use four metrics - 1) Precision, 2) Recall, 3) F1 score, and 4) Jaccard semantic similarity - to compare each model's performance to the Gold Standard.

Precision and Recall are traditionally used to assess the performance of information retrieval systems. However, these metrics do not take into account the notion of partial information retrieval, which is important for ontology-based annotation retrieval. Sometimes, an NLP system might not retrieve the same ontology concept as the gold standard but a related concept (a sub-class or super-class). To assess the performance of the NLP system accurately, we need semantic similarity metrics that can measure different degrees of semantic relatedness between ontology concepts [13]. Here, we use Jaccard similarity to compare annotations from each deep learning model to the gold standard. Jaccard similarity assesses the similarity between two ontology terms based on the ontological distance between them - the closer two terms are, the more similar they are considered to be [13].
III. RELATED WORK

The application of deep learning for ontology-based Named Entity Recognition is a nascent area with relatively little prior work. Habibi et al. [9] studied entity recognition on biomedical literature using a long short-term memory network-conditional random field (LSTM-CRF) and showed that the method outperformed other NER tools that do not use deep learning or use deep learning methods without word embeddings. Lyu et al. [10] also explored LSTM based models enhanced with word and character embeddings. They do not evaluate other deep learning models but present results only based on LSTM with word embeddings. Wang et al. [11] also propose an LSTM based method for recognizing biomedical entities from literature. Similar to the above studies, Wang et al. show that a bidirectional LSTM method used with a Conditional Random Field (CRF) and word embeddings outperforms other methods.

The striking difference between these prior studies and our work here is that the majority of prior literature focuses on LSTM based methods along with CRF and word embeddings. The potential of other deep learning models, such as Recurrent Neural Networks and Gated Recurrent Units, at the task of ontology-based NER remains unexplored, presenting a unique need and opportunity. Our study aims to fill this knowledge gap. In addition, all the above studies focus on non-ontology based NER for entities such as genes and disease names. In contrast, our study's focus is on recognizing ontology concepts within text.
IV. METHODS

A. Data Preprocessing

Annotation files for the 67 papers in CRAFT were cleaned to remove punctuation symbols (except for the period at the end of sentences), special symbols, and non-ASCII characters. Annotations for the GO, CHEBI, Cell, Protein, and Sequence ontologies were converted from the cleaned files to separate ontology-specific text files that represent the presence or absence of ontology terms. For each ontology, every sentence containing at least one annotation from that ontology was represented using two lines in the ontology-specific text file. The first of these two lines contained an array with each word in the sentence. The second contained an ordered encoding corresponding to the words in the first line. Each encoding could be an ontology concept ID, if the corresponding word was annotated in CRAFT, or an 'O' if the corresponding word was not annotated.

For example, the sentence "Rod and cone photoreceptors subserve vision under dim and bright light conditions respectively", where the word "vision" was annotated to GO ID "GO:0007601 (perception of sight)", would be represented using the two lines below:

• ['Rod', 'and', 'cone', 'photoreceptors', 'subserve', 'vision', 'under', 'dim', 'and', 'bright', 'light', 'conditions', 'respectively']
• ['O', 'O', 'O', 'O', 'O', 'GO:0007601', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Only annotations to single words (unigrams) were included in these preprocessed files. So, if an annotation was made in CRAFT to a phrase containing more than one word, it was ignored in the preprocessed data.
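For illustration, files in this two-line format can be parsed back into (tokens, labels) pairs with a few lines of Python. This is a minimal sketch, not the authors' code; it assumes each line is stored as a Python-style list literal, and the file name go_annotations.txt is only an example:

```python
from pathlib import Path
from ast import literal_eval

def load_sentences(path):
    """Parse an ontology-specific file in the two-line format above:
    odd lines hold the token array, even lines the aligned label array."""
    lines = Path(path).read_text().splitlines()
    pairs = []
    for tokens_line, labels_line in zip(lines[0::2], lines[1::2]):
        tokens = literal_eval(tokens_line)   # e.g. ['Rod', 'and', ...]
        labels = literal_eval(labels_line)   # e.g. ['O', ..., 'GO:0007601', ...]
        assert len(tokens) == len(labels), "each word needs exactly one tag"
        pairs.append((tokens, labels))
    return pairs

# Usage: sentences = load_sentences("go_annotations.txt")
```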
B. Performance evaluation metrics

Precision, Recall, F1 score, and Jaccard similarity were used to evaluate the performance of the models. The Jaccard similarity (J) of two ontology concepts (in this case, annotations) A and B in an ontology is defined as the ratio of the number of classes in the intersection of their subsumers over the number of classes in the union of their subsumers [13]:

$$J(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$

where $S(A)$ is the set of classes that subsume $A$. Jaccard similarity ranges from 0 (no similarity) to 1 (exact match).
C. Deep learning models

Below, we describe four deep learning models - Multi-Layer Perceptrons, Recurrent Neural Networks, Long Short-Term Memory, and Gated Recurrent Units. Next, we describe three architectures - window-based, word-based, and character+word based - that can be used in conjunction with the above models. Finally, we describe our new model, which combines the character+word based architecture with Gated Recurrent Units, and six models used in prior work.

1) Multi-Layer Perceptron (MLP): A Multi-Layer Perceptron (MLP) [14] is a feed-forward deep neural network model which consists of an input layer, one or more hidden layers, and an output layer, each consisting of a number of perceptrons. A single perceptron computes the output as $\gamma = \varphi(\sum_{i=1}^{n} w_i x_i + b)$, where $w$ is the weight vector, $x$ is the provided input, $b$ is the bias, and $\varphi$ is the activation function. The weights and biases of each perceptron in the layers are adjusted using back-propagation to minimize prediction error.
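For concreteness, the single-perceptron computation above looks like this in NumPy. This is a sketch with made-up numbers; any activation function could stand in for $\varphi$:

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """gamma = activation(sum_i w_i * x_i + b), the unit described above."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
print(perceptron(x, w, b=0.2))   # tanh(0.5*0.8 - 1.0*0.1 + 2.0*(-0.4) + 0.2)
```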
2) Recurrent Neural Network: A Recurrent Neural Network (RNN) [15] is an adaptation of feed-forward neural networks in which the history of the input sequence is taken into consideration for future prediction. Given an input sequence $\langle x_0, x_1, x_2, \cdots x_i \rangle$, the hidden state ($h_t$) of an RNN is updated as follows:

$$h_t = \begin{cases} 0, & t = 0 \\ \sigma(W^{hh} h_{t-1} + W^{hx} x_t), & t > 0 \end{cases} \qquad y_t = \mathrm{softmax}(W^{s} h_t) \tag{1}$$

where $x_t$ is the input provided to the hidden state $h_t$ at time $t$, which is updated using a sigmoid function $\sigma$ calculated over the previous time state of the network, given by $h_{t-1}$, and the current input $x_t$. $W^{hh}$, $W^{hx}$, and $W^{s}$ are the weights computed over training. The network can then produce an output prediction $\langle y_0, y_1, y_2, \cdots y_j \rangle$ using a softmax function on the hidden state $h_t$.

A bidirectional Recurrent Neural Network (BiRNN) is an RNN where the input data is fed to the neural network twice - once in forward and again in reverse order.
3) Long Short-Term Memory: While RNNs are effective in learning temporal patterns, they suffer from a vanishing gradient problem in which long term dependencies are lost. A solution to the problem was proposed by Hochreiter and Schmidhuber [16] in a variation of RNNs called Long Short-Term Memory (LSTM). LSTMs use a memory cell ($c_t$) to keep track of long-term relationships in text. Using a gated architecture (input, output, and forget gates), LSTMs are able to modulate the exposure of the memory cell by regulating the gates. LSTMs can be defined as:

$$\begin{aligned}
i_t &= \sigma(W^{ix} x_t + W^{ih} h_{t-1}) \\
f_t &= \sigma(W^{fx} x_t + W^{fh} h_{t-1}) \\
o_t &= \sigma(W^{ox} x_t + W^{oh} h_{t-1}) \\
g_t &= \tanh(W^{gx} x_t + W^{gh} h_{t-1}) \\
c_t &= c_{t-1} \odot f_t + g_t \odot i_t \\
h_t &= \tanh(c_t) \odot o_t
\end{aligned} \tag{2}$$

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates respectively. Each gate uses a sigmoid ($\sigma$) function applied over the sum of the input $x_t$ and the previous hidden state $h_{t-1}$ (multiplied with their weight matrices $W$). $g_t$ denotes the candidate state, computed with a tanh function on the input and previous hidden state. $W^{ix}$, $W^{fx}$, $W^{ox}$, and $W^{gx}$ are weight matrices used with the input $x_t$, while $W^{ih}$, $W^{fh}$, $W^{oh}$, and $W^{gh}$ are used with the hidden state for each gate and the candidate state. The memory cell $c_t$ multiplies ($\odot$ - element-wise) the forget gate ($f_t$) with the old memory cell $c_{t-1}$ and adds the candidate state ($g_t$) multiplied with the input gate ($i_t$). The hidden state is given by a tanh function applied to the memory cell $c_t$, multiplied with the output gate ($o_t$).
4) Gated Recurrent Unit: A variation on the LSTM was introduced by Cho et al. [17] as the Gated Recurrent Unit (GRU). Using update and reset gates, GRUs are able to control the amount of information within a unit (without a separate memory cell as with the LSTM). GRUs can formally be defined as:

$$\begin{aligned}
z_t &= \sigma(W^{zx} x_t + W^{zh} h_{t-1}) \\
r_t &= \sigma(W^{rx} x_t + W^{rh} h_{t-1}) \\
\tilde{h}_t &= \tanh(W^{hx} x_t + r_t \odot W^{hh} h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned} \tag{3}$$

where $z_t$ and $r_t$ are the update and reset gates respectively, and $\tilde{h}_t$ is the candidate activation/hidden state.

Similar to the LSTM architecture, GRUs benefit from the additive properties in their network to remember long term dependencies and solve the vanishing gradient problem. Since GRUs do not utilize an output gate, they are able to write the entire contents of their memory to the network. The lack of a separate memory cell also makes GRUs more efficient in comparison to LSTMs.
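A single GRU step, transcribed directly from Equation (3), can be sketched in NumPy as follows (the weights are random placeholders; a trained network would learn them):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    """One GRU update following Equation (3)."""
    z = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    r = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    h_cand = np.tanh(W["hx"] @ x_t + r * (W["hh"] @ h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_cand                     # new hidden state

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_in if k.endswith("x") else n_hid))
     for k in ["zx", "zh", "rx", "rh", "hx", "hh"]}
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # run a length-5 input sequence
    h = gru_step(x, h, W)
```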
D. Deep learning Architectures

Below, we describe three architectures - window-based, word-based, and character+word based - to be used in conjunction with the different models described above.

1) Window-based: In this architecture, the window-based input ($i_v$) consists of feature vectors ($f_v$) for each word/term ($t$) within an encoded sentence. Each $f_v$ consisted of the following attributes:

$$f_v = \langle t, n_t, t_1, t_{-1}, t_C, t_{aC}, t_{al}, t_{p0}, t_{p0-1}, t_{p0-2}, t_{s0}, t_{s0-1}, t_{s0-2}, t_N, t_P \rangle \tag{4}$$

where:
• $t$ is the term,
• $n_t$ is the number of terms in the sentence,
• $t_1$ is a boolean value indicating whether the term is the first term in the sentence,
• $t_{-1}$ is 1 if the term is the last term in the sentence, 0 otherwise,
• $t_C$ is 1 if the first letter in $t$ is uppercase,
• $t_{aC}$ is 1 if all letters in $t$ are uppercase,
• $t_{al}$ is 1 if all letters in $t$ are lowercase,
• $t_{p0}, t_{p0-1}, t_{p0-2}$ record character prefixes of $t$ at various window sizes,
• $t_{s0}, t_{s0-1}, t_{s0-2}$ record character suffixes of $t$ at various window sizes,
• $t_N$ and $t_P$ are the next and previous terms respectively.
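This feature extraction can be sketched in Python as follows. This is our own illustrative rendering of Equation (4); we assume prefix/suffix windows of lengths 1-3, which the text leaves unspecified:

```python
def window_features(tokens, i):
    """Build the feature vector f_v of Equation (4) for tokens[i]."""
    t = tokens[i]
    return {
        "term": t,
        "n_terms": len(tokens),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
        "init_cap": t[:1].isupper(),
        "all_caps": t.isupper(),
        "all_lower": t.islower(),
        # prefixes/suffixes at window sizes 1-3 (assumed values)
        **{f"prefix_{n}": t[:n] for n in (1, 2, 3)},
        **{f"suffix_{n}": t[-n:] for n in (1, 2, 3)},
        "next": tokens[i + 1] if i + 1 < len(tokens) else None,
        "prev": tokens[i - 1] if i > 0 else None,
    }

sent = ['Rod', 'and', 'cone', 'photoreceptors']
print(window_features(sent, 0)["init_cap"])  # True
```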
2) Word-based: Each word and its corresponding annotation labels (tags) are encoded with integer values derived from the unique words and annotations present in the corpus. The dataset was based on unigram annotations, which only use ontology annotations where a single word in text maps to an ontology concept.

In word-based architectures (Figure 1), the input ($X^W_{tr}$) is provided to an Embedding layer, which converts the input into dense vectors of 100 dimensions. The output vectors are then fed to a bidirectional model (RNN/GRU/LSTM) consisting of 150 hidden units. The output from the model goes to a dense perceptron layer using ReLU activation, which also employs a 0.6 Dropout. The output is further fed into a CRF layer, which looks for correlations between annotations in close sequences to generate the predictions ($y_{pr}$).

Fig. 1. Word-based architecture using bidirectional RNN/GRU/LSTM models
3) Character+Word based: A character+word based architecture is similar to the word-based architecture described above. In addition to word-based inputs ($X^W_{tr}$), it also takes advantage of characters ($X^C_{tr}$) within words to make predictions.
E. Model development

We developed a new deep learning architecture that uses a character+word based architecture coupled with two bidirectional Gated Recurrent Units. Our architecture (Figure 2) consists of character-level input ($X^C_{tr}$) provided to an Embedding layer ($E_1$), which compresses the dimensions of characters to the number of unique annotations in the corpus ($NTags$). The output of Embedding layer $E_1$ is fed to a bidirectional GRU ($BiGRU_1$) layer with 150 units, followed by a 60% output drop in a Dropout layer ($D_1$). Simultaneously, the word-level input ($X^W_{tr}$) was provided to a second Embedding layer ($E_2$) with 30 dimensions. The output from $E_2$ was concatenated with the output from the first Dropout layer $D_1$ and fed through a second Dropout layer ($D_2$) with a 30% drop. Output from $D_2$ was fed into a second bidirectional GRU layer ($BiGRU_2$) consisting of 150 units.

The above model was tested with and without a final CRF layer, leading to two new configurations - CW-BiGRU-CRF and CW-BiGRU. The models were run for 15 epochs with a batch size of 32 instances.
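The following Keras sketch is our reading of Figure 2 under stated assumptions, not the authors' implementation: padded lengths (max_len, max_word_len) and the character vocabulary size are assumed; per-word character sequences are our interpretation of the character-level input; and a softmax head stands in for the optional CRF layer of the CW-BiGRU-CRF configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

max_len, max_word_len = 100, 15    # assumed padding lengths
n_chars = 80                       # assumed character vocabulary size
n_words = 9571 + 1                 # GO vocabulary (Table I) + padding index
n_tags = 359 + 1                   # unique GO annotations + the 'O' tag

# Character branch: embed each word's characters, then run BiGRU_1 per word.
char_in = layers.Input(shape=(max_len, max_word_len))
char_emb = layers.Embedding(input_dim=n_chars, output_dim=n_tags)(char_in)     # E_1
char_vec = layers.TimeDistributed(layers.Bidirectional(layers.GRU(150)))(char_emb)  # BiGRU_1
char_vec = layers.Dropout(0.6)(char_vec)                                       # D_1

# Word branch: 30-dimensional word embeddings.
word_in = layers.Input(shape=(max_len,))
word_emb = layers.Embedding(input_dim=n_words, output_dim=30)(word_in)         # E_2

# Merge, second dropout, sentence-level BiGRU, and per-token tag prediction.
x = layers.concatenate([char_vec, word_emb])
x = layers.Dropout(0.3)(x)                                                     # D_2
x = layers.Bidirectional(layers.GRU(150, return_sequences=True))(x)            # BiGRU_2
out = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(x)

model = tf.keras.Model([char_in, word_in], out)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
# Training would then follow the settings above:
# model.fit([X_char, X_word], y, batch_size=32, epochs=15)
```

Note that $E_1$'s output dimension equals $NTags$ only because the description above says the character embedding is compressed to the number of unique annotations; a free embedding size would be the more conventional choice.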
F. Model Comparison

We compared the performance of our new character+word based GRU architecture and the two models developed therein (CW-BiGRU-CRF and CW-BiGRU; Section IV-E) to six state of the art models that have been used in prior work. Below, we specify the component details of each of the six prior models that were evaluated.

1) MLP: Multi-layer perceptrons were used with a window-based architecture to create a three layered (input, hidden, output) MLP model. The input and hidden layers consisted of 512 perceptrons with a Rectified Linear Unit (ReLU) activation function, while the output layer consisted of perceptrons equal to the number of unique annotations in the corpus ($NTags$). 20% Dropout was used for the hidden and output layers to prevent overfitting of the data. Categorical cross-entropy was used for calculating the loss function and NAdam (Adam with Nesterov momentum) was used as the optimizer function. Each of the feature vectors (from the training data) was fed into the MLP architecture for 15 epochs with a batch size of 256.

2) BiRNN-CRF: The BiRNN-CRF model uses a word-based input coupled with a BiRNN model and ending with a CRF model. Similar to the BiRNN architecture (Figure 1), the BiRNN-CRF model consists of a 100 dimension Embedding layer followed by a BiRNN with 150 units, followed by a 0.6 Dropout layer. The output of the 0.6 Dropout layer is fed to a CRF, which generates the predicted output.

3) BiLSTM-CRF: The BiLSTM-CRF model is identical to the BiRNN-CRF except that it uses an LSTM in place of the RNN.

4) BiGRU-CRF: The BiGRU-CRF model is identical to BiRNN-CRF and BiLSTM-CRF except that it uses a Gated Recurrent Unit in place of the RNN or LSTM.

5) CW-BiLSTM: The CW-BiLSTM model is similar to the CW-BiGRU model described above (see Section IV-E) except that the BiGRU is replaced with a BiLSTM.

6) CW-BiLSTM-CRF: The CW-BiLSTM-CRF model is developed by adding a CRF layer at the end of the CW-BiLSTM model pipeline, meaning that the output of the CW-BiLSTM model is fed to a CRF layer to generate the final predictions.

G. Parameter Tuning

The GO annotation data was split into training and test sets using a 70:30 ratio. The training set was used to tune the following parameters for all models. Multiple architecture parameters - 1) number of layers in the MLP (along with number of perceptrons), 2) number of units in the RNN/GRU/LSTM, 3) embedding dimensions for characters and words, and 4) optimization functions - were evaluated for model performance. A grid search was explored, where each architecture was evaluated for different combinations of the parameters. In each case, model performance metrics were recorded in the form of Precision, Recall, F1 score, and Jaccard similarity.
Fig. 2. Character+word based architecture using two bidirectional GRU models.
H. Experiments to predict ontology annotations

The largest number of annotations in the CRAFT corpus came from the Gene Ontology. So, we first used the GO annotations to train and test the suite of eight models described above. Subsequently, we applied the best model from these experiments to annotate the CRAFT corpus with the other four ontologies (the CHEBI, Cell, Protein, and Sequence corpora).

The Root-Mean-Square propagation (RMSProp) optimizer was used to test the performance of the different models. A batch size of 32 along with 15 epochs was used for model training.

TABLE I. Characteristics of the CRAFT corpus - number of sentences with at least one annotation, number of unique annotations (unigrams only), and number of unique words in the corpus.

| Dataset  | Number of Sentences | Number of Unique Annotations | Number of Unique Words in the Corpus |
|----------|---------------------|------------------------------|--------------------------------------|
| GO       | 17,921              | 359                          | 9,571                                |
| Sequence | 15,606              | 156                          | 7,262                                |
| Protein  | 12,621              | 546                          | 5,153                                |
| CHEBI    | 11,109              | 309                          | 3,127                                |
| Cell     | 9,088               | 68                           | 3,042                                |
Performance characteristics in terms of training and test loss (calculated using the CRF function), prediction Precision, Recall, and F1 score, along with the mean semantic similarity score, were recorded for each model.

V. RESULTS AND DISCUSSION

The CRAFT corpus contains 67 full length papers with annotations from five ontologies (GO, CHEBI, Cell, Protein, and Sequence). For each of these ontologies, we extracted all sentences across the 67 papers with at least one annotation for the ontology. The largest number of annotations came from the GO (Table I), while the Cell ontology accounted for the lowest number of annotations.

Figure 3 shows the loss and accuracy trends for each model on the GO annotation data. The goal of the models is to minimize loss while increasing accuracy as the number of epochs increases.

First, we see that our CW-BiGRU model shows improvement in both training and validation accuracy as the number of epochs increases. Correspondingly, we observe a decrease in training and validation loss, indicating that the model is able to self-improve with each subsequent epoch.

The CW-BiGRU-CRF model initially shows the same accuracy improvement as the CW-BiGRU model, but later increases in epochs result in a divergence in the training and validation accuracy, indicating that the model might be prone to overfitting. While there is a substantial decrease in training loss, a similar decrease is not observed in validation loss.

CW-BiLSTM shows similar trends to CW-BiGRU. CW-BiLSTM-CRF training and validation accuracy increase similarly until a certain point, after which the validation accuracy drops and diverges sharply from the training curve, indicating a case of overfitting.

The BiGRU-CRF and BiRNN-CRF models show substantial improvement in accuracy with increasing epochs. However, BiRNN-CRF shows divergence in the loss patterns. Similar to CW-BiLSTM-CRF, BiLSTM-CRF also shows signs of overfitting in the accuracy patterns. MLP is the worst performing model, with very minor improvements in validation accuracy as the number of epochs increases, indicating that the model is unable to improve itself with each subsequent epoch.
It is clear that the CW-BiGRU models are able to outperform the other models by improving accuracy and reducing loss with each epoch without overfitting.

A large proportion of the input data is not annotated to GO terms but to the tag 'O', indicating the absence of an annotation. In addition to accurately predicting GO annotations, a model also needs to accurately predict the absence of an annotation. However, given the disproportionate amount of data pertaining to the absence of annotations, the models were observed to predict the absence of annotations remarkably accurately in comparison to predicting presence.

To provide a more conservative view of the models' performance, we report Precision, Recall, F1 score, and Jaccard similarity (Table II) only on data indicating the presence of ontology terms, i.e. text annotated with an ontology term. Unlike the accuracy measurements above, the metrics below do not take into account the models' performance at identifying the absence of annotations, but rather focus on the ability to identify annotations when they are present in the Gold Standard.
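Concretely, presence-only scores of this kind can be computed by masking out the 'O' positions before scoring. This is a simplified sketch using scikit-learn (the token-level arrays are illustrative, and the paper's exact evaluation protocol is not published):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array(["O", "O", "GO:0007601", "O", "GO:0008150"])
y_pred = np.array(["O", "O", "GO:0007601", "O", "O"])

mask = y_true != "O"   # keep only positions annotated in the Gold Standard
p, r, f1, _ = precision_recall_fscore_support(
    y_true[mask], y_pred[mask], average="micro")
print(p, r, f1)
```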
TABLE II. Precision, Recall, F1, and Jaccard similarity scores for the eight models on CRAFT Gene Ontology annotation data.

| Model         | Precision | Recall | F1   | Jaccard Similarity |
|---------------|-----------|--------|------|--------------------|
| CW-BiGRU      | 0.84      | 0.84   | 0.83 | 0.84               |
| CW-BiLSTM     | 0.80      | 0.82   | 0.80 | 0.83               |
| CW-BiLSTM-CRF | 0.80      | 0.82   | 0.80 | 0.82               |
| CW-BiGRU-CRF  | 0.77      | 0.80   | 0.78 | 0.82               |
| BiGRU-CRF     | 0.75      | 0.77   | 0.75 | 0.78               |
| BiRNN-CRF     | 0.72      | 0.74   | 0.72 | 0.75               |
| BiLSTM-CRF    | 0.70      | 0.70   | 0.70 | 0.71               |
| MLP           | 0.65      | 0.60   | 0.61 | 0.61               |

These results (Table II and Figure 3) show that our model (CW-BiGRU) outperforms the other seven models in all four metrics. Our model outperforms the best among the other seven models (CW-BiLSTM) by 4% (Precision), 2% (Recall), 3% (F1 score), and 1% (Jaccard similarity).
Additionally, we observe that character+word based models (CW-BiGRU, CW-BiLSTM, CW-BiLSTM-CRF, CW-BiGRU-CRF) outperform models that use only word embeddings. Among the character+word based models, surprisingly, the addition of an extra CRF layer (CW-BiLSTM-CRF, CW-BiGRU-CRF) either fails to improve performance (e.g. CW-BiLSTM vs. CW-BiLSTM-CRF) or leads to a decline in performance (e.g. CW-BiGRU vs. CW-BiGRU-CRF) as compared to not using a CRF end layer (CW-BiLSTM, CW-BiGRU).

The MLP model shows substantially lower performance as compared to the other models across all four metrics. The Accuracy and Loss plots (Figure 3) suggest that the decline in performance when adding a CRF layer is due to potential overfitting.

We explored how predictions from our best model, CW-BiGRU, diverge from the Gold Standard. We found that the majority of predictions (89.25%) are an exact match for the CRAFT annotations. Surprisingly, only a small proportion of predictions are partial matches (2.45%). 8.26% of the model's predictions are false negatives, while 6.38% are false positives. We hypothesize that one of the primary reasons for false negatives might be a lack of enough training instances for those particular GO annotations.

Finally, we applied the best performing model from the above evaluation (CW-BiGRU) and tested it on data from the four other ontologies. Interestingly, the model shows better prediction performance on the other ontologies as compared to GO, despite the substantially smaller training datasets (Table III).

TABLE III. Precision, Recall, F1, and Jaccard similarity scores for the CW-BiGRU model on annotations from the five ontologies in CRAFT.

| Model    | Ontology | Precision | Recall | F1   | Jaccard Similarity |
|----------|----------|-----------|--------|------|--------------------|
| CW-BiGRU | Cell     | 0.92      | 0.92   | 0.92 | 0.925              |
| CW-BiGRU | Protein  | 0.91      | 0.90   | 0.90 | 0.917              |
| CW-BiGRU | CHEBI    | 0.86      | 0.87   | 0.86 | 0.882              |
| CW-BiGRU | GO       | 0.84      | 0.84   | 0.83 | 0.843              |
| CW-BiGRU | Sequence | 0.83      | 0.86   | 0.84 | 0.864              |

VI. CONCLUSIONS AND FUTURE WORK

The data used in this study was limited to single words annotated to ontology concepts (unigrams). Next, we will explore more robust models, including n-grams, to account for sequences of words tagged with an annotation. Future work will also include models that can be trained to weight the prediction of some target classes higher than others. These models would be able to prioritize predicting the presence of annotations as compared to the absence of an annotation.

This study demonstrates the utility of deep learning approaches for automated ontology-based curation of scientific literature. Specifically, we show that models based on Gated Recurrent Units are more powerful and accurate at annotation prediction as compared to the LSTM based models in prior work. Our findings indicate that deep learning is a promising new direction for ontology-based text mining, and can be used for more sophisticated annotation tasks (such as phenotype curation) that build upon Named Entity Recognition.
Fig. 3. Comparison of model loss and accuracy on training and validation data using Gene Ontology annotations. [Figure: eight panels (CW-BiGRU, CW-BiGRU-CRF, CW-BiLSTM, CW-BiLSTM-CRF, BiGRU-CRF, BiRNN-CRF, BiLSTM-CRF, MLP), each plotting Accuracy and Loss against the number of epochs for the training and validation data.]
REFERENCES

[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al., "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, p. 25, 2000.
[2] W. Dahdul, T. A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee, "Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy," Database, vol. 2015, 2015.
[3] J. Clement, S. Nigam, Y. Cherie, M. Musen, C. Callendar, and M. Storey, "NCBO Annotator: semantic annotation of biomedical data," in International Semantic Web Conference, Poster and Demo session, 2009.
[4] C. J. Mungall, J. A. McMurry, S. Köhler, J. P. Balhoff, C. Borromeo, M. Brush, S. Carbon, T. Conlin, N. Dunn, M. Engelstad et al., "The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species," Nucleic Acids Research, vol. 45, no. D1, pp. D712–D722, 2016.
[5] I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, "Text mining and ontologies in biomedicine: making sense of raw text," Briefings in Bioinformatics, vol. 6, no. 3, pp. 239–251, 2005.
[6] H. Cui, W. Dahdul, A. T. Dececchi, N. Ibrahim, P. Mabee, J. P. Balhoff, and H. Gopalakrishnan, "CharaParser+EQ: Performance evaluation without gold standard," Proceedings of the Association for Information Science and Technology, vol. 52, no. 1, pp. 1–10, 2015.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.
[8] J. Lafferty, "Conditional random fields: Probabilistic models for segmenting and labelling sequence data," in ICML, 2001.
[9] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, and U. Leser, "Deep learning with word embeddings improves biomedical named entity recognition," Bioinformatics, vol. 33, no. 14, pp. i37–i48, 2017.
[10] C. Lyu, B. Chen, Y. Ren, and D. Ji, "Long short-term memory RNN for biomedical named entity recognition," BMC Bioinformatics, vol. 18, no. 1, p. 462, 2017.
[11] X. Wang, Y. Zhang, X. Ren, Y. Zhang, M. Zitnik, J. Shang, C. Langlotz, and J. Han, "Cross-type biomedical named entity recognition with deep multi-task learning," arXiv preprint arXiv:1801.09851, 2018.
[12] M. Bada, M. Eckert, D. Evans, K. Garcia, K. Shipley, D. Sitnikov, W. A. Baumgartner, K. B. Cohen, K. Verspoor, J. A. Blake et al., "Concept annotation in the CRAFT corpus," BMC Bioinformatics, vol. 13, no. 1, p. 161, 2012.
[13] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M. Couto, "Semantic similarity in biomedical ontologies," PLoS Computational Biology, vol. 5, no. 7, p. e1000443, 2009.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
[15] L. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, 1994.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.