Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA, August 7-10, 2018

Taking a Dive: Experiments in Deep Learning for Automatic Ontology-based Annotation of Scientific Literature

Prashanti Manda1,*, Lucas Beasley1 and Somya D. Mohanty1,*
1 Department of Computer Science, University of North Carolina at Greensboro, NC, USA
* Equal contributions
p_manda@uncg.edu, sdmohant@uncg.edu

I. ABSTRACT

Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning for ontology-based curation is a relatively new area and prior work has focused on a limited set of models.

Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model’s performance. We also compare our model to seven models from prior work. We use four metrics (Precision, Recall, F1 score, and a semantic similarity metric, Jaccard similarity) to compare our model’s output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, 83% F1, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models as compared to word only inputs.

These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.

II. INTRODUCTION

Ontology-based data representation has been widely adopted in data intensive fields such as biology and biomedicine due to the need for large scale computationally amenable data [1]. However, the majority of ontology-based data generation relies on manual literature curation, a slow and tedious process [2]. Natural language and text mining methods have been developed as the solution for scalable ontology-based data curation [3, 4].

One of the most important tasks for annotating scientific literature with ontology concepts is Named Entity Recognition (NER). In the context of ontology-based annotation, NER can be described as recognizing ontology concepts from text [5]. Outside the scope of ontology-based annotation, NER has been applied to biomedical and biological literature for recognizing genes, proteins, diseases, etc. [5].

The large majority of ontology driven NER techniques rely on lexical and syntactic analysis of text in addition to machine learning for recognizing and tagging ontology concepts [3, 4, 6]. In recent years, deep learning has been introduced for NER of biological entities from literature [7, 8, 9, 10, 11]. However, the majority of prior work has focused on a limited set of models, particularly the Long Short-Term Memory (LSTM) model (e.g. [7]).

Here, we present a new deep learning architecture that utilizes Gated Recurrent Units (GRU) while taking advantage of word and character encodings from the annotation training data to recognize ontology concepts from text. We evaluate our model in comparison to 7 deep learning models used in prior work to show that our model outperforms the state of the art at the task of ontology-based NER.

We use the Colorado Richly Annotated Full-Text (CRAFT) corpus [12] as a Gold Standard reference to develop and evaluate the deep learning models. The CRAFT corpus contains 67 open access, full length biomedical articles annotated with concepts from several ontologies (such as Gene Ontology, Protein Ontology, Sequence Ontology, etc.). We use four metrics, 1) Precision, 2) Recall, 3) F-1 Score and 4) Jaccard semantic similarity, to compare each model’s performance to the Gold Standard.

Precision and Recall are traditionally used to assess the performance of information retrieval systems. However, these metrics do not take into account the notion of partial information retrieval, which is important for ontology-based annotation retrieval. Sometimes, an NLP system might not retrieve the same ontology concept as the gold standard but a related concept (sub-class or super-class). To assess the performance of the NLP system accurately, we need semantic similarity metrics that can measure different degrees of semantic relatedness between ontology concepts [13]. Here, we use Jaccard similarity to compare annotations from each deep learning model to the gold standard. Jaccard similarity assesses similarity between two ontology terms based on the ontological distance between them: the closer two terms are, the more similar they are considered to be [13].

III. RELATED WORK

The application of deep learning for ontology-based Named Entity Recognition is a nascent area with relatively little prior work. Habibi et al. [9] studied entity recognition on biomedical literature using long short-term memory network-conditional random field (LSTM-CRF) and showed that the method outperformed other NER tools that do not use deep learning or use deep learning methods without word embeddings. Lyu et al. [10] also explored LSTM based models enhanced with word and character embeddings. They do not evaluate other deep learning models but present results only based on LSTM with word embeddings. Wang et al. [11] also propose an LSTM based method for recognizing biomedical entities from literature. Similar to the above studies, Wang et al. show that a bidirectional LSTM method used with Conditional Random Field (CRF) and word embeddings outperforms other methods.

The striking difference between these prior studies and our work here is that the majority of prior literature focuses on LSTM based methods along with CRF and word embeddings. The potential of other deep learning models such as Recurrent Neural Networks, Gated Recurrent Units, etc., at the task of ontology-based NER remains unexplored, presenting a unique need and opportunity. Our study aims to fill this knowledge gap. In addition, all the above studies focus on non-ontology based NER for entities such as genes, disease names, etc. In contrast, our study’s focus is on recognizing ontology concepts within text.

IV. METHODS

A. Data Preprocessing

Annotation files for the 67 papers in CRAFT were cleaned to remove punctuation symbols (except for the period at the end of sentences), special symbols, and non-ASCII characters. Annotations for GO, CHEBI, Cell, Protein, and Sequence ontologies were converted from the cleaned files to separate ontology-specific text files that represent the presence or absence of ontology terms. For each ontology, every sentence containing at least one annotation from that ontology was represented using two lines in the ontology-specific text file. The first of these two lines contained an array with each word in the sentence. The second contained an ordered encoding corresponding to words in the first line. These encodings could be an ontology concept ID if the corresponding word was annotated in CRAFT or an 'O' if the corresponding word was not annotated.

For example, the sentence “Rod and cone photoreceptors subserve vision under dim and bright light conditions respectively” where the word “vision” was annotated to GO ID “GO:0007601 (perception of sight)” would be represented using the two lines below:

• ['Rod', 'and', 'cone', 'photoreceptors', 'subserve', 'vision', 'under', 'dim', 'and', 'bright', 'light', 'conditions', 'respectively']
• ['O', 'O', 'O', 'O', 'O', 'GO:0007601', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Only annotations to single words (unigrams) were included in these preprocessed files. So, if an annotation was made in CRAFT to a phrase containing more than one word, it was ignored in the preprocessed data.

B. Performance evaluation metrics

Precision, Recall, F1-score, and Jaccard similarity were used to evaluate the performance of the models. The Jaccard similarity (J) of two ontology concepts (in this case, annotations) (A, B) in an ontology is defined as the ratio of the number of classes in the intersection of their subsumers over the number of classes in the union of their subsumers [13].

\[
J(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}
\]

where $S(A)$ is the set of classes that subsume $A$. Jaccard similarity ranges from 0 (no similarity) to 1 (exact match).

C. Deep learning models

Below, we describe four deep learning models: Multi-layer perceptrons, Recurrent Neural Networks, Long Short-Term Memory, and Gated Recurrent Units. Next, we describe three architectures (window based, word based, and character-word based) that can be used in conjunction with the above models. Finally, we describe our new model that combines the character-word based architecture with Gated Recurrent Units, and six models used in prior work.

1) Multi-Layer Perceptron (MLP): A Multi-Layer Perceptron (MLP) [14] is a feed-forward deep neural network model which consists of an input layer, one or more hidden layers, and an output layer, each consisting of a number of perceptrons. A single perceptron computes the output as $\gamma = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right)$, where $w$ is the weight vector, $x$ is the provided input, $b$ is the bias, and $\phi$ is the activation function. The weights and biases of each perceptron in the layers are adjusted using back-propagation to minimize prediction error.

2) Recurrent Neural Network: A Recurrent Neural Network (RNN) [15] is an adaptation of feed-forward neural networks, where the history of the input sequence is taken into consideration for future prediction. Given an input sequence $\langle x_0, x_1, x_2, \cdots, x_i \rangle$, the hidden state ($h_t$) of an RNN is updated as follows:

\[
h_t =
\begin{cases}
0, & t = 0 \\
\sigma(W^{hh} h_{t-1} + W^{hx} x_t), & t > 0
\end{cases}
\qquad
y_t = \mathrm{softmax}(W^{s} h_t)
\tag{1}
\]

where $x_t$ is the input provided to the hidden state $h_t$ at time $t$, which is updated using a sigmoid function $\sigma$. $\sigma$ is calculated over the previous time state of the network given by $h_{t-1}$ and the current input $x_t$. $W^{hh}$, $W^{hx}$, and $W^{s}$ are the weights computed over training. The network can then produce an output prediction $\langle y_0, y_1, y_2, \cdots, y_j \rangle$ using a softmax function on the hidden state $h_t$.

A bidirectional Recurrent Neural Network (BiRNN) is an RNN where the input data is fed to the neural network two times, once in forward and again in reverse order.

3) Long-Short Term Memory: While RNNs are effective in learning temporal patterns, they suffer from a vanishing gradient problem where long term dependencies are lost. A solution to the problem was proposed by Hochreiter et al. [16] by using a variation of RNNs called Long-Short Term Memory (LSTM). LSTMs use a memory cell ($c_t$) to keep track of long-term relationships between text. Using a gated architecture (input, output, and forget), LSTMs are able to modulate the exposure of a memory cell by regulating the gates. LSTMs can be defined as:

\[
\begin{aligned}
i_t &= \sigma(W^{ix} x_t + W^{ih} h_{t-1}) \\
f_t &= \sigma(W^{fx} x_t + W^{fh} h_{t-1}) \\
o_t &= \sigma(W^{ox} x_t + W^{oh} h_{t-1}) \\
g_t &= \tanh(W^{gx} x_t + W^{gh} h_{t-1}) \\
c_t &= c_{t-1} \odot f_t + g_t \odot i_t \\
h_t &= \tanh(c_t) \odot o_t
\end{aligned}
\tag{2}
\]

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates respectively. Each gate uses a sigmoid ($\sigma$) function applied over the sum of input $x_t$ and previous hidden state $h_{t-1}$ (multiplied with their weight matrices $W$). $g_t$ denotes the candidate state computed over a tanh function on the input and previous hidden state. $W^{ix}$, $W^{fx}$, $W^{ox}$, $W^{gx}$ are weight matrices used with input $x_t$, while $W^{ih}$, $W^{fh}$, $W^{oh}$, and $W^{gh}$ are used with hidden states for each gate and candidate state. The memory cell $c_t$ multiplies ($\odot$, element-wise) the old memory cell $c_{t-1}$ with the forget gate ($f_t$) and adds the candidate state ($g_t$) multiplied with the input gate ($i_t$). The hidden state is given by a tanh function applied to the memory cell $c_t$, multiplied with the output gate ($o_t$).

4) Gated Recurrent Unit: A variation on LSTM was introduced by Cho et al. [17] as the Gated Recurrent Unit (GRU). Using update and reset gates, GRUs are able to control the amount of information within a unit (without a separate memory cell as with LSTM). GRUs can formally be defined as

\[
\begin{aligned}
z_t &= \sigma(W^{zx} x_t + W^{zh} h_{t-1}) \\
r_t &= \sigma(W^{rx} x_t + W^{rh} h_{t-1}) \\
\tilde{h}_t &= \tanh(W^{x} x_t + r_t \odot W^{h} h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\tag{3}
\]

where $z_t$ and $r_t$ are the update and reset gates respectively, and $\tilde{h}_t$ is the candidate activation/hidden state.

Similar to the LSTM architecture, GRUs benefit from the additive properties in their network to remember long term dependencies and solve the vanishing gradient problem. Since GRUs do not utilize an output gate, they expose the entire contents of their hidden state to the network. The lack of a separate memory cell also makes GRUs more efficient in comparison to LSTMs.

D. Deep learning Architectures

Below, we describe three architectures (window-based, word-based, and word+character based) to be used in conjunction with the different models described above.

1) Window-based: In this architecture, the window-based input ($i_v$) consists of feature vectors ($f_v$) for each word/term ($t$) within an encoded sentence. Each $f_v$ consisted of the following attributes:

\[
f_v = \langle t, n_t, t_1, t_{-1}, t_C, t_{aC}, t_{al}, t_{p0}, t_{p0-1}, t_{p0-2}, t_{s0}, t_{s0-1}, t_{s0-2}, t_N, t_P \rangle
\tag{4}
\]

where,
$t$ is the term,
$n_t$ is the number of terms in the sentence,
$t_1$ is a boolean value indicating if the term is the first term in the sentence,
$t_{-1}$ is 1 if the term is the last term in the sentence and 0 otherwise,
$t_C$ is 1 if the first letter in $t$ is uppercase,
$t_{aC}$ is 1 if all letters in $t$ are uppercase,
$t_{al}$ is 1 if all letters in $t$ are lower case,
$t_{p0}$, $t_{p0-1}$, $t_{p0-2}$ record character prefixes of $t$ at various window sizes,
$t_{s0}$, $t_{s0-1}$, $t_{s0-2}$ record character suffixes of $t$ at various window sizes,
$t_N$ and $t_P$ are the next and previous terms respectively.

2) Word-based: Each word and its corresponding annotation labels (tags) are encoded with integer values, derived from unique words and annotations present in the corpus. The dataset was based on unigram annotations, i.e., it only uses ontology annotations where a single word in text maps to an ontology concept.

In word-based architectures (Figure 1), the input ($X^{W}_{tr}$) is provided to an Embedding layer which converts the input into dense vectors of 100 dimensions. The output vectors are then fed to a bidirectional model (RNN/GRU/LSTM) consisting of 150 hidden units. The output from the model goes to a dense perceptron layer using ReLU activation which also employs a 0.6 Dropout. The output is further fed into a CRF layer which looks for correlations between annotations in close sequences to generate the predictions ($y_{pr}$).

Fig. 1. Word-based architecture using bidirectional RNN/GRU/LSTM models

3) Character+Word Based: A Character+Word based architecture is similar to the word based architecture described above. In addition to word-based inputs ($X^{W}_{tr}$), it also takes advantage of characters ($X^{C}_{tr}$) within words to make predictions.

E. Model development

We developed a new deep learning architecture that uses a Character+word based architecture coupled with two bidirectional Gated Recurrent Units. Our architecture (Figure 2) consists of character level input ($X^{C}_{tr}$) provided to an Embedding layer ($E_1$) which compresses the dimensions of characters to the number of unique annotations in the corpus (NTags). The output of Embedding layer $E_1$ is fed to a bidirectional GRU ($BiGRU_1$) layer with 150 units followed by a 60% output drop in a Dropout layer ($D_1$). Simultaneously, the word-level input ($X^{W}_{tr}$) was provided to a second Embedding layer ($E_2$) with 30 dimensions. The output from $E_2$ was concatenated with the output from the first Dropout layer $D_1$ and fed through a second Dropout layer ($D_2$) with a 30% drop. Output from $D_2$ was fed into a second bidirectional GRU layer ($BiGRU_2$) consisting of 150 units.

The above model was tested with and without a final CRF layer, leading to two new configurations: CW-BiGRU-CRF and CW-BiGRU. The models were run for 15 epochs with a batch size of 32 instances.

Fig. 2. Character+word based architecture using two bidirectional GRU models.

F. Model Comparison

We compared the performance of our new Character+word based GRU architecture and the two models developed therein (CW-BiGRU-CRF, CW-BiGRU) (Section IV-E) to six state of the art models that have been used in prior work. Below, we specify the component details of each of the six prior models that have been evaluated.

1) MLP: Multi layer perceptrons were used with a window based architecture to create a three layered (input, hidden, output) MLP model. The input and the hidden layer consisted of 512 perceptrons with a Rectified Linear Unit (ReLU) activation function while the output layer consisted of perceptrons equal to the number of unique annotations in the corpus (NTags). 20% Dropout was used for the hidden and output layers to prevent overfitting of the data. Categorical cross-entropy was used for calculating the loss function and NAdam (Adam RMSprop with Nesterov momentum) was used as the optimizer function. Each of the feature vectors (from the training data) was fed into the MLP architecture for 15 epochs with a batch size of 256.

2) BiRNN-CRF: The BiRNN-CRF model uses a word-based input coupled with a BiRNN model and ending with a CRF model. Similar to the BiRNN architecture (Figure 1), the BiRNN-CRF model consists of a 100 dimension Embedding layer followed by a BiRNN with 150 units followed by a 0.6 Dropout layer. The output of the 0.6 Dropout layer is fed to a CRF which generates the predicted output.

3) BiLSTM-CRF: The BiLSTM-CRF model is identical to the BiRNN-CRF except that it uses an LSTM in place of the RNN.

4) BiGRU-CRF: The BiGRU-CRF model is identical to BiRNN-CRF and BiLSTM-CRF except that it uses a Gated Recurrent Unit in place of the RNN or LSTM.

5) CW-BiLSTM: The CW-BiLSTM model is similar to the CW-BiGRU model described above (see Section IV-E) except that the BiGRU is replaced with a BiLSTM.

6) CW-BiLSTM-CRF: The CW-BiLSTM-CRF model is developed by adding a CRF layer at the end of the CW-BiLSTM model pipeline, so that the output of the CW-BiLSTM model is fed to a CRF layer to generate the final predictions.

G. Parameter Tuning

The GO annotation data was split into training and test sets using a 70:30 ratio. The training set was used to tune the following parameters for all models. Multiple architecture parameters, namely 1) Number of layers in MLP (along with number of perceptrons), 2) Number of units in RNN/GRU/LSTM, 3) Embedding Dimensions for Characters and Words, and 4) Optimization functions, were evaluated for model performance. A grid search was explored, where each architecture was evaluated for different combinations of the parameters. In each case, model performance metrics were recorded in the form of Precision, Recall, F1-score, and Jaccard similarity.

H. Experiments to predict ontology annotations

The largest number of annotations in the CRAFT corpus came from the Gene Ontology. So, we first used the GO annotations to train and test the suite of 8 models described above. Subsequently, we applied the best model from these experiments to annotate the CRAFT corpus with the other four ontologies (CHEBI, Cell, Protein, and Sequence corpora).

The Root-Mean-Square propagation (RMSProp) optimizer was used to test the performance of the different models. A batch size of 32 along with 15 epochs was used for model training. Performance characteristics in terms of train-test loss (calculated using the CRF function), prediction precision, recall, and F1-score, along with the mean semantic similarity score, were recorded for each model.

V. RESULTS AND DISCUSSION

The CRAFT corpus contains 67 full length papers with annotations from five ontologies (GO, CHEBI, Cell, Protein, and Sequence). For each of these ontologies, we extracted all sentences across the 67 papers with at least one annotation for the ontology. The largest number of annotations came from the GO (Table I) while the Cell ontology accounted for the lowest number of annotations.

TABLE I. CHARACTERISTICS OF THE CRAFT CORPUS: NUMBER OF SENTENCES WITH AT LEAST ONE ANNOTATION, NUMBER OF UNIQUE ANNOTATIONS (UNIGRAMS ONLY), AND NUMBER OF UNIQUE WORDS IN THE CORPUS.

Dataset     Number of    Number of Unique    Number of Unique
            Sentences    Annotations         Words in the Corpus
GO          17,921       359                 9,571
Sequence    15,606       156                 7,262
Protein     12,621       546                 5,153
Chebi       11,109       309                 3,127
Cell         9,088        68                 3,042

Figure 3 shows the loss and accuracy trends for each model on the GO annotation data. The goal of the models is to minimize loss while increasing accuracy as the number of epochs increases.

First, we see that our CW-BiGRU model shows improvement in both training and validation accuracy as the number of epochs increases. Correspondingly, we observe a decrease in training and validation loss, indicating that the model is able to self-improve with each subsequent epoch.

The CW-BiGRU-CRF model initially shows the same accuracy improvement as the CW-BiGRU model, but later epochs result in a divergence in the training and validation accuracy, indicating that the model might be prone to overfitting. While there is a substantial decrease in training loss, a similar decrease is not observed in validation loss.

CW-BiLSTM shows similar trends to CW-BiGRU. CW-BiLSTM-CRF training and validation accuracy increase similarly until a certain point, after which the validation accuracy drops and diverges sharply from the training curve, indicating a case of overfitting.

BiGRU-CRF and BiRNN-CRF models show substantial improvement in accuracy with increasing epochs. However, BiRNN-CRF shows divergence in the loss patterns. Similar to CW-BiLSTM-CRF, BiLSTM-CRF also shows signs of overfitting in the accuracy patterns. MLP is the worst performing model, with very minor improvements in validation accuracy as the number of epochs increases, indicating that the model is unable to improve itself with each subsequent epoch.
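The divergence pattern discussed above (training loss keeps falling while validation loss starts to rise) can be checked programmatically. The following Python sketch is only an illustration and is not part of the study's pipeline: the function name `divergence_epoch`, the `patience` parameter, and the loss values are hypothetical and were chosen purely for demonstration.

```python
def divergence_epoch(train_loss, val_loss, patience=2):
    """Return the first epoch of a run of `patience` consecutive epochs in which
    training loss fell while validation loss rose, or None if no such run exists.

    This is a simple heuristic (an assumption of this sketch), not the
    criterion used in the paper.
    """
    run = 0  # length of the current run of diverging epochs
    for epoch in range(1, min(len(train_loss), len(val_loss))):
        train_falling = train_loss[epoch] < train_loss[epoch - 1]
        val_rising = val_loss[epoch] > val_loss[epoch - 1]
        run = run + 1 if (train_falling and val_rising) else 0
        if run >= patience:
            return epoch - patience + 1
    return None


# Hypothetical per-epoch losses, for illustration only.
overfit_train = [0.80, 0.60, 0.50, 0.40, 0.30, 0.20]
overfit_val = [0.80, 0.65, 0.60, 0.62, 0.66, 0.70]
healthy_train = [0.80, 0.60, 0.50, 0.40]
healthy_val = [0.80, 0.70, 0.60, 0.50]

print(divergence_epoch(overfit_train, overfit_val))
print(divergence_epoch(healthy_train, healthy_val))
```

A heuristic like this only flags candidate epochs; whether a model is actually overfitting still has to be judged from the full accuracy and loss curves.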
It is clear that the CW-BiGRU models are able to outperform the other models by improving accuracy and reducing loss with each epoch without overfitting.

[Figure 3: eight panels (CW-BiGRU, CW-BiGRU-CRF, CW-BiLSTM, CW-BiLSTM-CRF, BiGRU-CRF, BiRNN-CRF, BiLSTM-CRF, MLP), each plotting Accuracy and Loss against Number of Epochs for Training and Validation data.]

Fig. 3. Comparison of model loss and accuracy on training and validation data using Gene Ontology annotations

A large proportion of input data is not annotated to GO terms but to a tag 'O' indicating the absence of an annotation. In addition to accurately predicting GO annotations, the model also needs to accurately predict the absence of an annotation. However, given the disproportionate amount of data pertaining to the absence of annotations, the models were observed to predict the absence of annotations remarkably accurately in comparison to predicting presence.

To provide a more conservative view of the models’ performance, we report Precision, Recall, F-1 Score, and Jaccard similarity (Table II) only on data indicating presence of ontology terms, i.e. text annotated with an ontology term. Unlike the accuracy measurements above, the metrics below do not take into account the models’ performance at identifying the absence of annotations, but rather focus on the ability to identify annotations when they are present in the Gold Standard.

TABLE II. PRECISION, RECALL, F1, AND JACCARD SIMILARITY SCORES FOR THE EIGHT MODELS ON CRAFT GENE ONTOLOGY ANNOTATION DATA.

Model            Precision   Recall   F1     Jaccard Similarity
CW-BiGRU         0.84        0.84     0.83   0.84
CW-BiLSTM        0.80        0.82     0.80   0.83
CW-BiLSTM-CRF    0.80        0.82     0.80   0.82
CW-BiGRU-CRF     0.77        0.80     0.78   0.82
BiGRU-CRF        0.75        0.77     0.75   0.78
BiRNN-CRF        0.72        0.74     0.72   0.75
BiLSTM-CRF       0.70        0.70     0.70   0.71
MLP              0.65        0.60     0.61   0.61

These results (Table II and Figure 3) show that our model (CW-BiGRU) outperforms the other 7 models in all four metrics. Our model outperforms the best among the other 7 models (CW-BiLSTM) by 4 percentage points in Precision, 2 in Recall, 3 in F1 score, and 1 in Jaccard similarity.

Additionally, we observe that character-word based models (CW-BiGRU, CW-BiLSTM, CW-BiLSTM-CRF, CW-BiGRU-CRF) outperform models that use only word embeddings. Among the character-word based models, surprisingly, the addition of an extra CRF layer (CW-BiLSTM-CRF, CW-BiGRU-CRF) either fails to improve performance (e.g., CW-BiLSTM vs. CW-BiLSTM-CRF) or leads to a decline in performance (e.g., CW-BiGRU vs. CW-BiGRU-CRF) as compared to not using a CRF end layer (CW-BiLSTM, CW-BiGRU). The MLP model shows substantially lower performance as compared to the other models across all four metrics. The Accuracy and Loss plots (Figure 3) suggest that the decline in performance when adding a CRF layer is due to potential overfitting.

We explored how predictions from our best model, CW-BiGRU, diverge from the Gold Standard. We found that the majority of predictions (89.25%) are an exact match for the CRAFT annotations. Surprisingly, only a small proportion of predictions are partial matches (2.45%). 8.26% of the model’s predictions are false negatives while 6.38% are false positives. We hypothesize that one of the primary reasons for false negatives might be the lack of enough training instances for those particular GO annotations.

Finally, we applied the best performing model from the above evaluation (CW-BiGRU) and tested it on data from four other ontologies. Interestingly, the model shows better prediction performance on the other ontologies as compared to GO despite the substantially smaller training datasets (Table III).

TABLE III. PRECISION, RECALL, F1, AND JACCARD SIMILARITY SCORES FOR THE CW-BIGRU MODEL ON ANNOTATIONS FROM FIVE ONTOLOGIES IN CRAFT.

Model       Ontology   Precision   Recall   F1     Jaccard Similarity
CW-BiGRU    Cell       0.92        0.92     0.92   0.925
CW-BiGRU    Protein    0.91        0.90     0.90   0.917
CW-BiGRU    CHEBI      0.86        0.87     0.86   0.882
CW-BiGRU    GO         0.84        0.84     0.83   0.843
CW-BiGRU    Sequence   0.83        0.86     0.84   0.864

VI. CONCLUSIONS AND FUTURE WORK

The data used in this study was limited to single words annotated to ontology concepts (unigrams). Next, we will explore more robust models including n-grams to account for sequences of words tagged with an annotation. Future work will also include models that can be trained to weight the prediction of some target classes higher than others. These models would be able to prioritize presence prediction of annotations as compared to the absence of an annotation.

This study demonstrates the utility of deep learning approaches for automated ontology-based curation of scientific literature. Specifically, we show that models based on Gated Recurrent Units are more powerful and accurate at annotation prediction as compared to the LSTM based models in prior work. Our findings indicate that deep learning is a promising new direction for ontology-based text mining, and can be used for more sophisticated annotation tasks (such as phenotype curation) that build upon Named Entity Recognition.

REFERENCES

[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, p. 25, 2000.
[2] W. Dahdul, T. A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee, “Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy,” Database, vol. 2015, 2015.
[3] J. Clement, S. Nigam, Y. Cherie, M. Musen, C. Callendar, and M. Storey, “Ncbo annotator: semantic annotation of biomedical data,” in International Semantic Web Conference, Poster and Demo session, 2009.
[4] C. J. Mungall, J. A. McMurry, S. Köhler, J. P. Balhoff, C. Borromeo, M. Brush, S. Carbon, T. Conlin, N. Dunn, M. Engelstad et al., “The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species,” Nucleic acids research, vol. 45, no. D1, pp. D712–D722, 2016.
[5] I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, “Text mining and ontologies in biomedicine: making sense of raw text,” Briefings in bioinformatics, vol. 6, no. 3, pp. 239–251, 2005.
[6] H. Cui, W. Dahdul, A. T. Dececchi, N. Ibrahim, P. Mabee, J. P. Balhoff, and H. Gopalakrishnan, “Charaparser+ eq: Performance evaluation without gold standard,” Proceedings of the Association for Information Science and Technology, vol. 52, no. 1, pp. 1–10, 2015.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
[8] J. Lafferty, “Conditional random fields: Probabilistic models for segmenting and labelling sequence data,” in ICML, 2001, 2001.
[9] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, and U. Leser, “Deep learning with word embeddings improves biomedical named entity recognition,” Bioinformatics, vol. 33, no. 14, pp. i37–i48, 2017.
[10] C. Lyu, B. Chen, Y. Ren, and D. Ji, “Long short-term memory rnn for biomedical named entity recognition,” BMC bioinformatics, vol. 18, no. 1, p. 462, 2017.
[11] X. Wang, Y. Zhang, X. Ren, Y. Zhang, M. Zitnik, J. Shang, C. Langlotz, and J. Han, “Cross-type biomedical named entity recognition with deep multi-task learning,” arXiv preprint arXiv:1801.09851, 2018.
[12] M. Bada, M. Eckert, D. Evans, K. Garcia, K. Shipley, D. Sitnikov, W. A. Baumgartner, K. B. Cohen, K. Verspoor, J. A. Blake et al., “Concept annotation in the craft corpus,” BMC bioinformatics, vol. 13, no. 1, p. 161, 2012.
[13] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M. Couto, “Semantic similarity in biomedical ontologies,” PLoS computational biology, vol. 5, no. 7, p. e1000443, 2009.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
[15] L. Faucett, “Fundamentals of neural networks,” Architecture, Algorithms, 1994.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
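As a supplementary illustration of the Jaccard similarity defined in Section IV-B, the following Python sketch computes J(A, B) from the subsumer sets of two terms. It is a minimal sketch under stated assumptions, not the implementation used in this study: the toy ontology, the term names, and the helper names `subsumers` and `jaccard` are hypothetical, and the sketch assumes that each term counts as a subsumer of itself so that an exact match scores 1.

```python
def subsumers(term, parents):
    """Return the set containing `term` and all of its ancestors.

    `parents` maps each term to a list of its direct superclasses.
    Including the term itself is an assumption of this sketch.
    """
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def jaccard(a, b, parents):
    """Jaccard similarity: |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = subsumers(a, parents), subsumers(b, parents)
    return len(sa & sb) / len(sa | sb)


# Hypothetical toy ontology (not real GO structure), child -> direct parents.
toy_parents = {
    "perception": ["root"],
    "visual_perception": ["perception"],
    "auditory_perception": ["perception"],
}

print(jaccard("visual_perception", "visual_perception", toy_parents))    # exact match
print(jaccard("visual_perception", "perception", toy_parents))           # term vs. super-class
print(jaccard("visual_perception", "auditory_perception", toy_parents))  # sibling terms
```

In this toy example a prediction of the direct super-class scores higher than a prediction of a sibling term, which is the kind of partial credit that Precision and Recall alone do not capture.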