<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Taking a Dive: Experiments in Deep Learning for Automatic Ontology-based Annotation of Scientific Literature</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Prashanti</forename><surname>Manda</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of North Carolina at Greensboro</orgName>
								<address>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Equal contributions</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucas</forename><surname>Beasley</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of North Carolina at Greensboro</orgName>
								<address>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Somya</forename><forename type="middle">D</forename><surname>Mohanty</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of North Carolina at Greensboro</orgName>
								<address>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Equal contributions</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Taking a Dive: Experiments in Deep Learning for Automatic Ontology-based Annotation of Scientific Literature</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">BDD548F575628E718CF7CF89759133F2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:05+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning to ontology-based curation is a relatively new area, and prior work has focused on a limited set of models.</p><p>Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model's performance. We also compare our model to seven models from prior work. We use four metrics (Precision, Recall, F1 score, and a semantic similarity metric, Jaccard similarity) to compare our model's output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, an 83% F1 score, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models as compared to word-only inputs.</p><p>These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. INTRODUCTION</head><p>Ontology-based data representation has been widely adopted in data-intensive fields such as biology and biomedicine due to the need for large-scale, computationally amenable data <ref type="bibr" target="#b0">[1]</ref>. However, the majority of ontology-based data generation relies on manual literature curation, a slow and tedious process <ref type="bibr" target="#b1">[2]</ref>. Natural language processing and text mining methods have been developed as a solution for scalable ontology-based data curation <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>One of the most important tasks for annotating scientific literature with ontology concepts is Named Entity Recognition (NER). In the context of ontology-based annotation, NER can be described as recognizing ontology concepts from text <ref type="bibr" target="#b4">[5]</ref>. Outside the scope of ontology-based annotation, NER has been applied to biomedical and biological literature for recognizing genes, proteins, diseases, etc. <ref type="bibr" target="#b4">[5]</ref>.</p><p>The large majority of ontology-driven NER techniques rely on lexical and syntactic analysis of text in addition to machine learning for recognizing and tagging ontology concepts <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b5">6]</ref>. In recent years, deep learning has been introduced for NER of biological entities from literature <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. However, the majority of prior work has focused on a limited set of models, particularly the Long Short-Term Memory (LSTM) model (e.g. 
<ref type="bibr" target="#b6">[7]</ref>).</p><p>Here, we present a new deep learning architecture that utilizes Gated Recurrent Units (GRU) while taking advantage of word and character encodings from the annotation training data to recognize ontology concepts from text. We evaluate our model against seven deep learning models used in prior work and show that it outperforms the state of the art at the task of ontology-based NER.</p><p>We use the Colorado Richly Annotated Full-Text (CRAFT) corpus <ref type="bibr" target="#b11">[12]</ref> as a Gold Standard reference to develop and evaluate the deep learning models. The CRAFT corpus contains 67 open-access, full-length biomedical articles annotated with concepts from several ontologies (such as the Gene Ontology, Protein Ontology, and Sequence Ontology). We use four metrics: 1) Precision, 2) Recall, 3) F1 score, and 4) Jaccard semantic similarity to compare each model's performance to the Gold Standard.</p><p>Precision and Recall are traditionally used to assess the performance of information retrieval systems. However, these metrics do not take into account the notion of partial information retrieval, which is important for ontology-based annotation retrieval. Sometimes, an NLP system might not retrieve the same ontology concept as the gold standard but a related concept (a sub-class or super-class). To assess the performance of an NLP system accurately, we need semantic similarity metrics that can measure different degrees of semantic relatedness between ontology concepts <ref type="bibr" target="#b12">[13]</ref>. Here, we use Jaccard similarity to compare annotations from each deep learning model to the gold standard. Jaccard similarity assesses the similarity between two ontology terms based on the ontological distance between them: the closer two terms are, the more similar they are considered to be <ref type="bibr" target="#b12">[13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. RELATED WORK</head><p>The application of deep learning to ontology-based Named Entity Recognition is a nascent area with relatively little prior work. Habibi et al. <ref type="bibr" target="#b8">[9]</ref> studied entity recognition on biomedical literature using a long short-term memory network with a conditional random field (LSTM-CRF) and showed that the method outperformed other NER tools that either do not use deep learning or use deep learning methods without word embeddings. Lyu et al. <ref type="bibr" target="#b9">[10]</ref> also explored LSTM-based models enhanced with word and character embeddings. They do not evaluate other deep learning models but present results based only on LSTM with word embeddings. Wang et al. <ref type="bibr" target="#b10">[11]</ref> also propose an LSTM-based method for recognizing biomedical entities from literature. Similar to the above studies, Wang et al. show that a bidirectional LSTM method used with a Conditional Random Field (CRF) and word embeddings outperforms other methods.</p><p>The striking difference between these prior studies and our work is that the majority of prior literature focuses on LSTM-based methods along with CRF and word embeddings. The potential of other deep learning models, such as Recurrent Neural Networks and Gated Recurrent Units, at the task of ontology-based NER remains unexplored, presenting a unique need and opportunity. Our study aims to fill this knowledge gap. In addition, all the above studies focus on non-ontology-based NER for entities such as genes and disease names. In contrast, our study's focus is on recognizing ontology concepts within text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. METHODS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Data Preprocessing</head><p>Annotation files for the 67 papers in CRAFT were cleaned to remove punctuation symbols (except for periods at the ends of sentences), special symbols, and non-ASCII characters. Annotations for the GO, CHEBI, Cell, Protein, and Sequence ontologies were converted from the cleaned files to separate ontology-specific text files that represent the presence or absence of ontology terms. For each ontology, every sentence containing at least one annotation from that ontology was represented using two lines in the ontology-specific text file. The first of these two lines contained an array with each word in the sentence. The second contained an ordered encoding corresponding to the words in the first line. Each encoding is either an ontology concept ID, if the corresponding word was annotated in CRAFT, or an O, if the corresponding word was not annotated.</p><p>For example, the sentence "Rod and cone photoreceptors subserve vision under dim and bright light conditions respectively", where the word "vision" was annotated to GO ID "GO:0007601 (perception of sight)", would be represented using the two lines below:</p><p>• [Rod, and, cone, photoreceptors, subserve, vision, under, dim, and, bright, light, conditions, respectively]</p><p>• [O, O, O, O, O, GO:0007601, O, O, O, O, O, O, O]</p><p>Only annotations to single words (unigrams) were included in these preprocessed files; if an annotation was made in CRAFT to a phrase containing more than one word, it was ignored in the preprocessed data.</p></div>
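The two-line encoding described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' preprocessing code; `encode_sentence` and the single-word annotation dictionary are our own names.

```python
def encode_sentence(words, word_to_concept):
    """Build the two parallel lines: the word array, and an encoding array
    holding an ontology concept ID where a word is annotated, else 'O'."""
    return words, [word_to_concept.get(w, "O") for w in words]

sentence = ("Rod and cone photoreceptors subserve vision under dim and "
            "bright light conditions respectively").split()
# only "vision" carries an annotation (GO:0007601, per the example above)
words, tags = encode_sentence(sentence, {"vision": "GO:0007601"})
```

Multi-word (n-gram) annotations would not fit this per-word scheme, which is why they were dropped during preprocessing.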
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance evaluation metrics</head><p>Precision, Recall, F1 score, and Jaccard similarity were used to evaluate the performance of the models. The Jaccard similarity (J) of two ontology concepts (in this case, annotations) (A, B) in an ontology is defined as the ratio of the number of classes in the intersection of their subsumers over the number of classes in the union of their subsumers <ref type="bibr" target="#b12">[13]</ref>.</p><formula xml:id="formula_0">J(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|</formula><p>where S(A) is the set of classes that subsume A. Jaccard similarity ranges from 0 (no similarity) to 1 (exact match).</p></div>
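In code, the metric is a straightforward set ratio over subsumer sets. The sketch below is illustrative: the two subsumer sets are simplified, hypothetical ancestor lists (not taken from the actual GO hierarchy), chosen only to show how a shared ancestry yields partial credit.

```python
def jaccard(subsumers_a, subsumers_b):
    """Jaccard similarity of two ontology classes, given the sets of
    classes that subsume each (including the class itself)."""
    a, b = set(subsumers_a), set(subsumers_b)
    return len(a & b) / len(a | b)

# hypothetical, simplified subsumer sets for two sibling concepts
vision = {"perception of sight", "sensory perception",
          "nervous system process", "biological process"}
hearing = {"perception of sound", "sensory perception",
           "nervous system process", "biological process"}
```

An exact match scores 1.0, while the two sibling concepts above share three of five classes in their union, scoring 0.6; this is how a predicted sub-class or super-class of the gold-standard concept earns partial credit.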
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Deep learning models</head><p>Below, we describe four deep learning models: Multi-Layer Perceptrons, Recurrent Neural Networks, Long Short-Term Memory, and Gated Recurrent Units. Next, we describe three architectures (window-based, word-based, and character+word based) that can be used in conjunction with these models. Finally, we describe our new model, which combines the character+word based architecture with Gated Recurrent Units, and six models used in prior work.</p><p>1) Multi-Layer Perceptron (MLP): A Multi-Layer Perceptron (MLP) <ref type="bibr" target="#b13">[14]</ref> is a feed-forward deep neural network model which consists of an input layer, one or more hidden layers, and an output layer, each consisting of a number of perceptrons. A single perceptron computes the output as γ = ϕ(</p><formula xml:id="formula_1">Σ n i=1 w i x i + b),</formula><p>where w is the weight vector, x is the provided input, b is the bias, and ϕ is the activation function. The weights and biases of each perceptron in the layers are adjusted using backpropagation to minimize prediction error. 2) Recurrent Neural Network: A Recurrent Neural Network (RNN) <ref type="bibr" target="#b14">[15]</ref> is an adaptation of feed-forward neural networks, where the history of the input sequence is taken into consideration for future prediction. Given an input sequence &lt; x 0 , x 1 , x 2 , • • • x i &gt;, the hidden state (h t ) of an RNN is updated as follows:</p><formula xml:id="formula_2">h t = 0 if t = 0; h t = σ(W hh h t−1 + W hx x t ) if t &gt; 0; y t = softmax(W s h t ) (1)</formula><p>where x t is the input provided to the hidden state h t at time t, which is updated using a sigmoid function σ. σ is calculated over the previous time state of the network, given by h t−1 , and the current input x t . W hh , W hx , and W s are the weights computed over training. 
The network can then produce an output prediction &lt; y 0 , y 1 , y 2 , • • • y j &gt; using a softmax function on the hidden state h t .</p><p>A bidirectional Recurrent Neural Network (BiRNN) is an RNN where the input data is fed to the neural network twice: once in forward order and again in reverse order.</p><p>3) Long Short-Term Memory: While RNNs are effective at learning temporal patterns, they suffer from a vanishing gradient problem where long-term dependencies are lost. A solution to the problem was proposed by Hochreiter et al. <ref type="bibr" target="#b15">[16]</ref> using a variation of RNNs called Long Short-Term Memory (LSTM). LSTMs use a memory cell (c t ) to keep track of long-term relationships in text. Using a gated architecture (input, output, and forget gates), LSTMs are able to modulate the exposure of the memory cell by regulating the gates. LSTMs can be defined as:</p><formula xml:id="formula_3">i t = σ(W ix x t + W ih h t−1 ) f t = σ(W f x x t + W f h h t−1 ) o t = σ(W ox x t + W oh h t−1 ) g t = tanh(W gx x t + W gh h t−1 ) c t = c t−1 ⊙ f t + g t ⊙ i t h t = tanh(c t ) ⊙ o t (2)</formula><p>where i t , f t , and o t are the input, forget, and output gates respectively. Each gate applies a sigmoid (σ) function over the sum of the input x t and the previous hidden state h t−1 (each multiplied with its weight matrix W ). g t denotes the candidate state, computed with a tanh function over the input and previous hidden state. W ix , W f x , W ox , and W gx are weight matrices used with the input x t , while W ih , W f h , W oh , and W gh are used with the hidden state for each gate and the candidate state. The memory cell c t multiplies the old memory cell c t−1 element-wise (⊙) with the forget gate (f t ) and adds the candidate state (g t ) multiplied with the input gate (i t ). 
The hidden state is given by a tanh function applied to the memory cell c t , multiplied with the output gate (o t ).</p><p>4) Gated Recurrent Unit: A variation on the LSTM was introduced by Cho et al. <ref type="bibr" target="#b16">[17]</ref> as the Gated Recurrent Unit (GRU). Using update and reset gates, GRUs are able to control the amount of information within a unit (without a separate memory cell as in the LSTM). GRUs can formally be defined as:</p><formula xml:id="formula_4">z t = σ(W zx x t + W zh h t−1 ) r t = σ(W rx x t + W rh h t−1 ) h̃ t = tanh(W x x t + r t ⊙ W h h t−1 ) h t = z t ⊙ h t−1 + (1 − z t ) ⊙ h̃ t (3)</formula><p>where z t and r t are the update and reset gates respectively, and h̃ t is the candidate activation/hidden state.</p><p>Similar to the LSTM architecture, GRUs benefit from the additive properties of their network to remember long-term dependencies and solve the vanishing gradient problem. Since GRUs do not utilize an output gate, they are able to write the entire contents of their memory to the network. The lack of a separate memory cell also makes GRUs more efficient in comparison to LSTMs.</p></div>
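As a concrete illustration of Eq. (3), a single GRU step can be sketched in NumPy. This is an illustrative sketch with small random weights, not the authors' implementation; the weight-matrix names simply mirror the symbols in the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    """One GRU step per Eq. (3): update gate z_t, reset gate r_t,
    candidate state, then the interpolated new hidden state h_t."""
    z = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)            # z_t
    r = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)            # r_t
    h_cand = np.tanh(W["x"] @ x_t + r * (W["h"] @ h_prev))   # candidate
    return z * h_prev + (1.0 - z) * h_cand                   # h_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
# matrices ending in "x" act on the input, the rest on the hidden state
W = {k: 0.1 * rng.standard_normal((n_hid, n_in if k.endswith("x") else n_hid))
     for k in ("zx", "zh", "rx", "rh", "x", "h")}

h = np.zeros(n_hid)                          # h_0 = 0, as in Eq. (1)
for x_t in rng.standard_normal((5, n_in)):   # a length-5 input sequence
    h = gru_step(x_t, h, W)
```

Because h_t is a convex combination of h_{t−1} and a tanh-bounded candidate, every component of the hidden state stays strictly inside (−1, 1); this additive interpolation is what lets gradients flow over long sequences.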
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Deep learning Architectures</head><p>Below, we describe three architectures (window-based, word-based, and character+word based) to be used in conjunction with the different models described above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) Window-based:</head><p>In this architecture, the window-based input (i v ) consists of feature vectors (f v ) for each word/term (t) within an encoded sentence. Each f v consists of the attributes defined in the feature vector of Eq. (4) below.</p><p>2) Word-based: Each word and its corresponding annotation labels (tags) are encoded with integer values derived from the unique words and annotations present in the corpus. The dataset was based on unigram annotations, which only use ontology annotations where a single word in text maps to an ontology concept.</p><formula xml:id="formula_5">f v =&lt; t, n t , t 1 , t −1 , t C , t a C , t a l , t p 0 , t p 0−1 , t p 0−2 , t s 0 , t s 0−1 , t s 0−2 ,</formula><p>In word-based architectures (Figure <ref type="figure" target="#fig_0">1</ref>), the input (X W tr ) is provided to an Embedding layer, which converts the input into dense vectors of 100 dimensions. The output vectors are then fed to a bidirectional model (RNN/GRU/LSTM) consisting of 150 hidden units. The output from the model goes to a dense perceptron layer using ReLU activation, which also employs a 0.6 Dropout. The output is further fed into a CRF layer, which looks for correlations between annotations in close sequences to generate the predictions (y pr ).</p><p>3) Character+Word Based: A Character+Word based architecture is similar to the word-based architecture described above. In addition to word-based inputs (X W tr ), it also takes advantage of characters (X C tr ) within words to make predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Model development</head><p>We developed a new deep learning architecture that uses a character+word based input coupled with two bidirectional Gated Recurrent Units. Our architecture (Figure <ref type="figure" target="#fig_1">2</ref>) consists of a character-level input (X C tr ) provided to an Embedding layer (E 1 ), which compresses the dimensions of the characters to the number of unique annotations in the corpus (N T ags).</p><p>The output of Embedding layer E 1 is fed to a bidirectional GRU (BiGRU 1 ) layer with 150 units, followed by a 60% output drop in a Dropout layer (D 1 ). Simultaneously, the word-level input (X W tr ) is provided to a second Embedding layer (E 2 ) with 30 dimensions. The output from E 2 is concatenated with the output from the first Dropout layer D 1 and fed through a second Dropout layer (D 2 ) with a 30% drop.</p></div>
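The flow of tensor shapes through this character+word pipeline can be sketched as follows. This is an illustrative NumPy sketch under our own assumptions: the sentence length, padded word length, and character-vocabulary size are invented for the example, and a random projection of mean-pooled character embeddings stands in for the BiGRU 1 recurrent pass.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, max_chars = 12, 10       # words per sentence, padded chars per word (assumed)
n_chars, n_vocab = 60, 9571       # character / word vocabulary sizes (illustrative)
n_tags, word_dim, units = 359, 30, 150  # N_Tags, E2 dimensions, BiGRU units

# E1: embed each character into N_Tags dimensions, per the text above
char_ids = rng.integers(0, n_chars, size=(n_words, max_chars))
E1 = 0.01 * rng.standard_normal((n_chars, n_tags))
char_embedded = E1[char_ids]                         # (12, 10, 359)

# BiGRU_1 over each word's characters; a random projection of the
# mean-pooled character embeddings stands in for the recurrent pass,
# yielding forward+backward features (2 * units)
proj = rng.standard_normal((n_tags, 2 * units))
char_features = char_embedded.mean(axis=1) @ proj    # (12, 300)

# E2: embed each word into 30 dimensions
word_ids = rng.integers(0, n_vocab, size=n_words)
E2 = 0.01 * rng.standard_normal((n_vocab, word_dim))
word_embedded = E2[word_ids]                         # (12, 30)

# concatenate character-level features with word embeddings
# (the D1/D2 dropout layers are omitted in this shape sketch)
merged = np.concatenate([char_features, word_embedded], axis=-1)  # (12, 330)
```

The key design point is the concatenation: each word's representation carries both its subword (character) signal, useful for morphology-heavy biomedical terms, and its corpus-level word embedding.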
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Model Comparison</head><p>We compared the performance of our new character+word based GRU architecture and the two models developed therein (CW-BiGRU-CRF, CW-BiGRU) (Section IV-E) to six state-of-the-art models that have been used in prior work. Below, we specify the component details of each of the six prior models that were evaluated.</p><p>1) MLP: Multi-layer perceptrons were used with a window-based architecture to create a three-layered (input, hidden, output) MLP model. The input and hidden layers consisted of 512 perceptrons with a Rectified Linear Unit (ReLU) activation function, while the output layer consisted of perceptrons equal to the number of unique annotations in the corpus (N T ags). 20% Dropout was used for the hidden and output layers to prevent overfitting of the data. Categorical cross-entropy was used as the loss function and NAdam (Adam with Nesterov momentum) as the optimizer. Each of the feature vectors from the training data was fed into the MLP architecture for 15 epochs with a batch size of 256.</p><p>2) BiRNN-CRF: The BiRNN-CRF model uses a word-based input coupled with a BiRNN model and ends with a CRF model. Similar to the BiRNN architecture (Figure <ref type="figure" target="#fig_0">1</ref>), the BiRNN-CRF model consists of a 100-dimension Embedding layer, followed by a BiRNN with 150 units, followed by a 0.6 Dropout layer. The output of the Dropout layer is fed to a CRF, which generates the predicted output.</p><p>3) BiLSTM-CRF: The BiLSTM-CRF model is identical to the BiRNN-CRF except that it uses an LSTM in place of the RNN.</p><p>4) BiGRU-CRF: The BiGRU-CRF model is identical to BiRNN-CRF and BiLSTM-CRF except that it uses a Gated Recurrent Unit in place of the RNN or LSTM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5) CW-BiLSTM:</head><p>The CW-BiLSTM model is similar to the CW-BiGRU model described above (see Section IV-E) except that the BiGRU is replaced with a BiLSTM.</p><p>6) CW-BiLSTM-CRF: The CW-BiLSTM-CRF model is developed by adding a CRF layer at the end of the CW-BiLSTM pipeline, meaning that the output of the CW-BiLSTM model is fed to a CRF layer to generate the final predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Parameter Tuning</head><p>The GO annotation data was split into training and test sets using a 70:30 ratio. The training set was used to tune the following parameters for all models: 1) the number of layers in the MLP (along with the number of perceptrons), 2) the number of units in the RNN/GRU/LSTM, 3) the embedding dimensions for characters and words, and 4) the optimization functions. A grid search was performed, where each architecture was evaluated for different combinations of the parameters. In each case, model performance was recorded in the form of Precision, Recall, F1 score, and Jaccard similarity. </p></div>
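A grid search of this kind can be sketched in a few lines. This is illustrative only: the parameter values in `grid` mirror numbers mentioned in the text but are our assumptions, and the stand-in scoring lambda replaces the real train-and-evaluate loop.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively evaluate every parameter combination and return the
    best-scoring configuration along with its score."""
    best_cfg, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        cfg = dict(zip(param_grid.keys(), combo))
        score = evaluate(cfg)        # e.g., validation F1 for this config
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# hypothetical grid loosely mirroring the tuned parameters
grid = {"units": [50, 100, 150],
        "embed_dim": [30, 100],
        "optimizer": ["rmsprop", "nadam"]}

# stand-in scorer: pretend larger networks with rmsprop score best
best, score = grid_search(
    grid,
    lambda c: c["units"] + c["embed_dim"] + (10 if c["optimizer"] == "rmsprop" else 0))
```

In practice `evaluate` would train the model on the 70% training split and return the held-out F1 (or Jaccard) score for that configuration.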
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H. Experiments to predict ontology annotations</head><p>The largest number of annotations in the CRAFT corpus came from the Gene Ontology. So, we first used the GO annotations to train and test the suite of eight models described above. Subsequently, we applied the best model from these experiments to annotate the CRAFT corpus with the other four ontologies (CHEBI, Cell, Protein, and Sequence).</p><p>The Root-Mean-Square propagation (RMSProp) optimizer was used to test the performance of the different models. A batch size of 32 along with 15 epochs was used for model training. Performance characteristics in terms of train-test loss (calculated using the CRF function) and prediction Precision, Recall, and F1 score, along with the mean semantic similarity score, were recorded for each model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. RESULTS AND DISCUSSION</head><p>The CRAFT corpus contains 67 full-length papers with annotations from five ontologies (GO, CHEBI, Cell, Protein, and Sequence). For each of these ontologies, we extracted all sentences across the 67 papers with at least one annotation for the ontology. The largest number of annotations came from the GO (Table <ref type="table" target="#tab_2">I</ref>), while the Cell ontology accounted for the lowest number of annotations.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the loss and accuracy trends for each model on the GO annotation data. The goal of the models is to minimize loss while increasing accuracy as the number of epochs increases.</p><p>First, we see that our CW-BiGRU model shows improvement in both training and validation accuracy as the number of epochs increases. Correspondingly, we observe a decrease in training and validation loss, indicating that the model is able to self-improve with each subsequent epoch. The CW-BiGRU-CRF model initially shows the same accuracy improvement as the CW-BiGRU model, but later epochs result in a divergence between the training and validation accuracy, indicating that the model might be prone to overfitting. While there is a substantial decrease in training loss, a similar decrease is not observed in validation loss.</p><p>CW-BiLSTM shows similar trends to CW-BiGRU. CW-BiLSTM-CRF training and validation accuracy increase similarly until a certain point, after which the validation accuracy drops and diverges sharply from the training curve, indicating a case of overfitting.</p><p>The BiGRU-CRF and BiRNN-CRF models show substantial improvement in accuracy with increasing epochs. However, BiRNN-CRF shows divergence in the loss patterns. Similar to CW-BiLSTM-CRF, BiLSTM-CRF also shows signs of overfitting in the accuracy patterns. 
MLP is the worst-performing model, with very minor improvements in validation accuracy as the number of epochs increases, indicating that the model is unable to improve itself with each subsequent epoch.</p><p>It is clear that the CW-BiGRU models are able to outperform the other models by improving accuracy and reducing loss with each epoch without overfitting.</p><p>A large proportion of the input data is not annotated to GO terms but to a tag O indicating the absence of an annotation. In addition to accurately predicting GO annotations, the models also need to accurately predict the absence of an annotation. However, given the disproportionate amount of data pertaining to the absence of annotations, the models were observed to predict the absence of annotations remarkably accurately in comparison to predicting their presence.</p><p>To provide a more conservative view of the models' performance, we report Precision, Recall, F1 score, and Jaccard similarity (Table <ref type="table" target="#tab_3">II</ref>) only on data indicating the presence of ontology terms, i.e., text annotated with an ontology term. Unlike the accuracy measurements above, these metrics do not take into account the models' performance in identifying the absence of annotations, but rather focus on the ability to identify annotations when they are present in the Gold Standard. These results (Table <ref type="table" target="#tab_3">II</ref> and Figure <ref type="figure" target="#fig_2">3</ref>) show that our model (CW-BiGRU) outperforms the other seven models in all four metrics. 
Our model outperforms the best of the other seven models (CW-BiLSTM) by 4% (Precision), 2% (Recall), 3% (F1 score), and 1% (Jaccard similarity).</p><p>Additionally, we observe that character+word based models (CW-BiGRU, CW-BiLSTM, CW-BiLSTM-CRF, CW-BiGRU-CRF) outperform models that use only word embeddings.</p><p>Among the character+word based models, surprisingly, the addition of an extra CRF layer (CW-BiLSTM-CRF, CW-BiGRU-CRF) either fails to improve performance (e.g., CW-BiLSTM vs. CW-BiLSTM-CRF) or leads to a decline in performance (e.g., CW-BiGRU vs. CW-BiGRU-CRF) as compared to not using a CRF end layer (CW-BiLSTM, CW-BiGRU). The MLP model shows substantially lower performance as compared to the other models across all four metrics. The accuracy and loss plots (Figure <ref type="figure" target="#fig_2">3</ref>) suggest that the decline in performance when adding a CRF layer is due to potential overfitting.</p><p>We explored how predictions from our best model, CW-BiGRU, diverge from the Gold Standard. We found that the majority of predictions (89.25%) are an exact match for the CRAFT annotations. Surprisingly, only a small proportion of predictions are partial matches (2.45%). 8.26% of the model's predictions are false negatives, while 6.38% are false positives. We hypothesize that one of the primary reasons for false negatives might be a lack of sufficient training instances for those particular GO annotations.</p><p>Finally, we applied the best-performing model from the above evaluation (CW-BiGRU) to data from the four other ontologies. Interestingly, the model shows better prediction performance on the other ontologies as compared to GO, despite the substantially smaller training datasets (Table <ref type="table" target="#tab_4">III</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS AND FUTURE WORK</head><p>The data used in this study was limited to single words annotated to ontology concepts (unigrams). Next, we will explore more robust models, including n-grams, to account for sequences of words tagged with an annotation. Future work will also include models that can be trained to weight the prediction of some target classes higher than others. These models would be able to prioritize the prediction of the presence of annotations over the absence of an annotation.</p><p>This study demonstrates the utility of deep learning approaches for automated ontology-based curation of scientific literature. Specifically, we show that models based on Gated Recurrent Units are more powerful and accurate at annotation prediction than the LSTM-based models in prior work. Our findings indicate that deep learning is a promising new direction for ontology-based text mining and can be used for more sophisticated annotation tasks (such as phenotype curation) that build upon Named Entity Recognition.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Word-based architecture using bidirectional RNN/GRU/LSTM models</figDesc><graphic coords="4,58.03,53.14,504.00,252.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Character+word based architecture using two bidirectional GRU models.</figDesc><graphic coords="5,58.03,53.14,504.01,252.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Comparison of model loss and accuracy on training and validation data using Gene Ontology annotations</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>t N , t P &gt; (4) where t is the term, n t is the number of terms in the sentence, t 1 is a boolean value indicating whether the term is the first term in the sentence, t −1 is 1 if the term is the last term in the sentence and 0 otherwise, t C is 1 if the first letter in t is uppercase,</figDesc><table><row><cell>t a C is 1 if all letters in t are uppercase, t a l is 1 if all letters in t are lowercase,</cell></row><row><cell>t p 0 , t p 0−1 , t p 0−2 record character prefixes of t at various window sizes,</cell></row><row><cell>t s 0 , t s 0−1 , t s 0−2 record character suffixes of t at various window sizes,</cell></row></table><note>t N and t P are the next and previous terms respectively.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE I .</head><label>I</label><figDesc></figDesc><table><row><cell>Dataset</cell><cell>Number of Sentences</cell><cell>Number of Unique Annotations</cell><cell>Number of Unique Words in the Corpus</cell></row><row><cell>GO</cell><cell>17,921</cell><cell>359</cell><cell>9,571</cell></row><row><cell>Sequence</cell><cell>15,606</cell><cell>156</cell><cell>7,262</cell></row><row><cell>Protein</cell><cell>12,621</cell><cell>546</cell><cell>5,153</cell></row><row><cell>Chebi</cell><cell>11,109</cell><cell>309</cell><cell>3,127</cell></row><row><cell>Cell</cell><cell>9,088</cell><cell>68</cell><cell>3,042</cell></row></table><note>CHARACTERISTICS OF THE CRAFT CORPUS -NUMBER OF SENTENCES WITH AT LEAST ONE ANNOTATION, NUMBER OF UNIQUE ANNOTATIONS (UNIGRAMS ONLY), AND NUMBER OF UNIQUE WORDS IN THE CORPUS.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>TABLE II.</head><label>II</label><figDesc>PRECISION, RECALL, F1, AND JACCARD SIMILARITY SCORES FOR THE EIGHT MODELS ON CRAFT GENE ONTOLOGY ANNOTATION DATA.</figDesc><table><row><cell>Model</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell><cell>Jaccard Similarity</cell></row><row><cell>CW-BiGRU</cell><cell>0.84</cell><cell>0.84</cell><cell>0.83</cell><cell>0.84</cell></row><row><cell>CW-BiLSTM</cell><cell>0.80</cell><cell>0.82</cell><cell>0.80</cell><cell>0.83</cell></row><row><cell>CW-BiLSTM-CRF</cell><cell>0.80</cell><cell>0.82</cell><cell>0.80</cell><cell>0.82</cell></row><row><cell>CW-BiGRU-CRF</cell><cell>0.77</cell><cell>0.80</cell><cell>0.78</cell><cell>0.82</cell></row><row><cell>BiGRU-CRF</cell><cell>0.75</cell><cell>0.77</cell><cell>0.75</cell><cell>0.78</cell></row><row><cell>BiRNN-CRF</cell><cell>0.72</cell><cell>0.74</cell><cell>0.72</cell><cell>0.75</cell></row><row><cell>BiLSTM-CRF</cell><cell>0.70</cell><cell>0.70</cell><cell>0.70</cell><cell>0.71</cell></row><row><cell>MLP</cell><cell>0.65</cell><cell>0.60</cell><cell>0.61</cell><cell>0.61</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>TABLE III.</head><label>III</label><figDesc>PRECISION, RECALL, F1, AND JACCARD SIMILARITY SCORES FOR THE EIGHT MODELS ON ANNOTATIONS FROM FIVE ONTOLOGIES IN CRAFT.</figDesc><table><row><cell>Model</cell><cell>Ontology</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell><cell>Jaccard Similarity</cell></row><row><cell>CW-BiGRU</cell><cell>Cell</cell><cell>0.92</cell><cell>0.92</cell><cell>0.92</cell><cell>0.925</cell></row><row><cell>CW-BiGRU</cell><cell>Protein</cell><cell>0.91</cell><cell>0.90</cell><cell>0.90</cell><cell>0.917</cell></row><row><cell>CW-BiGRU</cell><cell>CHEBI</cell><cell>0.86</cell><cell>0.87</cell><cell>0.86</cell><cell>0.882</cell></row><row><cell>CW-BiGRU</cell><cell>GO</cell><cell>0.84</cell><cell>0.84</cell><cell>0.83</cell><cell>0.843</cell></row><row><cell>CW-BiGRU</cell><cell>Sequence</cell><cell>0.83</cell><cell>0.86</cell><cell>0.84</cell><cell>0.864</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1">August 7-10, 2018  </note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Gene ontology: tool for the unification of biology</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ashburner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Ball</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Blake</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Botstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Butler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Cherry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Dolinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Dwight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Eppig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature genetics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">25</biblScope>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dahdul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Dececchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mabee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Database</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">NCBO Annotator: semantic annotation of biomedical data</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jonquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Youn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Musen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Callendar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Storey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference, Poster and Demo session</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Mungall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Mcmurry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Köhler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Balhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Borromeo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Carbon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Conlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dunn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Engelstad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucleic acids research</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">D1</biblScope>
			<biblScope unit="page" from="D712" to="D722" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Text mining and ontologies in biomedicine: making sense of raw text</title>
		<author>
			<persName><forename type="first">I</forename><surname>Spasic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ananiadou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mcnaught</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Briefings in bioinformatics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="239" to="251" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Charaparser+ eq: Performance evaluation without gold standard</title>
		<author>
			<persName><forename type="first">H</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dahdul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Dececchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mabee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Balhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gopalakrishnan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the Association for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Neural architectures for named entity recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ballesteros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kawakami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.01360</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep learning with word embeddings improves biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Habibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Wiegandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="i37" to="i48" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Long short-term memory rnn for biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">462</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Cross-type biomedical named entity recognition with deep multi-task learning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zitnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Langlotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1801.09851</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Concept annotation in the craft corpus</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eckert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shipley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sitnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">A</forename><surname>Baumgartner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Verspoor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Blake</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">161</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Semantic similarity in biomedical ontologies</title>
		<author>
			<persName><forename type="first">C</forename><surname>Pesquita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Faria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Falcao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Couto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS computational biology</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page">e1000443</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Learning internal representations by error propagation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Rumelhart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">California Univ San Diego La Jolla Inst for Cognitive Science</title>
		<imprint>
			<date type="published" when="1985">1985</date>
			<publisher>Tech. Rep</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Fundamentals of neural networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Fausett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Architectures, Algorithms, and Applications</title>
				<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Learning phrase representations using rnn encoder-decoder for statistical machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Merriënboer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gulcehre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bougares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1406.1078</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
