<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Arguello Casteleiro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>G. Demetriou</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>W. Read</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M.J. Fernandez-Prieto</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D. Maseda-Fernandez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>G. Nenadic</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. Klein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. Keane</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Stevens</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut National de la Santé et de la Recherche Medicale (INSERM) U1048</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Manchester Institute of Biotechnology, University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Midcheshire Hospital Foundation Trust</institution>
          ,
          <addr-line>NHS England</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>School of Languages, University of Salford</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background: Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. As of today, this is a challenging task related to information retrieval, and in the realm of Big Data Analytics. Objectives: To investigate the feasibility of using word embeddings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology (CVDO) as a step to identifying omics information encoded in the biomedical literature. Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. Then the abstracts of selected papers from the sysVASC systematic review were manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second experiment enriched/expanded the annotated terms using terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that by enriching the original annotated terms, a better context is provided, and therefore, it is easier to obtain suitable (full and/or partial) term variants for gene/protein names from word embeddings. Results: From the papers manually annotated, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. 
Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from word embeddings for query terms present in the generated lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as: TV (term variant); PTV (partial term variant); and NTV (non term variant, meaning none of the previous two categories). In experiment I, Skip-gram finds twice as many (full and/or partial) term variants for gene/protein names as CBOW. Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for two domain experts was 0.80 for the first experiment and 0.74 for the second experiment. In the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100. Conclusion: This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus with more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature. * Contact: robert.stevens@manchester.ac.uk</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        According to the World Health Organisation, cardiovascular
diseases (CVDs) are the number one cause of death globally [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
sysVASC project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] seeks to provide a comprehensive systems
medicine approach to elucidate pathological mechanisms for CVD,
which will yield molecular targets for therapeutic intervention. To
achieve this aim it is necessary to gather and integrate data from
omics (e.g. genomics, transcriptomics, proteomics and
metabolomics) experiments.
      </p>
      <p>The CVD ontology (CVDO) is being developed as part of the
sysVASC project to provide the infrastructure to integrate omics data
that encapsulate findings published in the scientific literature. The
CVDO ontology has 172,121 UniProtKB protein classes related to
human, and 86,792 UniProtKB protein classes related to mouse. Of
these, only 8,196 UniProtKB protein classes (i.e.
reviewed Swiss-Prot and unreviewed TrEMBL entries, along with isoform
sequences) from mouse and human have so far been identified as of
potential interest to the sysVASC project. An important part of the
manual curation effort is to tie experimental findings to the
biomedical scientific literature. However, even a project like
sysVASC cannot afford a team of researchers or curators who
can survey the literature regularly and deal with the fundamental
task of identifying gene and protein names as a preliminary step to
identifying the omics information encoded in the biomedical text.</p>
      <p>
        PubMed queries were the starting point of a systematic literature
review performed for sysVASC to obtain the omics studies that
underpin CVDO and the CVD Knowledge Base (CVDKB).
PubMed is a database from the U.S. National Library of Medicine
(NLM) with millions of citations from MEDLINE, life science
journals, and online books. In June 2016, PubMed contained 26
million citations with an average of 1.5 papers added per minute
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Keeping CVDKB up-to-date is a challenge shared with
systematic reviews that aim to keep updated with the best evidence
reported in the literature. As of today, searching through
biomedical literature and appraising information from relevant documents
is extremely time consuming [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4,5,6</xref>
        ]. Furthermore, omics is a
demanding area, where the irregularities and ambiguities in gene and
protein nomenclature remain a challenge [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ]. Krauthammer and
Nenadic [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] highlight: “successful term identification is key to
getting access to the stored literature information, as it is the terms
(and their relationships) that convey knowledge across scientific
articles”. The identification of biological entities in the field of
systems biology has proven difficult due to term variation and term
ambiguity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], because a concept can be expressed by various
realisations (a.k.a. term variants). A large-scale database such as
MEDLINE/PubMed contains longer words and phrases (e.g.
“serum amyloid A-1 protein”) as well as shorter forms like
abbreviations or acronyms (e.g. “SAA”). Finding all the term variants in
text is important for improving the results of information retrieval
(IR) systems like the PubMed search engine, which traditionally
rely on keyword-based approaches. Therefore, the number of
documents retrieved is prone to change when using acronyms instead
of and/or in combination with full terms [
        <xref ref-type="bibr" rid="ref11 ref12">11,12</xref>
        ].
      </p>
      <p>
        This paper investigates the feasibility of using Deep Learning,
an emerging area of artificial neural networks, for identifying gene
and protein names of interest for sysVASC in biomedical text.
More specifically, we propose to use the two neural language
models Skip-gram and CBOW (Continuous Bag-of-Words) of Mikolov
et al. [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ] to produce word embeddings, which are distributed
word representations typically induced using neural language
models. These word embeddings can be traced back to PubMed
citations, and can be also linked to the CVDO classes formalised in the
CVD Ontology represented in the W3C Web Ontology Language
(OWL) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>
        In terms of information/knowledge extraction from texts, over the
years, the knowledge engineering (KE) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] paradigm has lost
popularity in favour of the machine learning (ML) paradigm. ML
algorithms learn input-output relations from examples with the
goal of interpreting new inputs; therefore, the performance of ML
methods is heavily dependent on the choice of data representation
(or features) to which they are applied [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Representing words as
continuous vectors has a long history where different types of
models have been proposed to estimate continuous representation
of words and create distributional semantic models (DSMs). DSMs
derive representations for words in such a way that words
occurring in similar contexts will have similar representations, and
therefore, the context needs to be defined. Some examples of
context in DSMs include: Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] which
generally uses an entire document as a context (i.e. word-document
models), and Hyperspace Analog to Language (HAL) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] which
uses a sliding word window as a context (i.e. sliding window
models). More recently, Random Indexing [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] has emerged as a
promising alternative to LSA. LSA, HAL, and Random Indexing are
spatially motivated DSMs. Examples of probabilistic DSMs are
Probabilistic LSA (PLSA) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and Latent Dirichlet Allocation
(LDA) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. While spatial DSMs compare terms using distance
metrics in high-dimensional space [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], probabilistic DSMs
measure similarity between terms according to the degree to which they
share the same topic distributions [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Most DSMs have high
computational and storage cost associated with building the model
or modifying it due to the huge number of dimensions involved
when a large corpus is modelled [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Although neural models are
not new in DSMs, recent advances in artificial neural networks
(ANNs) make it feasible to derive word representations from corpora of
billions of words: hence the growing interest in Deep Learning and
the neural language models CBOW and Skip-gram of Mikolov et
al. [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ].
      </p>
      <p>
        In a relatively short time, CBOW and Skip-gram have gained
popularity to the point of being used for benchmarking word
embeddings [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] or as baseline models for performance comparisons
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. We propose applying the Mikolov et al. [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ] neural language
models, which can be trained to produce high-quality word
embeddings on English Wikipedia [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], to automatically extract terms
(gene and protein nomenclature) from 14,056,761 free-text
unannotated MEDLINE/PubMed citations (title and abstract). Our
hypothesis is that word embeddings of high quality should generate
useful lists of term variants. As of today, the application of
Mikolov et al. [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ] CBOW and Skip-gram to the biomedical
literature remains largely unexplored with only some pioneering
work [
        <xref ref-type="bibr" rid="ref27 ref28">27,28</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>METHODS</title>
      <sec id="sec-4-1">
        <title>The CVD Ontology and its Knowledge Base</title>
        <p>
          The CVD ontology (CVDO) provides the infrastructure to
integrate the omics data from multiple biological resources, such as the
UniProt Knowledgebase (UniProtKB) [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], the miRBase [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] from
EMBL-EBI, and the Human Metabolome Database (HMDB) [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ].
At the core of CVDO is the Ontology for Biomedical
Investigations (OBI) [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] along with other reference ontologies produced by
the OBO Consortium, such as the Protein Ontology (PRO) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ],
the Sequence Ontology (SO) [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], the three Gene Ontology (GO)
sub-ontologies [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], Chemical Entities of Biological Interest
Ontology (ChEBI) [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], NCBI Taxonomy Ontology [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], the Cell
Ontology (CL) [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], the Uber Anatomy Ontology (UBERON)
[
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], Phenotypic Quality Ontology (PATO) [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], and Relationship
Ontology (RO) [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ].
        </p>
        <p>
          In terms of knowledge modelling, CVDO shares the
protein/gene representation used in the Proteasix Ontology (PxO) [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>PubMed: from XML to RDF datasets</title>
        <p>
          From the FTP server of the U.S. NLM, we downloaded the
MEDLINE/PubMed baseline files for 2015 and also the update
files up to 8th June 2016. We created a processing pipeline written
in Python that allows the conversion of the downloaded PubMed
XML files into W3C Resource Description Framework (RDF) [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]
datasets. This pipeline can also be reused to process the results of
PubMed searches.
        </p>
        <p>
          We performed a mapping between the PubMed XML elements
[
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] and terms from the Dublin Core Metadata Initiative (DCMI),
which has been taken up globally and has a publicly available RDF
Schema [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ].
        </p>
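        <p>The XML-to-RDF conversion described above can be sketched as follows. This is only an illustrative sketch and not the pipeline used in this study: the choice of the three Dublin Core terms (identifier, title, abstract), the http://example.org/pubmed/ base IRI, and the function names are assumptions made for the example, and only the Python standard library is used.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
# Assumption: a hypothetical base IRI; citation IRIs are minted from the PMID.
BASE = "http://example.org/pubmed/"

# A tiny hand-written record using real PubMed XML element names.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def literal(text):
    """Escape a string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def citations_to_ntriples(xml_text):
    """Yield one N-Triples line per mapped field of each citation."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        subject = "<" + BASE + pmid + ">"
        fields = {
            "identifier": pmid,
            "title": citation.findtext("Article/ArticleTitle"),
            # An abstract may be split into several AbstractText elements.
            "abstract": " ".join(
                "".join(node.itertext()) for node in citation.iter("AbstractText")
            ),
        }
        for term, value in fields.items():
            if value:
                yield f"{subject} <{DCTERMS}{term}> {literal(value)} ."


triples = list(citations_to_ntriples(SAMPLE))
```
]]></preformat>
        <p>Each citation yields one triple per populated field, so the same function could be applied to baseline files, update files, or the XML returned by a PubMed search.</p>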
        <p>
          When pre-processing the textual input for Mikolov et al. [
          <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
          ]
CBOW and Skip-gram, it is common practice systematically to
lower-case the text and to remove all numbers. However, this is
unsuitable when dealing with protein/gene names, because critical
information will be lost. To further illustrate this: for human,
nonhuman primates, chickens, and domestic species, gene symbols
contain three to six alphanumeric characters that are all in
uppercase (e.g. OLR1), while for mice and rats the first letter alone is in
uppercase (e.g. Olr1). Therefore, we have introduced some ad hoc
rules as part of the pre-processing to guarantee that protein/gene
names are preserved.
        </p>
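        <p>The ad hoc rules themselves are not listed here, so the following is a hypothetical sketch of one such rule rather than the rule set used in this study. It assumes that a token of three to six alphanumeric characters is a candidate gene symbol when it is entirely upper-case or contains a digit; such tokens keep their case and digits, while all other tokens are lower-cased and bare numbers are dropped.</p>
        <preformat><![CDATA[
```python
import re

# Assumption: candidate symbols start with a letter and have 3-6 alphanumeric
# characters in total. This pattern is illustrative only.
SYMBOL = re.compile(r"^[A-Za-z][A-Za-z0-9]{2,5}$")


def looks_like_symbol(token):
    """Return True for tokens such as OLR1 (human) or Olr1 (mouse)."""
    if not SYMBOL.match(token):
        return False
    return token.isupper() or any(ch.isdigit() for ch in token)


def preprocess(text):
    """Lower-case ordinary words and drop bare numbers, keeping symbols intact."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    kept = []
    for token in tokens:
        if looks_like_symbol(token):
            kept.append(token)        # preserve case and digits
        elif token.isdigit():
            continue                  # usual step: remove plain numbers
        else:
            kept.append(token.lower())
    return kept
```
]]></preformat>
        <p>A single rule of this kind will also keep ordinary upper-case abbreviations, which is one reason why a real rule set needs several complementary checks.</p>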
      </sec>
      <sec id="sec-4-3">
        <title>Deep Learning: word embeddings using word2vec</title>
        <p>
          This study looks at neural language models, i.e. distributed
representation of words learnt by artificial neural networks (ANNs). We
adopted the new log-linear models that try to minimise
computational complexity. The CBOW and Skip-gram model architecture
[
          <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
          ] is similar to the probabilistic feedforward neural network
language model (NNLM). The feedforward NNLM proposed by
Bengio et al. [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] consists of input, projection, hidden, and output
layers. In the CBOW and Skip-gram model architecture, the
nonlinear hidden layer is removed and the projection layer is shared
for all the words, so all words are projected into the same position
(their vectors are averaged) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The Skip-gram model uses the
current word to predict surrounding words, while the CBOW
model predicts the current word based on the context.
        </p>
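        <p>The difference between the two prediction tasks can be made concrete by listing the training examples each model derives from a sliding window. The sketch below is a simplified illustration (function names and the window size are assumptions, and no neural network is trained): CBOW pairs a set of context words with the current word, whereas Skip-gram pairs the current word with each surrounding word.</p>
        <preformat><![CDATA[
```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words) -> current word."""
    return [(tuple(context_window(tokens, i, window)), word)
            for i, word in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: current word -> each surrounding word."""
    return [(word, neighbour)
            for i, word in enumerate(tokens)
            for neighbour in context_window(tokens, i, window)]


sentence = ["serum", "amyloid", "A-1", "protein"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```
]]></preformat>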
        <p>
          The basic Skip-gram formulation uses the softmax function [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
The hierarchical softmax is a computationally efficient
approximation of the full softmax. If W is the number of words in the lexicon,
hierarchical softmax only needs to evaluate about log2(W) output
nodes to obtain the probability distribution, instead of needing to
evaluate W output nodes.
        </p>
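        <p>As a rough worked example of this saving, the snippet below compares the number of output nodes evaluated by the full softmax and by the hierarchical softmax. The lexicon size of 9 million is taken from the figure reported for this study ("more than 9 million terms") and is used here as a round number.</p>
        <preformat><![CDATA[
```python
import math

lexicon_size = 9_000_000                      # W, rounded down from "more than 9 million"
nodes_full_softmax = lexicon_size             # full softmax evaluates W output nodes
nodes_hierarchical = math.log2(lexicon_size)  # about 23.1 nodes per prediction
```
]]></preformat>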
        <p>
          word2vec [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ] is the software package used in this study. It was
initially released as open-source software and is faster than its Python
counterpart implementation from Gensim [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]. Using word2vec
to train CBOW and Skip-gram with hierarchical softmax, we
obtain: 1) a lexicon (i.e. a list of terms, typically multi-words) in
textual format that is constructed from the input data; and 2) the
resulting vectors of the neural DSM in binary mode. In
distributional semantics a well-known similarity measure is cosine
similarity, i.e. the cosine of the angle between two vectors of n
dimensions. If the cosine is close to zero, the two vectors are considered
dissimilar, while if it is close to one, this indicates a high similarity
between the two vectors.
        </p>
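        <p>A minimal sketch of ranking lexicon terms by cosine similarity is given below. It is not the word2vec implementation: the function names are assumptions, and the three-dimensional toy vectors are invented purely for illustration (real embeddings have many more dimensions).</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_ranked(query, vectors, k=12):
    """Return the k lexicon terms whose vectors are closest to the query term."""
    if query not in vectors:
        return []  # query term absent from the lexicon
    scores = [(term, cosine(vectors[query], vec))
              for term, vec in vectors.items() if term != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors, invented for illustration only.
toy = {
    "OLR1": [0.9, 0.1, 0.0],
    "LOX-1": [0.8, 0.2, 0.1],
    "aspirin": [0.0, 0.1, 0.9],
}
ranking = top_ranked("OLR1", toy, k=2)
```
]]></preformat>
        <p>With the toy vectors, the alias LOX-1 is ranked first for the query OLR1 because its vector points in almost the same direction.</p>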
      </sec>
      <sec id="sec-4-4">
        <title>Integrating CVDO and word embeddings</title>
        <p>
          The terms from the word embeddings lexicon can be traced back to
PubMed citations. Among these terms, there are suitable (full
and/or partial) term variants for gene/protein names that can also
be linked to the CVDO classes in the CVD ontology. To perform
the linkage between word embedding terms and CVDO classes, we
looked at the Simple Knowledge Organization System (SKOS)
[
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], which is a W3C standard aimed at leveraging the power of
linked data. In SKOS there are three properties to attach lexical
labels to conceptual resources [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ]: 1) the preferred lexical label
(i.e. skos:prefLabel); 2) the alternative lexical label (i.e.
skos:altLabel) for synonyms and acronyms; and 3) the hidden
lexical label (i.e. skos:hiddenLabel) for including misspelled variants
of other lexical labels or a string for text-based indexing. All of
these can be considered annotation properties (i.e.
owl:AnnotationProperty), and allow limited linguistic information only. In this
study, we propose using skos:hiddenLabel to store plausible term
variants derived from word embeddings for gene and protein
classes from the CVD ontology.
        </p>
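        <p>To illustrate the proposal, the sketch below renders skos:hiddenLabel statements for a class as Turtle-compatible triples. The class IRI under http://example.org/cvdo/ and the function name are hypothetical; the actual CVDO class IRIs are not given here. LOX-1 is used as the example variant because it is a well-known alias of the gene symbol OLR1.</p>
        <preformat><![CDATA[
```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def hidden_label_triples(class_iri, variants, lang="en"):
    """Render one triple per term variant, using skos:hiddenLabel."""
    lines = []
    for variant in variants:
        escaped = variant.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{class_iri}> <{SKOS}hiddenLabel> "{escaped}"@{lang} .')
    return "\n".join(lines)


# Hypothetical class IRI, for illustration only.
turtle = hidden_label_triples("http://example.org/cvdo/OLR1", ["LOX-1"])
```
]]></preformat>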
      </sec>
      <sec id="sec-4-5">
        <title>Experimental setup</title>
        <p>A gold standard was created using 25 papers that meet the
inclusion and exclusion criteria of the sysVASC systematic review.
The original PubMed query was: “coronary heart disease
AND (proteomics OR proteome OR transcriptomics OR
transcriptome OR metabolomics OR metabolome OR omics)”. From the
abstracts of these papers, a total of 107 terms were manually annotated as
protein/gene names. Each term was mapped to a CVD ontology
class to uniquely identify the conceptual entity (gene/protein) to
which the annotated term refers. This can be seen as a term
standardisation process. Table 1 illustrates the mapping performed; for
example, the annotated term “superoxide dismutase 3” is mapped to the
CVDO class “(P08294) Extracellular superoxide dismutase [Cu-Zn] [SOD3]”.
The few examples from Table 1 show the lack of standardisation in
the field, and illustrate some of the alternative terms from the
literature that refer to the conceptual entities (gene/protein) of interest.</p>
        <p>In this study we conducted two experiments:
• Experiment I – we use the annotated terms from selected
papers of the sysVASC systematic review alone (as they
appear in the paper abstracts/titles) to obtain the list of 12
top-ranked terms (highest cosine value) from the CBOW
and Skip-gram word embeddings. These are candidate term
variants.
• Experiment II – we enriched/expanded the original
annotated terms with terms that appear in the CVDO classes and
again produced a list of 12 top-ranked candidate term
variants from CBOW and Skip-gram word embeddings.
        </p>
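        <p>The difference between the two experiments lies in how the query is built. The sketch below is an assumption-laden illustration, not the procedure used in this study: it assumes that multi-word terms are joined with underscores in the lexicon (as in “ankyrin-B_gene”), and the toy lexicon and function names are invented. In experiment I the query contains only the annotated term; in experiment II it also contains terms taken from the CVDO class labels.</p>
        <preformat><![CDATA[
```python
def to_lexicon_token(term):
    """Assumption: multi-word terms are joined with underscores in the lexicon."""
    return term.replace(" ", "_")


def build_query(annotated_term, cvdo_terms, lexicon):
    """Keep the candidate query terms that have an entry in the lexicon."""
    candidates = [annotated_term] + list(cvdo_terms)
    tokens = [to_lexicon_token(term) for term in candidates]
    return [token for token in tokens if token in lexicon]


# Toy lexicon, invented for illustration only.
toy_lexicon = {"annexin_A4", "ANXA4", "OLR1"}

query_experiment_1 = build_query("annexin 4", [], toy_lexicon)
query_experiment_2 = build_query("annexin 4", ["annexin A4", "ANXA4"], toy_lexicon)
```
]]></preformat>
        <p>An empty query means that no candidate term variants can be retrieved for that annotated term, which is the situation the enrichment with CVDO terms is meant to avoid.</p>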
      </sec>
      <sec id="sec-4-6">
        <title>Human assessment</title>
        <p>
          Domain experts assessed all the lists of 12 top-ranked candidate
term variants obtained for experiments I and II using CBOW and
Skip-gram.
        </p>
        <p>
          Evaluation guidelines. We established a strict criterion to
mark the list of candidate terms produced by the word2vec word
embeddings. Following Nenadic et al. [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ], a candidate term was
marked as a term variant (TV for short) only when the term fell
within one of the following types of term variation: a) orthographic; b)
morphological; c) lexical; d) structural; or e) acronyms and
abbreviations. Considering the biomedical value of phraseological
expressions (e.g. “ankyrin-B_gene” or “CBS_deficiency”), we
marked them as partial term variants (PTV for short); however,
they had to refer to the same biomedical concept, i.e. protein or
gene name.
        </p>
        <p>
          Inter-annotator agreement. Two domain experts
following the above-mentioned evaluation guidelines assigned a simple
key code to each candidate term variant: TV, PTV, or NTV (non
term variant, meaning none of the previous two categories). The
inter-annotator agreement is based on the Kappa measure [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ],
widely used for inter-annotator agreement on classification tasks;
Kappa K is defined as K = (Pr(a) – Pr(e))/(1 – Pr(e)), where Pr(a)
is the relative observed agreement between annotators, and Pr(e) is
the chance agreement.
        </p>
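        <p>The Kappa definition above can be computed directly from two lists of key codes. The sketch below implements the unweighted form K = (Pr(a) – Pr(e))/(1 – Pr(e)) exactly as defined in the text, not a weighted variant; the function name and the six example labels are invented for illustration and are not data from this study.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two equally long lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Pr(a): proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pr(e): agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Note: undefined (division by zero) when Pr(e) equals 1.
    return (observed - expected) / (1 - expected)


# Invented labels, for illustration only.
labels_a = ["TV", "TV", "PTV", "NTV", "NTV", "NTV"]
labels_b = ["TV", "PTV", "PTV", "NTV", "NTV", "TV"]
kappa = cohen_kappa(labels_a, labels_b)
```
]]></preformat>
        <p>In this toy case the raters agree on four of six items (Pr(a) = 2/3) and the chance agreement is Pr(e) = 1/3, which gives K = 0.5.</p>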
      </sec>
    </sec>
    <sec id="sec-5">
      <title>RESULTS</title>
      <p>Using a virtual machine (VM) with 100 GB RAM and 32 CPUs at 4.0 GHz, we
obtained the word embeddings from the unannotated PubMed
corpus of 14,056,761 free-text MEDLINE/PubMed citations (title and
abstract) with Skip-gram (much slower than CBOW) after 17 hours
of processing. Owing to lack of space, we show here only some of
the results obtained.</p>
      <p>CBOW and Skip-gram used the same input and generated
the same lexicon; however, the resulting vectors of the neural DSM
in binary mode were different. Hence, the 12 top-ranked terms for
an input query term are likely to differ. For experiment I, only 77
of the 107 terms belong to the lexicon generated. For experiment
II, as the CVD ontology is used to provide more context for each
term, only 3 terms out of the 107 remained without a valid entry in
the lexicon. For experiment I, two domain experts (raters A and B)
assessed the 924 pairs of terms corresponding to 77 query terms.
For experiment II, there were 87 query terms, which include terms
from the human-readable labels of key classes (gene/proteins) from
the CVDO ontology and consider multiple alternatives;
thus, the same two domain experts assessed 1044 pairs of terms.</p>
      <p>To illustrate the results qualitatively, Table 4 (right
column) shows the terms annotated (TV, PTV, and NTV) by rater B
in experiment II using Skip-gram for OLR1, which is a gene
symbol. In experiment I, also using Skip-gram, no suitable term
variants were found for this symbol. It should be noted that some of the candidate
terms listed in Table 4 are well-known aliases of the gene symbol,
such as LOX-1.</p>
      <p>We observed that some of the protein names annotated from
sysVASC systematic review papers, like “annexin 4”, cannot
produce a suitable term variant as they do not appear as such in the
lexicon generated by CBOW and Skip-gram. However,
by enriching them with terms from the CVDO ontology, it is
feasible to obtain suitable term variants. For example, “annexin 4” can
be mapped to the full protein name “Annexin A4”, which has
UniProt accession number P09525 and gene symbol ANXA4. Indeed,
within the level of observed agreement between the two domain
experts, we can safely say that in the first experiment, suitable (full
and/or partial) term variants were found for 65 of the 107 terms. In
the second experiment, the number increased to 100. Hence, only 7
out of the total 107 remain without suitable (full and/or partial)
term variants.</p>
      <p>We also observed that the median rank (i.e. position in the
list of 12 top-ranked terms) for a TV agreed by raters A and B is 3
in both experiments I and II using Skip-gram. In other words, within
the level of observed agreement, a TV is likely to appear in the first
three positions of the 12 top-ranked terms.</p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION</title>
      <p>
        CBOW and Skip-gram have become the state of the art for
generating word embeddings. From a quantitative point of view, this
study shows that using Skip-gram the number of term variants (TV
and/or PTV) for proteins/genes is substantially increased in
comparison with CBOW. For experiment II, i.e. when the terms
annotated from the sysVASC systematic review papers are
enriched/expanded with terms from the CVD ontology, the number
of suitable (full and/or partial) term variants for gene/protein names
increases. The explanation seems quite straightforward, as the
Skip-gram model takes the word window as a context and predicts
surrounding words given the current word [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. With the aid of the
CVD ontology, we can get terms that provide a more pertinent
context by: a) enriching a gene symbol with parts of the protein
name; or b) including more than one term related to a protein
name. Hence the better results, which in our case means more term
variants.
      </p>
      <p>
        Detecting term variants can be useful for a variety of curation
and annotation tasks. In both experiments I and II, the TVs on which
the raters agreed are likely to appear in the first three positions of the 12
top-ranked terms; this finding can be the basis of a systematic
approach to obtain plausible term variants for the 258,913
UniProtKB protein classes from the CVD ontology. As we proposed in
section 3.4 (Integrating CVDO and word embeddings), plausible term
variants from word embeddings can be
easily stored in the CVD ontology by means of the annotation
property skos:hiddenLabel. Therefore, when querying the CVD
ontology using the query language SPARQL 1.1 [
        <xref ref-type="bibr" rid="ref52">52</xref>
        ] for a protein
that may appear in the biomedical literature, it possible to use both
rdf:label and skos:hiddenLabel. However, the terms stored in the
skos:hiddenLabel are more likely to give pertinent results because
they are derived from the word embeddings obtained from the 14
million PubMed citations (titles and abstracts), i.e. from the
biomedical literature itself. Furthermore, by having transformed
PubMed citations into RDF datasets, it is feasible to annotate a
PubMed citation not only with MeSH headings/descriptors, or
keywords from authors, but also with the terms for the lexicon
generated by CBOW and Skip-gram. Thus, it is possible to envision
more sophisticated SPARQL 1.1 SELECT queries that are able to
retrieve the PubMed citations themselves. Furthermore, from a
computational point of view the process described here is
affordable and sustainable: new PubMed citations can be converted into
RDF on a daily basis; Skip-gram can re-generate the lexicon and
the vectors in less than a day for 14 million PubMed citations
(titles and abstracts); terms from the human-readable labels of key
classes (genes/proteins) from the CVDO can be used as
query terms to retrieve the top-ranked terms from the re-generated
word embeddings, and the three top-ranked terms (plausible
term variants) can be stored as literal values of skos:hiddenLabel.
Hence, periodic updates are feasible.
      </p>
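      <p>The periodic update described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline used in this study: the toy vectors and term names, the helper functions (cosine, top_ranked, hidden_label_triples, label_query), the cvdo: prefix and the class identifier EXAMPLE_0001 are all hypothetical, and cosine similarity is assumed as the ranking measure. The query string further assumes that the rdfs: and skos: prefixes are bound by the SPARQL client.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_ranked(query, embeddings, n=3):
    """Return the n terms most similar to the query term, excluding the query."""
    q = embeddings[query]
    scored = [(cosine(q, vec), term)
              for term, vec in embeddings.items() if term != query]
    scored.sort(reverse=True)
    return [term for _, term in scored[:n]]


def hidden_label_triples(class_id, terms):
    """Render Turtle-style statements attaching each term as a skos:hiddenLabel."""
    return ['cvdo:{} skos:hiddenLabel "{}" .'.format(class_id, t) for t in terms]


def label_query(term):
    """SPARQL 1.1 SELECT that matches a term against either label property.

    No escaping of quotes in the term is attempted in this sketch.
    """
    return (
        "SELECT DISTINCT ?cls WHERE { "
        "{ ?cls rdfs:label ?lab } UNION { ?cls skos:hiddenLabel ?lab } "
        'FILTER (LCASE(STR(?lab)) = "' + term.lower() + '") }'
    )


# Hypothetical 3-dimensional vectors; real ones would come from CBOW or Skip-gram.
toy = {
    "gene_x": [0.9, 0.1, 0.0],
    "gene_x_protein": [0.8, 0.2, 0.1],
    "gx": [0.7, 0.1, 0.2],
    "gene_w": [0.6, 0.5, 0.1],
    "drug_y": [0.0, 0.1, 0.9],
}

variants = top_ranked("gene_x", toy)                        # three candidates
triples = hidden_label_triples("EXAMPLE_0001", variants)    # statements to store
query = label_query("Gene_X")                               # lookup by either label
```
      </preformat>
      <p>Keeping only the first three ranked terms mirrors the observation that agreed term variants tend to appear in those positions; in practice the cut-off could be revisited as the embeddings are re-generated.</p>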
      <p>Although text mining technology has made great strides in
extracting biomedical terminology from unstructured text sources,
the task of normalising (grounding) the extracted terms to
commonly used identifiers in ontologies or taxonomies is still quite
demanding. Identifying equivalent text realisations for the same
biomedical concept can be useful for (i) improving the quality of
information in curated resources such as UniProt or the Gene
Ontology, and (ii) linking the information in these resources back
to the original text sources; this is helpful when a greater context
needs to be explored or for keeping up-to-date with the published
literature.</p>
      <p>Another potential application is in the area of query expansion
for Information Retrieval (IR). Although query enhancement using
synonyms is commonly deployed by many of today’s IR systems,
it is often more difficult to deal with cases of orthographic
variations or when new acronyms/abbreviations are introduced for new
terms. Identifying term variants can be a way of ameliorating the
effect of the classical problem of IR returning either too much or
too little for a user query.</p>
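      <p>As a rough illustration of query expansion with term variants, the sketch below joins a query term and its variants into one Boolean expression. The function name expand_query, the example terms and the quoted OR syntax are assumptions made for illustration; an actual IR system may expect a different query syntax.</p>
      <preformat>
```python
def expand_query(term, variants):
    """Join a query term and its variants into a single OR expression."""
    seen = []
    for t in [term] + list(variants):
        if t not in seen:          # drop duplicates while keeping rank order
            seen.append(t)
    return " OR ".join('"{}"'.format(t) for t in seen)


# Hypothetical term and variants; the repeated entry is removed.
expanded = expand_query("gene x", ["gene_x_protein", "gx", "gene x"])
```
      </preformat>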
      <p>Lastly, text mining developers, especially those dealing with
rule-based systems, can benefit from unsupervised automated
techniques such as the one described in this paper, for building
terminological resources from large untagged corpora. Such
resources include both terminology lexica and grammars, either
manually developed or compiled via grammar induction
techniques. The usefulness of this approach for specific annotation
tasks will be the subject of future work.</p>
      <p>
        From a research perspective, data integration is not a new
challenge in the life sciences. Gomez-Cabrero et al. [
        <xref ref-type="bibr" rid="ref53">53</xref>
        ] state: “there is
a need for improved (and novel) annotation standards and
requirements in data repositories to enable better integration and
reuse of publically available data”. To the best of our knowledge,
this is the first time that word embeddings from Deep Learning and
an ontology in OWL have been put together with the aim of
linking ontology classes to terms derived from a large corpus of
biomedical literature in an unsupervised way and without requiring
an annotated corpus.
      </p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>This study demonstrates the benefits of using terms from the
CVDO ontology classes to obtain more pertinent term variants for
gene/protein names from word embeddings generated from an
unannotated corpus with more than 14 million PubMed citations.
As the term variants are induced from the biomedical literature,
they can facilitate data tagging and semantic indexing tasks.
Overall, our study explores the feasibility of obtaining methods that
scale when dealing with big data, and which enable automation of
deep semantic analysis and markup of textual information from
unannotated biomedical literature.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>To Prof Iain Buchan and Stephen Walker for useful discussions;
and to Timothy Furmston for helping with the software and
e-infrastructure.</p>
      <p>Funding: This work was supported by a grant from the European
Union Seventh Framework Programme (FP7/2007-2013) for
the sysVASC project under grant agreement number 603288.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          World Health Organization -
          <article-title>Cardiovascular diseases (CVDs)</article-title>
          . Available at http://www.who.int/cardiovascular_diseases/en/.
          Accessed 16 June
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] sysVASC project, http://cordis.europa.eu/project/rcn/111200_en.html. Accessed 16 June
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          PubMed - Detailed Indexing Statistics: 1965-2015, https://www.nlm.nih.gov/bsd/index_stats_comp.html.
          Accessed 16 June
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ely</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osheroff</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebell</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambliss</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinson</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevermer</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pifer</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Obstacles to answering doctors' questions about patient care with evidence: qualitative study</article-title>
          .
          <source>BMJ</source>
          ,
          <volume>324</volume>
          (
          <issue>7339</issue>
          ), p.
          <fpage>710</fpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Sarker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mollá</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Paris</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatic evidence quality prediction to support evidence-based decision making</article-title>
          .
          <source>Artificial intelligence in medicine</source>
          ,
          <volume>64</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>89</fpage>
          -
          <lpage>103</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hristovski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinevski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kastrin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rindflesch</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          :
          <article-title>Biomedical question answering using semantic relations</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>1</fpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tanabe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wilbur</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>Tagging gene and protein names in biomedical text</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>18</volume>
          (
          <issue>8</issue>
          ), pp.
          <fpage>1124</fpage>
          -
          <lpage>1132</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Garten</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coulet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Recent progress in automatically extracting information from the pharmacogenomic literature</article-title>
          .
          <source>Pharmacogenomics</source>
          ,
          <volume>11</volume>
          (
          <issue>10</issue>
          ), pp.
          <fpage>1467</fpage>
          -
          <lpage>1489</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Krauthammer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nenadic</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Term identification in the biomedical literature</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>37</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>512</fpage>
          -
          <lpage>526</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kell</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.I.</given-names>
          </string-name>
          :
          <article-title>Text mining and its potential applications in systems biology</article-title>
          .
          <source>Trends in biotechnology</source>
          ,
          <volume>24</volume>
          (
          <issue>12</issue>
          ), pp.
          <fpage>571</fpage>
          -
          <lpage>579</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Federiuk</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>The effect of abbreviations on MEDLINE searching</article-title>
          .
          <source>Academic emergency medicine</source>
          ,
          <volume>6</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>292</fpage>
          -
          <lpage>296</lpage>
          . (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Wren</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pustejovsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garner</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Biomedical term mapping databases</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <source>33(suppl 1)</source>
          , pp.
          <fpage>D289</fpage>
          -
          <lpage>D293</lpage>
          . (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fokoue</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Owl 2 web ontology language: Profiles</article-title>
          .
          <source>W3C recommendation</source>
          ,
          <volume>27</volume>
          ,
          <issue>61</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benjamins</surname>
            ,
            <given-names>V.R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Knowledge engineering: principles and methods</article-title>
          .
          <source>Data &amp; knowledge engineering</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>161</fpage>
          -
          <lpage>197</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Editorial introduction to the Neural Networks special issue on Deep Learning of Representations</article-title>
          .
          <source>Neural networks: the official journal of the International Neural Network Society</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T. K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          :
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological review</source>
          ,
          <volume>104</volume>
          (
          <issue>2</issue>
          ),
          <fpage>211</fpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Lund</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Producing high-dimensional semantic spaces from lexical co-occurrence</article-title>
          .
          <source>Behavior Research Methods, Instruments, &amp; Computers</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>208</lpage>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kristofersson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Holst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Random indexing of text samples for latent semantic analysis</article-title>
          .
          <source>In Proc. of the cognitive science society</source>
          (Vol.
          <volume>1036</volume>
          ). Mahwah, NJ: Erlbaum. (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Probabilistic latent semantic indexing</article-title>
          .
          <source>In Proc. of ACM SIGIR conference on Research and development in information retrieval. ACM</source>
          . pp.
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>the Journal of machine Learning research, 3</source>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Widdows</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Empirical distributional semantics: methods and biomedical applications</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>390</fpage>
          -
          <lpage>405</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Jonnalagadda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A distributional semantics approach to simultaneous recognition of multiple classes of named entities</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing</source>
          . Springer Berlin Heidelberg. pp.
          <fpage>224</fpage>
          -
          <lpage>235</lpage>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient non-parametric estimation of multiple embeddings per word in vector space</article-title>
          .
          <source>EMNLP</source>
          <year>2014</year>
          :
          <fpage>1059</fpage>
          -
          <lpage>1069</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A novel word embedding learning model using the dissociation between nouns and verbs</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          , pp.
          <fpage>1108</fpage>
          -
          <lpage>1117</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakoski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Distributional semantics resources for biomedical text processing</article-title>
          .
          <source>In Proc. of Languages in Biology and Medicine</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Minarro-Giménez</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marín-Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Samwald</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Exploring the application of deep learning techniques on medical text corpora</article-title>
          .
          <source>Studies in health technology and informatics, 205</source>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>588</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Apweiler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bairoch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boeckmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasteiger</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magrane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>UniProt: the universal protein knowledgebase</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <source>32(suppl 1)</source>
          , pp.
          <fpage>D115</fpage>
          -
          <lpage>D119</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Griffiths-Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grocock</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Dongen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Enright</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>miRBase: microRNA sequences, targets and gene nomenclature</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>34</volume>
          (
          <issue>suppl 1</issue>
          )
          , pp.
          <fpage>D140</fpage>
          -
          <lpage>D144</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Wishart</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knox</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
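          <string-name>
            <surname>Young</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,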
          <string-name>
            <surname>Jewell</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arndt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sawhney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>HMDB: the human metabolome database</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>35</volume>
          (
          <issue>suppl 1</issue>
          )
          , pp.
          <fpage>D521</fpage>
          -
          <lpage>D526</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] OBI, http://www.obofoundry.org/ontology/obi.html</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] PRO, http://www.obofoundry.org/ontology/pr.html</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] SO, http://www.obofoundry.org/ontology/so.html</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] GO, http://www.obofoundry.org/ontology/go.html</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] ChEBI, http://www.obofoundry.org/ontology/chebi.html</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] NCBI, http://www.obofoundry.org/ontology/ncbitaxon.html</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] CL, http://www.obofoundry.org/ontology/cl.html</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] UBERON, http://www.obofoundry.org/ontology/uberon.html</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] PATO, http://www.obofoundry.org/ontology/pato.html</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] RO, http://www.obofoundry.org/ontology/ro.html</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <surname>Arguello Casteleiro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The Proteasix Ontology</article-title>
          .
          <source>Journal of biomedical semantics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ) (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <surname>Klyne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Resource description framework (RDF): Concepts and abstract syntax</article-title>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <article-title>MEDLINE/PubMed XML Data Elements</article-title>
          , https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>[45] DCMI, http://dublincore.org/schemas/rdfs/</mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ducharme</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Janvin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          , pp.
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>[47] word2vec, http://code.google.com/p/word2vec/</mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <surname>Rehurek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software framework for topic modelling with large corpora</article-title>
          .
          <source>In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <surname>Miles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bechhofer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SKOS simple knowledge organization system reference</article-title>
          .
          <source>W3C recommendation</source>
          ,
          <volume>18</volume>
          ,
          <issue>W3C</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <surname>Nenadic</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McNaught</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Enhancing automatic term recognition through recognition of variation</article-title>
          .
          <source>In Proc. of Computational Linguistics</source>
          (p.
          <fpage>604</fpage>
          ).
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A coefficient of agreement for nominal scales</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          ,
          <volume>20</volume>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>SPARQL 1.1 query language</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <volume>21</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <surname>Gomez-Cabrero</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abugessaisa</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teschendorff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merkenschlager</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gisel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballestar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bongcam-Rudloff</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conesa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tegnér</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Data integration in the era of omics: current and future challenges</article-title>
          .
          <source>BMC systems biology</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ), p.
          <fpage>1</fpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>