=Paper= {{Paper |id=Vol-1692/paperA |storemode=property |title=Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations |pdfUrl=https://ceur-ws.org/Vol-1692/paperA.pdf |volume=Vol-1692 |authors=Mercedes Argüello Casteleiro,George Demetriou,Warren J. Read,Maria Jesus Fernandez Prieto,Diego Maseda-Fernandez,Goran Nenadic,Julie Klein,John A. Keane,Robert Stevens |dblpUrl=https://dblp.org/rec/conf/odls/CasteleiroDRPMN16 }} ==Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations== https://ceur-ws.org/Vol-1692/paperA.pdf
Deep Learning meets Semantic Web: A feasibility study with
the Cardiovascular Disease Ontology and PubMed citations
M. Arguello Casteleiro1, G. Demetriou1, W. Read1, M.J. Fernandez-Prieto2, D.
Maseda-Fernandez3, G. Nenadic1,4, J. Klein5, J. Keane1,4 and R. Stevens1
1 School of Computer Science, University of Manchester, UK
2 School of Languages, University of Salford, UK
3 Midcheshire Hospital Foundation Trust, NHS England, UK
4 Manchester Institute of Biotechnology, University of Manchester, UK
5 Institut National de la Santé et de la Recherche Medicale (INSERM) U1048, Toulouse, France


ABSTRACT

Background: Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. As of today, this is a challenging task related to information retrieval, and in the realm of Big Data Analytics.

Objectives: To investigate the feasibility of using word embeddings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology (CVDO) as a step towards identifying omics information encoded in the biomedical literature.

Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. Then the abstracts of selected papers from the sysVASC systematic review were manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second experiment enriched/expanded the annotated terms using terms from the human-readable labels of key classes (genes/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that by enriching the original annotated terms, a better context is provided, and therefore, it is easier to obtain suitable (full and/or partial) term variants for gene/protein names from the word embeddings.

Results: From the papers manually annotated, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from the word embeddings for each query term present in the generated lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as: TV (term variant); PTV (partial term variant); and NTV (non term variant, meaning neither of the previous two categories). In experiment I, Skip-gram found twice as many (full and/or partial) term variants for gene/protein names as CBOW. Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for the two domain experts was 0.80 for the first experiment and 0.74 for the second experiment. In the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100.

Conclusion: This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus of more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.

* Contact: robert.stevens@manchester.ac.uk


1    INTRODUCTION

According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally [1]. The SysVASC project [2] seeks to provide a comprehensive systems medicine approach to elucidate pathological mechanisms for CVD, which will yield molecular targets for therapeutic intervention. To achieve this aim it is necessary to gather and integrate data from omics (e.g. genomics, transcriptomics, proteomics and metabolomics) experiments.

The CVD ontology (CVDO) is developed as part of the sysVASC project to provide the infrastructure to integrate omics data that encapsulate findings published in the scientific literature. The CVDO ontology has 172,121 UniProtKB protein classes related to human, and 86,792 UniProtKB protein classes related to mouse. Of these, so far a total of only 8,196 UniProtKB protein classes (i.e. reviewed Swiss-Prot; unreviewed TrEMBL; along with isoform sequences) from mouse and human have been identified as of potential interest to the sysVASC project. An important part of the manual curation effort is to tie experimental findings to the biomedical scientific literature. However, even a project like sysVASC cannot afford a team of researchers or curators who can survey the literature regularly and deal with the fundamental task of identifying gene and protein names as a preliminary step to identify the omics information encoded in the biomedical text.

PubMed queries were the starting point of a systematic literature review performed for sysVASC to obtain the omics studies that underpin CVDO and the CVD Knowledge Base (CVDKB). PubMed is a database from the U.S. National Library of Medicine
(NLM) with millions of citations from MEDLINE, life science journals, and online books. In June 2016, PubMed contained 26 million citations, with an average of 1.5 papers added per minute [3]. Keeping CVDKB up-to-date is a challenge shared with systematic reviews that aim to keep updated with the best evidence reported in the literature. As of today, searching through the biomedical literature and appraising information from relevant documents is extremely time consuming [4,5,6]. Furthermore, omics is a demanding area, where the irregularities and ambiguities in gene and protein nomenclature remain a challenge [7,8]. Krauthammer and Nenadic [9] highlight: “successful term identification is key to getting access to the stored literature information, as it is the terms (and their relationships) that convey knowledge across scientific articles”. The identification of biological entities in the field of systems biology has proven difficult due to term variation and term ambiguity [10], because a concept can be expressed by various realisations (a.k.a. term variants). A large-scale database such as MEDLINE/PubMed contains longer words and phrases (e.g. “serum amyloid A-1 protein”) as well as shorter forms like abbreviations or acronyms (e.g. “SAA”). Finding all the term variants in text is important for improving the results of information retrieval (IR) systems like the PubMed search engine, which traditionally rely on keyword-based approaches. Therefore, the number of documents retrieved is prone to change when using acronyms instead of, and/or in combination with, full terms [11,12].

This paper investigates the feasibility of using Deep Learning, an emerging area of artificial neural networks, for identifying gene and protein names of interest for sysVASC in biomedical text. More specifically, we propose to use the two neural language models Skip-gram and CBOW (Continuous Bag-of-Words) of Mikolov et al. [13,14] to produce word embeddings, i.e. distributed word representations typically induced using neural language models. These word embeddings can be traced back to PubMed citations, and can also be linked to the CVDO classes formalised in the CVD Ontology represented in the W3C Web Ontology Language (OWL) [15].


2    APPROACH

In terms of information/knowledge extraction from texts, over the years the knowledge engineering (KE) [16] paradigm has lost popularity in favour of the machine learning (ML) paradigm. ML algorithms learn input-output relations from examples with the goal of interpreting new inputs; therefore, the performance of ML methods is heavily dependent on the choice of data representation (or features) to which they are applied [17]. Representing words as continuous vectors has a long history, and different types of models have been proposed to estimate continuous representations of words and create distributional semantic models (DSMs). DSMs derive representations for words in such a way that words occurring in similar contexts have similar representations; therefore, the context needs to be defined. Some examples of context in DSMs include: Latent Semantic Analysis (LSA) [18], which generally uses an entire document as a context (i.e. word-document models), and Hyperspace Analogue to Language (HAL) [19], which uses a sliding word window as a context (i.e. sliding-window models). More recently, Random Indexing [20] has emerged as a promising alternative to LSA. LSA, HAL, and Random Indexing are spatially motivated DSMs. Examples of probabilistic DSMs are Probabilistic LSA (PLSA) [21] and Latent Dirichlet Allocation (LDA) [22]. While spatial DSMs compare terms using distance metrics in high-dimensional space [23], probabilistic DSMs measure similarity between terms according to the degree to which they share the same topic distributions [23]. Most DSMs have a high computational and storage cost associated with building the model or modifying it, due to the huge number of dimensions involved when a large corpus is modelled [29]. Although neural models are not new in DSMs, recent advances in artificial neural networks (ANNs) make feasible the derivation of word representations from corpora of billions of words: hence the growing interest in Deep Learning and the neural language models CBOW and Skip-gram of Mikolov et al. [13,14].

In a relatively short time, CBOW and Skip-gram have gained popularity to the point of being used for benchmarking word embeddings [25] or as baseline models for performance comparisons [26]. We propose applying the Mikolov et al. [13,14] neural language models, which can be trained to produce high-quality word embeddings on English Wikipedia [25], to automatically extract terms (gene and protein nomenclature) from 14,056,761 free-text unannotated MEDLINE/PubMed citations (title and abstract). Our hypothesis is that word embeddings of high quality should generate useful lists of term variants. As of today, the application of the Mikolov et al. [13,14] CBOW and Skip-gram models to the biomedical literature remains largely unexplored, with only some pioneering work [27,28].


3    METHODS

3.1    The CVD Ontology and its Knowledge Base

The CVD ontology (CVDO) provides the infrastructure to integrate the omics data from multiple biological resources, such as the UniProt Knowledgebase (UniProtKB) [29], miRBase [30] from EMBL-EBI, and the Human Metabolome Database (HMDB) [31]. At the core of CVDO is the Ontology for Biomedical Investigations (OBI) [32], along with other reference ontologies produced by the OBO Consortium, such as the Protein Ontology (PRO) [33], the Sequence Ontology (SO) [34], the three Gene Ontology (GO) sub-ontologies [35], the Chemical Entities of Biological Interest Ontology (ChEBI) [36], the NCBI Taxonomy Ontology [37], the Cell Ontology (CL) [38], the Uber Anatomy Ontology (UBERON) [39], the Phenotypic Quality Ontology (PATO) [40], and the Relations Ontology (RO) [41].

In terms of knowledge modelling, CVDO shares the protein/gene representation used in the Proteasix Ontology (PxO) [42].

3.2    PubMed: from XML to RDF datasets

Through the FTP server of the U.S. NLM we downloaded the MEDLINE/PubMed baseline files for 2015 and also the update files up to 8th June 2016. We created a processing pipeline written in Python that allows the conversion of the downloaded PubMed XML files into W3C Resource Description Framework (RDF) [43] datasets. This pipeline can also be reused to process the results of PubMed searches.

We performed a mapping between the PubMed XML elements [44] and terms from the Dublin Core Metadata Initiative (DCMI), which has been taken up globally and has a publicly available RDF Schema [45].

When pre-processing the textual input for the Mikolov et al. [13,14] CBOW and Skip-gram models, it is common practice systematically to
lower-case the text and to remove all numbers. However, this is unsuitable when dealing with protein/gene names, because critical information would be lost. To further illustrate this: for human, non-human primates, chickens, and domestic species, gene symbols contain three to six alphanumeric characters that are all in uppercase (e.g. OLR1), while for mice and rats only the first letter is in uppercase (e.g. Olr1). Therefore, we have introduced some ad hoc rules as part of the pre-processing to guarantee that protein/gene names are preserved.

3.3    Deep Learning: word embeddings using word2vec

This study looks at neural language models, i.e. distributed representations of words learnt by artificial neural networks (ANNs). We adopted the new log-linear models that try to minimise computational complexity. The CBOW and Skip-gram model architecture [13,14] is similar to the probabilistic feedforward neural network language model (NNLM). The feedforward NNLM proposed by Bengio et al. [46] consists of input, projection, hidden, and output layers. In the CBOW and Skip-gram model architecture, the non-linear hidden layer is removed and the projection layer is shared for all the words, so all words are projected into the same position (their vectors are averaged) [14]. The Skip-gram model uses the current word to predict the surrounding words, while the CBOW model predicts the current word based on the context.

The basic Skip-gram formulation uses the softmax function [14]. The hierarchical softmax is a computationally efficient approximation of the full softmax. If W is the number of words in the lexicon, hierarchical softmax only needs to evaluate about log2(W) output nodes to obtain the probability distribution, instead of needing to evaluate W output nodes.

word2vec [47] is the software package used in this study. It was initially released as open-source software and is faster than its Python counterpart implementation from Gensim [48]. Using word2vec with CBOW or Skip-gram and hierarchical softmax, we obtain: 1) a lexicon (i.e. a list of terms, typically multi-word) in textual format that is constructed from the input data; and 2) the resulting vectors of the neural DSM in binary format. In distributional semantics a well-known similarity measure is cosine similarity, i.e. the cosine of the angle between two vectors of n dimensions. If the cosine is close to zero, the two vectors are considered dissimilar, while if it is close to one, this indicates a high similarity between the two vectors.

3.4    Integrating CVDO and word embeddings

The terms from the word embeddings lexicon can be traced back to PubMed citations. Among these terms, there are suitable (full and/or partial) term variants for gene/protein names that can also be linked to the CVDO classes in the CVD ontology. To perform the linkage between word embedding terms and CVDO classes, we looked at the Simple Knowledge Organization System (SKOS) [49], which is a W3C standard aimed at leveraging the power of linked data. In SKOS there are three properties to attach lexical labels to conceptual resources [49]: 1) the preferred lexical label (i.e. skos:prefLabel); 2) the alternative lexical label (i.e. skos:altLabel) for synonyms and acronyms; and 3) the hidden lexical label (i.e. skos:hiddenLabel) for including misspelled variants of other lexical labels or strings for text-based indexing. All of these can be considered annotation properties (i.e. owl:AnnotationProperty), and allow limited linguistic information only. In this study, we propose using skos:hiddenLabel to store plausible term variants derived from word embeddings for the gene and protein classes from the CVD ontology.

3.5    Experimental setup

A gold standard was created using 25 papers that meet the inclusion and exclusion criteria of the sysVASC systematic review performed. The original PubMed query was: “coronary heart disease AND (proteomics OR proteome OR transcriptomics OR transcriptome OR metabolomics OR metabolome OR omics)”. Out of all the paper abstracts, a total of 107 terms were manually annotated as protein/gene names. Each term was mapped to a CVD ontology class to uniquely identify the conceptual entity (gene/protein) to which the annotated term refers. This can be seen as a term standardisation process. Table 1 illustrates the mapping performed.

Table 1. Example of terms from PubMed abstract/title (left column) mapped to labels for protein/gene from the CVD ontology (right column).

Term                       (UniProt AC) protein name [gene symbol]
alpha-1-antitrypsin,       (P01009) Alpha-1-antitrypsin [SERPINA1]
α(1)-antitrypsin
annexin 4                  (P09525) Annexin A4 [ANXA4]
superoxide dismutase 3     (P08294) Extracellular superoxide dismutase [Cu-Zn] [SOD3]

The few examples from Table 1 show the lack of standardisation in the field, and illustrate some of the alternative terms from the literature that refer to the conceptual entities (genes/proteins) of interest.

In this study we conducted two experiments:
•   Experiment I – we used the annotated terms from the selected papers of the sysVASC systematic review alone (as they appear in the paper abstracts/titles) to obtain the list of 12 top-ranked terms (highest cosine value) from the CBOW and Skip-gram word embeddings. These are candidate term variants.
•   Experiment II – we enriched/expanded the original annotated terms with terms that appear in the CVDO classes and again produced a list of 12 top-ranked candidate term variants from the CBOW and Skip-gram word embeddings.

3.6    Human assessment

Domain experts assessed all the lists of 12 top-ranked candidate term variants obtained for experiments I and II using CBOW and Skip-gram.

3.6.1 Evaluation guidelines. We established a strict criterion to mark the lists of candidate terms produced by the word2vec word embeddings. Following Nenadic et al. [50], a candidate term was marked as a term variant (TV for short) only when the term fell within the following types of term variation: a) orthographic; b) morphological; c) lexical; d) structural; or e) acronyms and abbreviations. Considering the biomedical value of phraseological expressions (e.g. “ankyrin-B_gene” or “CBS_deficiency”), we marked them as partial term variants (PTV for short); however, they had to refer to the same biomedical concept, i.e. protein or gene name.
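To make the retrieval step concrete, the sketch below combines the case-preserving pre-processing rule (Section 3.2) with a cosine-ranked nearest-neighbour lookup over word embeddings (Sections 3.3 and 3.5). It is a minimal illustration only: the regular expression for gene-like tokens and the toy vectors are assumptions made for the example, not the exact ad hoc rules or word2vec embeddings used in the study.

```python
import math
import re

# Heuristic for gene-symbol-like tokens (an assumption for this sketch):
# 3-6 all-uppercase alphanumerics (e.g. "OLR1"), or first-letter-only
# uppercase as in mouse/rat symbols (e.g. "Olr1").
GENE_LIKE = re.compile(r"^(?:[A-Z][A-Z0-9]{2,5}|[A-Z][a-z0-9]{2,5})$")

def preprocess(tokens):
    """Lower-case ordinary tokens but preserve gene-symbol-like tokens."""
    return [t if GENE_LIKE.match(t) else t.lower() for t in tokens]

def cosine(u, v):
    """Cosine of the angle between two n-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def top_ranked(query, vectors, n=12):
    """The n nearest terms to `query` in the lexicon, by cosine similarity."""
    q = vectors[query]
    neighbours = ((t, cosine(q, v)) for t, v in vectors.items() if t != query)
    return sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:n]
```

With word2vec-style vectors loaded into `vectors`, `top_ranked("OLR1", vectors)` would return a 12-candidate list of the kind the raters evaluated; Gensim's `most_similar` provides an equivalent ranking for models trained with its `Word2Vec` class.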
3.6.2 Inter-annotator agreement. Two domain experts, following the above-mentioned evaluation guidelines, assigned a simple key code to each candidate term variant: TV, PTV, or NTV (non term variant, meaning neither of the previous two categories). The inter-annotator agreement is based on the Kappa measure [51], widely used for inter-annotator agreement on classification tasks; Kappa K is defined as K = (Pr(a) - Pr(e)) / (1 - Pr(e)), where Pr(a) is the relative observed agreement between annotators, and Pr(e) is the chance agreement.


4    RESULTS

Using a VM with 100 GB RAM and 32 CPUs at 4.0 GHz, we obtained the word embeddings from the unannotated PubMed corpus of 14,056,761 free-text MEDLINE/PubMed citations (title and abstract) for Skip-gram (much slower than CBOW) after 17 hours of processing. Due to lack of space, we show here only some of the results obtained.

Both CBOW and Skip-gram used the same input and generated the same lexicon; however, the resulting vectors of the neural DSM in binary format were different. Hence, the 12 top-ranked terms for an input query term are likely to differ. For experiment I, only 77 of the 107 terms belong to the generated lexicon. For experiment II, as the CVD ontology is used to provide more context for each term, only 3 terms out of the 107 remained without a valid entry in the lexicon. For experiment I, two domain experts (raters A and B) assessed the 924 pairs of terms corresponding to the 77 query terms. For experiment II, there were 87 query terms, which included terms from the human-readable labels of key classes (genes/proteins) from the CVDO ontology and considered multiple alternatives; thus, the same two domain experts assessed 1044 pairs of terms.

Table 2. Experiment I: number of terms classified as TV (Term Variant); PTV (Partial TV); and NTV (non TV) by rater A.

Model        Term Variant    Partial TV    Non Term Variant
CBOW              77             93              754
Skip-gram        151            194              579

Table 3. Experiment II using Skip-gram: number of terms classified as TV (Term Variant); PTV (Partial TV); and NTV (non TV) by raters A and B.

Domain Expert    Term Variant    Partial TV    Non Term Variant
Rater A              194            240              610
Rater B              161            238              645

Table 2 summarises the number of terms classified as TV, PTV, and NTV by rater A in experiment I using CBOW and Skip-gram. It is easy to derive from Table 2 that Skip-gram is better suited for the task of finding suitable (full and/or partial) term variants for gene/protein names. The observed agreement (i.e. the portion of terms classified as TV, PTV, or NTV on which the two domain experts agree) for experiment I with Skip-gram was 0.80 using the weighted Cohen's Kappa measure [51]. Table 3 summarises the number of terms classified as TV, PTV, and NTV by raters A and B in experiment II using Skip-gram.

To illustrate the results obtained qualitatively, Table 4 (right column) shows the terms marked (TV, PTV, or NTV) by rater B in experiment II using Skip-gram for OLR1, which is a gene symbol. From experiment I, also using Skip-gram, no suitable term variants were found. It should be noted that some of the candidate terms listed in Table 4 are well-known aliases of the gene symbol, such as LOX-1.

Table 4. Experiment II using Skip-gram: 12 top-ranked nearest neighbours by cosine similarity marked by rater B (TV, PTV, and NTV) for the query terms “oxidized_low-density_lipoprotein_receptor_1” and “OLR1”.

Term                                              Cosine similarity
lectin-like_oxidized_low-density_lipoprotein      0.688603 - TV
(LOX-1)_is                                        0.672042 - PTV
atherosclerosis_we_investigated                   0.669050 - NTV
receptor-1                                        0.664891 - NTV
lectin-like_oxidized_LDL_receptor-1               0.663988 - TV
lOX-1_is                                          0.660110 - NTV
human_atherosclerotic_lesions                     0.657075 - NTV
oxidized_low-density_lipoprotein_(ox-LDL)         0.655515 - NTV
oxidized_low-density_lipoprotein_(oxLDL)          0.654965 - PTV
(LOX-1)                                           0.652099 - TV
proatherosclerotic                                0.651571 - NTV
receptor-1_(LOX-1)_is                             0.649000 - PTV

We observed that some of the protein names annotated from the sysVASC systematic review papers, like “annexin 4”, cannot produce a suitable term variant, as they do not appear as such in the lexicon generated by CBOW and Skip-gram. However, by enriching them with terms from the CVDO ontology, it is feasible to obtain suitable term variants. For example, “annexin 4” can be mapped to the full protein name “Annexin A4”, which has UniProt Accession number P09525 and gene symbol ANXA4. Indeed, within the level of observed agreement between the two domain experts, we can safely say that in the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100. Hence, only 7 out of the total 107 remain without suitable (full and/or partial) term variants.

We also observed that the median rank (i.e. position in the list of 12 top-ranked terms) of a TV agreed by raters A and B is 3 in both experiments I and II using Skip-gram. In other words, within the level of observed agreement, a TV is likely to appear in the first three positions of the 12 top-ranked terms.


5    DISCUSSION

CBOW and Skip-gram have become the state of the art for generating word embeddings. From a quantitative point of view, this study shows that using Skip-gram the number of term variants (TV and/or PTV) for proteins/genes is substantially increased in comparison with CBOW. For experiment II, i.e. when the terms annotated from the sysVASC systematic review papers are enriched/expanded with terms from the CVD ontology, the number of suitable (full and/or partial) term variants for genes/proteins increases. The explanation seems quite straightforward, as the Skip-gram model takes the word window as a context and predicts the surrounding words given the current word [14]. With the aid of the
for experiment II using Skip-gram. The inter-annotator agreement           CVD ontology, we can get terms that provide a more pertinent
was 0.74 using the weighted Cohen’s Kappa measure [51].                    context by: a) enriching a gene symbol with parts of the protein






name; or b) including more than one term related to a protein name. Hence the better results, which in our case means more term variants.
   Detecting term variants can be useful for a variety of curation and annotation tasks. As for both experiments I and II the agreed TVs are likely to appear in the first three positions of the 12 top-ranked terms, this finding can be the basis of a systematic approach to obtaining plausible term variants for the 258,913 UniProtKB protein classes from the CVD ontology. As we proposed in section 3.4, plausible term variants from word embeddings can easily be stored in the CVD ontology by means of the annotation property skos:hiddenLabel. Therefore, when querying the CVD ontology with the query language SPARQL 1.1 [52] for a protein that may appear in the biomedical literature, it is possible to use both rdfs:label and skos:hiddenLabel. However, the terms stored in skos:hiddenLabel are more likely to give pertinent results because they are derived from the word embeddings obtained from the 14 million PubMed citations (titles and abstracts), i.e. from the biomedical literature itself. Furthermore, having transformed PubMed citations into RDF datasets, it is feasible to annotate a PubMed citation not only with MeSH headings/descriptors or authors’ keywords, but also with terms from the lexicon generated by CBOW and Skip-gram. Thus, it is possible to envision more sophisticated SPARQL 1.1 SELECT queries that retrieve the PubMed citations themselves. Moreover, from a computational point of view the process described here is affordable and sustainable: new PubMed citations can be converted into RDF on a daily basis; Skip-gram can re-generate the lexicon and the vectors for 14 million PubMed citations (titles and abstracts) in less than a day; and terms from the human-readable labels of key classes (genes/proteins) in the CVDO ontology can be used as query terms to retrieve the top-ranked terms from the re-generated word embeddings, where the three top-ranked terms (plausible term variants) can be stored as literal values of skos:hiddenLabel. Hence, periodic updates are feasible.
   Although text mining technology has made great strides in extracting biomedical terminology from unstructured text sources, the task of normalising (grounding) the extracted terms to commonly used identifiers in ontologies or taxonomies is still quite demanding. Identifying equivalent text realisations of the same biomedical concept can be useful for (i) improving the quality of information in curated resources such as UniProt or the Gene Ontology, and (ii) linking the information in these resources back to the original text sources; the latter is helpful when a greater context needs to be explored or for keeping up to date with the published literature.
   Another potential application is in the area of query expansion for Information Retrieval (IR). Although query enhancement using synonyms is commonly deployed by many of today’s IR systems, it is often more difficult to deal with orthographic variations or with new acronyms/abbreviations introduced for new terms. Identifying term variants can be a way of ameliorating the classical IR problem of returning either too much or too little for a user query.
   Lastly, text mining developers, especially those building rule-based systems, can benefit from unsupervised automated techniques such as the one described in this paper for building terminological resources from large untagged corpora. Such resources include both terminology lexica and grammars, either manually developed or compiled via grammar induction techniques. The usefulness of this approach for specific annotation tasks will be the subject of future work.
   From a research perspective, data integration is not a new challenge in the life sciences. Gomez-Cabrero et al. [53] state: “there is a need for improved (and novel) annotation standards and requirements in data repositories to enable better integration and reuse of publically available data”. To the best of our knowledge, this is the first time that word embeddings from Deep Learning and an ontology in OWL have been put together with the aim of linking ontology classes to terms derived from a large corpus of biomedical literature in an unsupervised way and without the need to have the corpus annotated.

6    CONCLUSION

This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus of more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.

ACKNOWLEDGEMENTS

To Prof Iain Buchan and Stephen Walker for useful discussions; and to Timothy Furmston for helping with the software and e-infrastructure.
Funding: This work was supported by a grant from the European Union Seventh Framework Programme (FP7/2007-2013) for the sysVASC project under grant agreement number 603288.

REFERENCES

[1] World Health Organization – Cardiovascular diseases (CVDs). Available at http://www.who.int/cardiovascular_diseases/en/. Accessed 16 June 2016.
[2] sysVASC project, http://cordis.europa.eu/project/rcn/111200_en.html. Accessed 16 June 2016.
[3] PubMed – Detailed Indexing Statistics: 1965-2015, https://www.nlm.nih.gov/bsd/index_stats_comp.html. Accessed 16 June 2016.
[4] Ely, J.W., Osheroff, J.A., Ebell, M.H., Chambliss, M.L., Vinson, D.C., Stevermer, J.J. and Pifer, E.A.: Obstacles to answering doctors’ questions about patient care with evidence: qualitative study. BMJ, 324(7339), p. 710 (2002).
[5] Sarker, A., Mollá, D. and Paris, C.: Automatic evidence quality prediction to support evidence-based decision making. Artificial Intelligence in Medicine, 64(2), pp. 89-103 (2015).
[6] Hristovski, D., Dinevski, D., Kastrin, A. and Rindflesch, T.C.: Biomedical question answering using semantic relations. BMC Bioinformatics, 16(1), p. 1 (2015).
[7] Tanabe, L. and Wilbur, W.J.: Tagging gene and protein names in biomedical text. Bioinformatics, 18(8), pp. 1124-1132 (2002).
[8] Garten, Y., Coulet, A. and Altman, R.B.: Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics, 11(10), pp. 1467-1489 (2010).






[9] Krauthammer, M. and Nenadic, G.: Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6), pp. 512-526 (2004).
[10] Ananiadou, S., Kell, D.B. and Tsujii, J.I.: Text mining and its potential applications in systems biology. Trends in Biotechnology, 24(12), pp. 571-579 (2006).
[11] Federiuk, C.S.: The effect of abbreviations on MEDLINE searching. Academic Emergency Medicine, 6(4), pp. 292-296 (1999).
[12] Wren, J.D., Chang, J.T., Pustejovsky, J., Adar, E., Garner, H.R. and Altman, R.B.: Biomedical term mapping databases. Nucleic Acids Research, 33(suppl 1), pp. D289-D293 (2005).
[13] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J.: Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119 (2013).
[14] Mikolov, T., Chen, K., Corrado, G. and Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[15] Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A. and Lutz, C.: OWL 2 Web Ontology Language: Profiles. W3C Recommendation, 27, 61 (2009).
[16] Studer, R., Benjamins, V.R. and Fensel, D.: Knowledge engineering: principles and methods. Data & Knowledge Engineering, 25(1), pp. 161-197 (1998).
[17] Bengio, Y. and Lee, H.: Editorial introduction to the Neural Networks special issue on Deep Learning of Representations. Neural Networks: the official journal of the International Neural Network Society (2014).
[18] Landauer, T.K. and Dumais, S.T.: A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211 (1997).
[19] Lund, K. and Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), pp. 203-208 (1996).
[20] Kanerva, P., Kristofersson, J. and Holst, A.: Random indexing of text samples for latent semantic analysis. In Proc. of the Cognitive Science Society (Vol. 1036). Mahwah, NJ: Erlbaum (2000).
[21] Hofmann, T.: Probabilistic latent semantic indexing. In Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57. ACM (1999).
[22] Blei, D.M., Ng, A.Y. and Jordan, M.I.: Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, pp. 993-1022 (2003).
[23] Cohen, T. and Widdows, D.: Empirical distributional semantics: methods and biomedical applications. Journal of Biomedical Informatics, 42(2), pp. 390-405 (2009).
[24] Jonnalagadda, S., Leaman, R., Cohen, T. and Gonzalez, G.: A distributional semantics approach to simultaneous recognition of multiple classes of named entities. In Computational Linguistics and Intelligent Text Processing, pp. 224-235. Springer Berlin Heidelberg (2010).
[25] Neelakantan, A., Shankar, J., Passos, A. and McCallum, A.: Efficient non-parametric estimation of multiple embeddings per word in vector space. EMNLP 2014, pp. 1059-1069 (2014).
[26] Hu, B., Tang, B., Chen, Q. and Kang, L.: A novel word embedding learning model using the dissociation between nouns and verbs. Neurocomputing, 171, pp. 1108-1117 (2016).
[27] Pyysalo, S., Ginter, F., Moen, H., Salakoski, T. and Ananiadou, S.: Distributional semantics resources for biomedical text processing. In Proc. of Languages in Biology and Medicine (2013).
[28] Minarro-Giménez, J.A., Marín-Alonso, O. and Samwald, M.: Exploring the application of deep learning techniques on medical text corpora. Studies in Health Technology and Informatics, 205, pp. 584-588 (2013).
[29] Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. and Martin, M.J.: UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1), pp. D115-D119 (2004).
[30] Griffiths-Jones, S., Grocock, R.J., Van Dongen, S., Bateman, A. and Enright, A.J.: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1), pp. D140-D144 (2006).
[31] Wishart, D.S., Tzur, D., Knox, C., Eisner, R., Guo, A.C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S. and Fung, C.: HMDB: the Human Metabolome Database. Nucleic Acids Research, 35(suppl 1), pp. D521-D526 (2007).
[32] OBI, http://www.obofoundry.org/ontology/obi.html
[33] PRO, http://www.obofoundry.org/ontology/pr.html
[34] SO, http://www.obofoundry.org/ontology/so.html
[35] GO, http://www.obofoundry.org/ontology/go.html
[36] ChEBI, http://www.obofoundry.org/ontology/chebi.html
[37] NCBI, http://www.obofoundry.org/ontology/ncbitaxon.html
[38] CL, http://www.obofoundry.org/ontology/cl.html
[39] UBERON, http://www.obofoundry.org/ontology/uberon.html
[40] PATO, http://www.obofoundry.org/ontology/pato.html
[41] RO, http://www.obofoundry.org/ontology/ro.html
[42] Arguello Casteleiro, M., Klein, J. and Stevens, R.: The Proteasix Ontology. Journal of Biomedical Semantics, 7(1) (2016).
[43] Klyne, G. and Carroll, J.J.: Resource Description Framework (RDF): Concepts and abstract syntax (2006).
[44] MEDLINE/PubMed XML Data Elements, https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html
[45] DCMI, http://dublincore.org/schemas/rdfs/
[46] Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C.: A neural probabilistic language model. The Journal of Machine Learning Research, 3, pp. 1137-1155 (2003).
[47] word2vec, http://code.google.com/p/word2vec/
[48] Rehurek, R. and Sojka, P.: Software framework for topic modelling with large corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010).
[49] Miles, A. and Bechhofer, S.: SKOS Simple Knowledge Organization System reference. W3C Recommendation, 18, W3C (2009).
[50] Nenadic, G., Ananiadou, S. and McNaught, J.: Enhancing automatic term recognition through recognition of variation. In Proc. of Computational Linguistics, p. 604. Association for Computational Linguistics (2004).
[51] Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, pp. 37-46 (1960).
[52] Harris, S., Seaborne, A. and Prud’hommeaux, E.: SPARQL 1.1 Query Language. W3C Recommendation, 21 (2013).
[53] Gomez-Cabrero, D., Abugessaisa, I., Maier, D., Teschendorff, A., Merkenschlager, M., Gisel, A., Ballestar, E., Bongcam-Rudloff, E., Conesa, A. and Tegnér, J.: Data integration in the era of omics: current and future challenges. BMC Systems Biology, 8(2), p. 1 (2014).


