=Paper=
{{Paper
|id=Vol-1692/paperA
|storemode=property
|title=Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations
|pdfUrl=https://ceur-ws.org/Vol-1692/paperA.pdf
|volume=Vol-1692
|authors=Mercedes Argüello Casteleiro,George Demetriou,Warren J. Read,Maria Jesus Fernandez Prieto,Diego Maseda-Fernandez,Goran Nenadic,Julie Klein,John A. Keane,Robert Stevens
|dblpUrl=https://dblp.org/rec/conf/odls/CasteleiroDRPMN16
}}
==Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations==
M. Arguello Casteleiro (1), G. Demetriou (1), W. Read (1), M.J. Fernandez-Prieto (2), D. Maseda-Fernandez (3), G. Nenadic (1,4), J. Klein (5), J. Keane (1,4) and R. Stevens (1)

(1) School of Computer Science, University of Manchester, UK
(2) School of Languages, University of Salford, UK
(3) Midcheshire Hospital Foundation Trust, NHS England, UK
(4) Manchester Institute of Biotechnology, University of Manchester, UK
(5) Institut National de la Santé et de la Recherche Médicale (INSERM) U1048, Toulouse, France
ABSTRACT

Background: Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. As of today, this is a challenging task related to information retrieval, and in the realm of Big Data Analytics.

Objectives: To investigate the feasibility of using word embeddings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology (CVDO) as a step to identifying omics information encoded in the biomedical literature.

Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. Then the abstracts of selected papers from the sysVASC systematic review were manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second experiment enriched/expanded the annotated terms using terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that by enriching the original annotated terms, a better context is provided, and therefore, it is easier to obtain suitable (full and/or partial) term variants for gene/protein names from word embeddings.

Results: From the papers manually annotated, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from the word embeddings for query terms present in the generated lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as: TV (term variant); PTV (partial term variant); and NTV (non term variant, meaning none of the previous two categories). In experiment I, Skip-gram found double the number of (full and/or partial) term variants for gene/protein names compared with CBOW. Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for two domain experts was 0.80 for the first experiment and 0.74 for the second experiment. In the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100.

Conclusion: This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus with more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.

* Contact: robert.stevens@manchester.ac.uk

1 INTRODUCTION

According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally [1]. The sysVASC project [2] seeks to provide a comprehensive systems medicine approach to elucidate pathological mechanisms for CVD, which will yield molecular targets for therapeutic intervention. To achieve this aim it is necessary to gather and integrate data from omics (e.g. genomics, transcriptomics, proteomics and metabolomics) experiments.

The CVD ontology (CVDO) is developed as part of the sysVASC project to provide the infrastructure to integrate omics data that encapsulate findings published in the scientific literature. The CVDO ontology has 172,121 UniProtKB protein classes related to human, and 86,792 UniProtKB protein classes related to mouse. Of these, so far a total of only 8,196 UniProtKB protein classes (i.e. reviewed Swiss-Prot; unreviewed TrEMBL; along with Isoform sequences) from mouse and human have been identified as of potential interest to the sysVASC project. An important part of the manual curation effort is to tie experimental findings to the biomedical scientific literature. However, even a project like sysVASC cannot afford to have a team of researchers or curators who can survey the literature regularly and deal with the fundamental task of identifying gene and protein names as a preliminary step to identify the omics information encoded in the biomedical text.

PubMed queries were the starting point of a systematic literature review performed for sysVASC to obtain the omics studies that underpin CVDO and the CVD Knowledge Base (CVDKB). PubMed is a database from the U.S. National Library of Medicine
(NLM) with millions of citations from MEDLINE, life science journals, and online books. In June 2016, PubMed contained 26 million citations, with an average of 1.5 papers added per minute [3]. Keeping CVDKB up-to-date is a challenge shared with systematic reviews that aim to keep updated with the best evidence reported in the literature. As of today, searching through biomedical literature and appraising information from relevant documents is extremely time consuming [4,5,6]. Furthermore, omics is a demanding area, where the irregularities and ambiguities in gene and protein nomenclature remain a challenge [7,8]. Krauthammer and Nenadic [9] highlight: "successful term identification is key to getting access to the stored literature information, as it is the terms (and their relationships) that convey knowledge across scientific articles". The identification of biological entities in the field of systems biology has proven difficult due to term variation and term ambiguity [10], because a concept can be expressed by various realisations (a.k.a. term variants). A large-scale database such as MEDLINE/PubMed contains longer words and phrases (e.g. "serum amyloid A-1 protein") as well as shorter forms like abbreviations or acronyms (e.g. "SAA"). Finding all the term variants in text is important for improving the results of information retrieval (IR) systems like the PubMed search engine, which traditionally rely on keyword-based approaches. Therefore, the number of documents retrieved is prone to change when using acronyms instead of and/or in combination with full terms [11,12].

This paper investigates the feasibility of using Deep Learning, an emerging area of artificial neural networks, for identifying gene and protein names of interest for sysVASC in biomedical text. More specifically, we propose to use the two neural language models Skip-gram and CBOW (Continuous Bag-of-Words) of Mikolov et al. [13,14] to produce word embeddings, which are distributed word representations typically induced using neural language models. These word embeddings can be traced back to PubMed citations, and can also be linked to the CVDO classes formalised in the CVD Ontology represented in the W3C Web Ontology Language (OWL) [15].

2 APPROACH

In terms of information/knowledge extraction from texts, over the years, the knowledge engineering (KE) [16] paradigm has lost popularity in favour of the machine learning (ML) paradigm. ML algorithms learn input-output relations from examples with the goal of interpreting new inputs; therefore, the performance of ML methods is heavily dependent on the choice of data representation (or features) to which they are applied [17]. Representing words as continuous vectors has a long history, where different types of models have been proposed to estimate continuous representations of words and create distributional semantic models (DSMs). DSMs derive representations for words in such a way that words occurring in similar contexts will have similar representations, and therefore, the context needs to be defined. Some examples of context in DSMs include: Latent Semantic Analysis (LSA) [18], which generally uses an entire document as a context (i.e. word-document models), and Hyperspace Analog to Language (HAL) [19], which uses a sliding word window as a context (i.e. sliding window models). More recently, Random Indexing [20] has emerged as a promising alternative to LSA. LSA, HAL, and Random Indexing are spatially motivated DSMs. Examples of probabilistic DSMs are Probabilistic LSA (PLSA) [21] and Latent Dirichlet Allocation (LDA) [22]. While spatial DSMs compare terms using distance metrics in high-dimensional space [23], probabilistic DSMs measure similarity between terms according to the degree to which they share the same topic distributions [23]. Most DSMs have a high computational and storage cost associated with building the model or modifying it, due to the huge number of dimensions involved when a large corpus is modelled [29]. Although neural models are not new in DSMs, recent advances in artificial neural networks (ANNs) make feasible the derivation of words from corpora of billions of words: hence the growing interest in Deep Learning and the neural language models CBOW and Skip-gram of Mikolov et al. [13,14].

In a relatively short time, CBOW and Skip-gram have gained popularity to the point of being used for benchmarking word embeddings [25] or as baseline models for performance comparisons [26]. We propose applying the Mikolov et al. [13,14] neural language models, which can be trained to produce high-quality word embeddings on English Wikipedia [25], to automatically extract terms (gene and protein nomenclature) from 14,056,761 free-text unannotated MEDLINE/PubMed citations (title and abstract). Our hypothesis is that word embeddings of high quality should generate useful lists of term variants. As of today, the application of the Mikolov et al. [13,14] CBOW and Skip-gram to the biomedical literature remains largely unexplored, with only some pioneering work [27,28].

3 METHODS

3.1 The CVD Ontology and its Knowledge Base

The CVD ontology (CVDO) provides the infrastructure to integrate the omics data from multiple biological resources, such as the UniProt Knowledgebase (UniProtKB) [29], miRBase [30] from EMBL-EBI, and the Human Metabolome Database (HMDB) [31]. At the core of CVDO is the Ontology for Biomedical Investigations (OBI) [32], along with other reference ontologies produced by the OBO Consortium, such as the Protein Ontology (PRO) [33], the Sequence Ontology (SO) [34], the three Gene Ontology (GO) sub-ontologies [35], the Chemical Entities of Biological Interest Ontology (ChEBI) [36], the NCBI Taxonomy Ontology [37], the Cell Ontology (CL) [38], the Uber Anatomy Ontology (UBERON) [39], the Phenotypic Quality Ontology (PATO) [40], and the Relationship Ontology (RO) [41].

In terms of knowledge modelling, CVDO shares the protein/gene representation used in the Proteasix Ontology (PxO) [42].

3.2 PubMed: from XML to RDF datasets

Through the ftp server of the U.S. NLM we downloaded the MEDLINE/PubMed baseline files for 2015, and also the update files up to 8th June 2016. We created a processing pipeline written in Python that allows the conversion of the downloaded PubMed XML files into W3C Resource Description Framework (RDF) [43] datasets. This pipeline can also be reused to process the results of PubMed searches.

We performed a mapping between the PubMed XML elements [44] and terms from the Dublin Core Metadata Initiative (DCMI), which has been taken up globally and has a publicly available RDF Schema [45].

When pre-processing the textual input for the Mikolov et al. [13,14] CBOW and Skip-gram, it is common practice systematically to
lower-case the text and to remove all numbers. However, this is unsuitable when dealing with protein/gene names, because critical information will be lost. To further illustrate this: for humans, non-human primates, chickens, and domestic species, gene symbols contain three to six alphanumeric characters that are all in uppercase (e.g. OLR1), while for mice and rats only the first letter is in uppercase (e.g. Olr1). Therefore, we have introduced some ad hoc rules as part of the pre-processing to guarantee that protein/gene names are preserved.

3.3 Deep Learning: word embeddings using word2vec

This study looks at neural language models, i.e. distributed representations of words learnt by artificial neural networks (ANNs). We adopted the new log-linear models that try to minimise computational complexity. The CBOW and Skip-gram model architecture [13,14] is similar to the probabilistic feedforward neural network language model (NNLM). The feedforward NNLM proposed by Bengio et al. [46] consists of input, projection, hidden, and output layers. In the CBOW and Skip-gram model architecture, the non-linear hidden layer is removed and the projection layer is shared for all the words, so all words are projected into the same position (their vectors are averaged) [14]. The Skip-gram model uses the current word to predict surrounding words, while the CBOW model predicts the current word based on the context.

The basic Skip-gram formulation uses the softmax function [14]. The hierarchical softmax is a computationally efficient approximation of the full softmax. If W is the number of words in the lexicon, hierarchical softmax only needs to evaluate about log2(W) output nodes to obtain the probability distribution, instead of needing to evaluate W output nodes.

word2vec [47] is the software package used in this study. It was initially released as open software and is faster than its Python counterpart implementation from Gensim [48]. Using word2vec, with either CBOW or Skip-gram and hierarchical softmax, we obtain: 1) a lexicon (i.e. a list of terms, typically multi-words) in textual format that is constructed from the input data; and 2) the resulting vectors of the neural DSM in binary mode. In distributional semantics a well-known similarity measure is cosine similarity, i.e. the cosine of the angle between two vectors of n dimensions. If the cosine is close to zero, the two vectors are considered dissimilar, while if it is close to one, this indicates a high similarity between the two vectors.

3.4 Integrating CVDO and word embeddings

The terms from the word embeddings lexicon can be traced back to PubMed citations. Among these terms, there are suitable (full and/or partial) term variants for gene/protein names that can also be linked to the CVDO classes in the CVD ontology. To perform the linkage between word embedding terms and CVDO classes, we looked at the Simple Knowledge Organization System (SKOS) [49], which is a W3C standard aimed at leveraging the power of linked data. In SKOS there are three properties to attach lexical labels to conceptual resources [49]: 1) the preferred lexical label (i.e. skos:prefLabel); 2) the alternative lexical label (i.e. skos:altLabel) for synonyms and acronyms; and 3) the hidden lexical label (i.e. skos:hiddenLabel) for including misspelled variants of other lexical variants, or a string for text-based indexing. All of these can be considered annotation properties (i.e. owl:AnnotationProperty), and allow limited linguistic information only. In this study, we propose using skos:hiddenLabel to store plausible term variants derived from word embeddings for gene and protein classes from the CVD ontology.

3.5 Experimental setup

A gold standard was created using 25 papers that meet the inclusion and exclusion criteria of the sysVASC systematic review performed. The original PubMed query was: "coronary heart disease AND (proteomics OR proteome OR transcriptomics OR transcriptome OR metabolomics OR metabolome OR omics)". Out of all the paper abstracts, a total of 107 terms were manually annotated as protein/gene names. Each term was mapped to a CVD ontology class to uniquely identify the conceptual entity (gene/protein) to which the annotated term refers. This can be seen as a term standardisation process. Table 1 illustrates the mapping performed.

Table 1. Example of terms from PubMed abstract/title (left column) mapped to labels for protein/gene from the CVD ontology (right column).

Term                                  | (UniProt AC) protein name [gene symbol]
alpha-1-antitrypsin; α(1)-antitrypsin | (P01009) Alpha-1-antitrypsin [SERPINA1]
annexin 4                             | (P09525) Annexin A4 [ANXA4]
superoxide dismutase 3                | (P08294) Extracellular superoxide dismutase [Cu-Zn] [SOD3]

The few examples from Table 1 show the lack of standardisation in the field, and illustrate some of the alternative terms from the literature that refer to the conceptual entities (gene/protein) of interest.

In this study we conducted two experiments:

• Experiment I – we use the annotated terms from selected papers of the sysVASC systematic review alone (as they appear in the paper abstracts/titles) to obtain the list of 12 top-ranked terms (highest cosine value) from the CBOW and Skip-gram word embeddings. These are candidate term variants.
• Experiment II – we enriched/expanded the original annotated terms with terms that appear in the CVDO classes, and again produced a list of 12 top-ranked candidate term variants from the CBOW and Skip-gram word embeddings.

3.6 Human assessment

Domain experts assessed all the lists of 12 top-ranked candidate term variants obtained for experiments I and II using CBOW and Skip-gram.

3.6.1 Evaluation guidelines. We established a strict criterion to mark the list of candidate terms produced by the word2vec word embeddings. Following Nenadic et al. [50], a candidate term was marked as a term variant (TV for short) only when the term fell within the following types of term variation: a) orthographic; b) morphological; c) lexical; d) structural; or e) acronyms and abbreviations. Considering the biomedical value of phraseological expressions (e.g. "ankyrin-B_gene" or "CBS_deficiency"), we marked them as partial term variants (PTV for short); however, they had to refer to the same biomedical concept, i.e. protein or gene name.
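The kind of case-preserving pre-processing rule described in Section 3.2 can be sketched as follows. This is a simplified heuristic of our own for illustration, not the paper's actual ad hoc rules, which are not published in full:

```python
def normalise_token(tok: str) -> str:
    # Common word2vec pre-processing lower-cases every token; here we keep
    # tokens that look like gene symbols untouched, since case is meaningful
    # in gene nomenclature (e.g. human OLR1 vs mouse/rat Olr1).
    # Simplified heuristic only -- an assumption, not the pipeline's code.
    looks_like_symbol = (
        3 <= len(tok) <= 6
        and tok.isalnum()
        and tok[0].isupper()
        and any(ch.isdigit() or ch.isupper() for ch in tok[1:])
    )
    return tok if looks_like_symbol else tok.lower()
```

Under this sketch, `normalise_token("OLR1")` and `normalise_token("Olr1")` leave the gene symbols intact, while an ordinary capitalised word such as "Serum" is still lower-cased.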
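The cosine-similarity ranking used in Section 3.3 to retrieve the 12 top-ranked terms can be sketched in a few lines of Python. The vectors below are made-up 3-dimensional toy data; the real vectors come from the word2vec binary output and have far more dimensions:

```python
import math

def cosine(u, v):
    # cosine of the angle between two n-dimensional vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_ranked(query, vectors, n=12):
    # rank every other lexicon entry by cosine similarity to the query term
    q = vectors[query]
    scored = [(term, cosine(q, vec)) for term, vec in vectors.items() if term != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# toy "embeddings" for illustration only
vectors = {
    "OLR1":    [0.90, 0.10, 0.00],
    "LOX-1":   [0.85, 0.15, 0.05],
    "insulin": [0.00, 0.20, 0.90],
}
top = top_ranked("OLR1", vectors, n=2)  # LOX-1 ranks first for this toy data
```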
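The look-up proposed in Section 3.4 — matching a class by its hidden labels as well as its visible label — can be sketched as a SPARQL 1.1 query, shown here as a Python string for illustration. The use of rdfs:label for the visible label is an assumption on our part; the exact labelling properties of CVDO are not spelled out here:

```python
# Sketch of a SPARQL 1.1 query over an ontology that stores plausible term
# variants in skos:hiddenLabel. Matching on rdfs:label is an assumption;
# this is not the project's actual query.
PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
)

def label_query(term: str) -> str:
    # Find classes whose visible label or hidden label matches a term
    # seen in the literature (case-insensitively).
    return PREFIXES + f"""SELECT ?cls ?label WHERE {{
  {{ ?cls rdfs:label ?label }} UNION {{ ?cls skos:hiddenLabel ?label }}
  FILTER (lcase(str(?label)) = lcase("{term}"))
}}"""
```

For example, `label_query("LOX-1")` produces a query that would retrieve a protein class even when "LOX-1" is only stored as a hidden-label term variant.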
3.6.2 Inter-annotator agreement. Two domain experts following the above-mentioned evaluation guidelines assigned a simple key code to each candidate term variant: TV, PTV, or NTV (non term variant, meaning none of the previous two categories). The inter-annotator agreement is based on the Kappa measure [51], widely used for inter-annotator agreement on classification tasks; Kappa K is defined as K = (Pr(a) – Pr(e))/(1 – Pr(e)), where Pr(a) is the relative observed agreement between annotators, and Pr(e) is the chance agreement.

4 RESULTS

Using a VM with 100 GB RAM and 32 CPU(s) at 4.0 GHz, we obtained the word embeddings from the unannotated PubMed corpus of 14,056,761 free-text MEDLINE/PubMed citations (title and abstract) for Skip-gram (much slower than CBOW) after 17 hours of processing. Due to the lack of space, we show here only some of the results obtained.

Both CBOW and Skip-gram used the same input and generated the same lexicon; however, the resulting vectors of the neural DSM in binary mode were different. Hence, the 12 top-ranked terms for an input query term are likely to differ. For experiment I, only 77 of the 107 terms belong to the generated lexicon. For experiment II, as the CVD ontology is used to provide more context for each term, only 3 terms out of the 107 remained without a valid entry in the lexicon. For experiment I, two domain experts (raters A and B) assessed the 924 pairs of terms corresponding to 77 query terms. For experiment II, there were 87 query terms, which include terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology, considering multiple alternatives; thus, the same two domain experts assessed 1044 pairs of terms.

Table 2. Experiment I: number of terms classified as TV (Term Variant); PTV (Partial TV); and NTV (non TV) by rater A.

Model     | Term Variant | Partial TV | Non Term Variant
CBOW      | 77           | 93         | 754
Skip-gram | 151          | 194        | 579

Table 3. Experiment II using Skip-gram: number of terms classified as TV (Term Variant); PTV (Partial TV); and NTV (non TV) by raters A and B.

Domain Expert | Term Variant | Partial TV | Non Term Variant
Rater A       | 194          | 240        | 610
Rater B       | 161          | 238        | 645

Table 2 summarises the number of terms classified as TV, PTV, and NTV by rater A in experiment I using CBOW and Skip-gram. It is easy to see from Table 2 that Skip-gram is better suited to the task of finding suitable (full and/or partial) term variants for gene/protein names. The observed agreement (i.e. the proportion of terms classified as TV, PTV, or NTV on which the two domain experts agree) for experiment I with Skip-gram was 0.80 using the weighted Cohen's Kappa measure [51]. Table 3 summarises the number of terms classified as TV, PTV, and NTV by raters A and B for experiment II using Skip-gram. The inter-annotator agreement was 0.74 using the weighted Cohen's Kappa measure [51].

To illustrate the results qualitatively, Table 4 shows the terms marked (TV, PTV, and NTV) by rater B in experiment II using Skip-gram for OLR1, which is a gene symbol. From experiment I, also using Skip-gram, no suitable term variants were found. It should be noted that some of the candidate terms listed in Table 4 are well-known aliases of the gene symbol, such as LOX-1.

Table 4. Experiment II using Skip-gram: 12 top-ranked nearest neighbours by cosine similarity marked by rater B (TV, PTV, and NTV) for the query terms "oxidized_low-density_lipoprotein receptor_1" "OLR1".

Term                                         | Cosine similarity | Mark
lectin-like_oxidized_low-density_lipoprotein | 0.688603          | TV
(LOX-1)_is                                   | 0.672042          | PTV
atherosclerosis_we_investigated              | 0.669050          | NTV
receptor-1                                   | 0.664891          | NTV
lectin-like_oxidized_LDL_receptor-1          | 0.663988          | TV
lOX-1_is                                     | 0.660110          | NTV
human_atherosclerotic_lesions                | 0.657075          | NTV
oxidized_low-density_lipoprotein_(ox-LDL)    | 0.655515          | NTV
oxidized_low-density_lipoprotein_(oxLDL)     | 0.654965          | PTV
(LOX-1)                                      | 0.652099          | TV
proatherosclerotic                           | 0.651571          | NTV
receptor-1_(LOX-1)_is                        | 0.649000          | PTV

We observed that some of the protein names annotated from the sysVASC systematic review papers, like "annexin 4", cannot produce a suitable term variant, as they do not appear as such in the lexicon generated by CBOW and Skip-gram. However, by enriching them with terms from the CVDO ontology, it is feasible to obtain suitable term variants. For example, "annexin 4" can be mapped to the full protein name "Annexin A4", which has UniProt Accession number P09525 and gene symbol ANXA4. Indeed, within the level of observed agreement among the two domain experts, we can safely say that in the first experiment, suitable (full and/or partial) term variants were found for 65 of the 107 terms. In the second experiment, the number increased to 100. Hence, only 7 out of the total 107 remain without suitable (full and/or partial) term variants.

We also observed that the median of the rank (i.e. position in the list of 12 top-ranked terms) for a TV agreed by raters A and B is 3 in both experiments I and II using Skip-gram. In other words, within the level of observed agreement, a TV is likely to appear in the first three positions of the 12 top-ranked terms.

5 DISCUSSION

CBOW and Skip-gram have become the state of the art for generating word embeddings. From a quantitative point of view, this study shows that using Skip-gram the number of term variants (TV and/or PTV) for proteins/genes is substantially increased in comparison with CBOW. For experiment II, i.e. when the terms annotated from the sysVASC systematic review papers are enriched/expanded with terms from the CVD ontology, the number of suitable (full and/or partial) term variants for genes/proteins increases. The explanation seems quite straightforward, as the Skip-gram model takes the word window as a context and predicts surrounding words given the current word [14]. With the aid of the CVD ontology, we can get terms that provide a more pertinent context by: a) enriching a gene symbol with parts of the protein
name; or b) including more than one term related to a protein name. Hence the better results, which in our case means more term variants.

Detecting term variants can be useful for a variety of curation and annotation tasks. As, for both experiments I and II, the observed agreed TVs are likely to appear in the first three positions of the 12 top-ranked terms, this finding can be the basis of a systematic approach to obtain plausible term variants for the 258,913 UniProtKB protein classes from the CVD ontology. As we proposed in Section 3.4, plausible term variants from word embeddings can be easily stored in the CVD ontology by means of the annotation property skos:hiddenLabel. Therefore, when querying the CVD ontology using the query language SPARQL 1.1 [52] for a protein that may appear in the biomedical literature, it is possible to use both rdfs:label and skos:hiddenLabel. However, the terms stored in the skos:hiddenLabel are more likely to give pertinent results, because they are derived from the word embeddings obtained from the 14 million PubMed citations (titles and abstracts), i.e. from the biomedical literature itself. Furthermore, by having transformed PubMed citations into RDF datasets, it is feasible to annotate a PubMed citation not only with MeSH headings/descriptors, or keywords from authors, but also with the terms from the lexicon generated by CBOW and Skip-gram. Thus, it is possible to envision more sophisticated SPARQL 1.1 SELECT queries that are able to retrieve the PubMed citations themselves. Furthermore, from a computational point of view the process described here is affordable and sustainable: new PubMed citations can be converted into RDF on a daily basis; Skip-gram can re-generate the lexicon and the vectors in less than a day for 14 million PubMed citations (titles and abstracts); and terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology can be used as query terms to retrieve the top-ranked terms from the re-generated word embeddings, where the three top-ranked terms (plausible term variants) can be stored as literal values of skos:hiddenLabel. Hence, periodical updates are feasible.

Although text mining technology has made great strides in extracting biomedical terminology from unstructured text sources, the task of normalising (grounding) the extracted terms to commonly used identifiers in ontologies or taxonomies is still quite demanding. Identifying equivalent text realisations of the same biomedical concept can be useful (i) for improving the quality of information in curated resources such as UniProt or the Gene Ontology, and (ii) for linking the information in these resources back to the original text sources; this is helpful when a greater context needs to be explored, or for keeping up-to-date with the published literature.

Another potential application is in the area of query expansion for Information Retrieval (IR). Although query enhancement using synonyms is commonly deployed by many of today's IR systems, it is often more difficult to deal with cases of orthographic variation, or when new acronyms/abbreviations are introduced for new terms. Identifying term variants can be a way of ameliorating the effect of the classical problem of IR returning either too much or too little for a user query.

Lastly, text mining developers, especially those dealing with rule-based systems, can benefit from unsupervised automated techniques such as the one described in this paper for building terminological resources from large untagged corpora. Such resources include both terminology lexica and grammars, either manually developed or compiled via grammar induction techniques. The usefulness of this approach for specific annotation tasks will be the subject of future work.

From a research perspective, data integration is not a new challenge in the life sciences. Gomez-Cabrero et al. [53] state: "there is a need for improved (and novel) annotation standards and requirements in data repositories to enable better integration and reuse of publically available data". To the best of our knowledge, this is the first time that word embeddings from Deep Learning and an ontology in OWL have been put together with the aim of linking ontology classes to terms derived from a large corpus of biomedical literature, in an unsupervised way and without the need of having the corpus annotated.

6 CONCLUSION

This study demonstrates the benefits of using terms from the CVDO ontology classes to obtain more pertinent term variants for gene/protein names from word embeddings generated from an unannotated corpus with more than 14 million PubMed citations. As the term variants are induced from the biomedical literature, they can facilitate data tagging and semantic indexing tasks. Overall, our study explores the feasibility of obtaining methods that scale when dealing with big data, and which enable automation of deep semantic analysis and markup of textual information from unannotated biomedical literature.

ACKNOWLEDGEMENTS

To Prof Iain Buchan and Stephen Walker for useful discussions; and to Timothy Furmston for helping with the software and e-infrastructure.

Funding: This work was supported by a grant from the European Union Seventh Framework Programme (FP7/2007-2013) for the sysVASC project under grant agreement number 603288.

REFERENCES

[1] World Health Organization – Cardiovascular diseases (CVDs). Available at http://www.who.int/cardiovascular_diseases/en/. Accessed 16 June 2016.
[2] sysVASC project, http://cordis.europa.eu/project/rcn/111200_en.html. Accessed 16 June 2016.
[3] PubMed – Detailed Indexing Statistics: 1965-2015, https://www.nlm.nih.gov/bsd/index_stats_comp.html. Accessed 16 June 2016.
[4] Ely, J.W., Osheroff, J.A., Ebell, M.H., Chambliss, M.L., Vinson, D.C., Stevermer, J.J. and Pifer, E.A.: Obstacles to answering doctors' questions about patient care with evidence: qualitative study. BMJ, 324(7339), p.710 (2002).
[5] Sarker, A., Mollá, D. and Paris, C.: Automatic evidence quality prediction to support evidence-based decision making. Artificial intelligence in medicine, 64(2), pp.89-103 (2015).
[6] Hristovski, D., Dinevski, D., Kastrin, A. and Rindflesch, T.C.: Biomedical question answering using semantic relations. BMC bioinformatics, 16(1), p.1 (2015).
[7] Tanabe, L. and Wilbur, W.J.: Tagging gene and protein names in biomedical text. Bioinformatics, 18(8), pp.1124-1132 (2002).
[8] Garten, Y., Coulet, A. and Altman, R.B.: Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics, 11(10), pp.1467-1489 (2010).
[9] Krauthammer, M. and Nenadic, G.: Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6), pp.512-526 (2004).
[10] Ananiadou, S., Kell, D.B. and Tsujii, J.I.: Text mining and its potential applications in systems biology. Trends in Biotechnology, 24(12), pp.571-579 (2006).
[11] Federiuk, C.S.: The effect of abbreviations on MEDLINE searching. Academic Emergency Medicine, 6(4), pp.292-296 (1999).
[12] Wren, J.D., Chang, J.T., Pustejovsky, J., Adar, E., Garner, H.R. and Altman, R.B.: Biomedical term mapping databases. Nucleic Acids Research, 33(suppl 1), pp.D289-D293 (2005).
[13] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J.: Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119 (2013).
[14] Mikolov, T., Chen, K., Corrado, G. and Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[15] Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A. and Lutz, C.: OWL 2 Web Ontology Language: Profiles. W3C Recommendation, 27, 61 (2009).
[16] Studer, R., Benjamins, V.R. and Fensel, D.: Knowledge engineering: principles and methods. Data & Knowledge Engineering, 25(1), pp.161-197 (1998).
[17] Bengio, Y. and Lee, H.: Editorial introduction to the Neural Networks special issue on Deep Learning of Representations. Neural Networks: the official journal of the International Neural Network Society (2014).
[18] Landauer, T.K. and Dumais, S.T.: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211 (1997).
[19] Lund, K. and Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), pp.203-208 (1996).
[20] Kanerva, P., Kristofersson, J. and Holst, A.: Random indexing of text samples for latent semantic analysis. In Proc. of the Cognitive Science Society (Vol. 1036). Mahwah, NJ: Erlbaum (2000).
[21] Hofmann, T.: Probabilistic latent semantic indexing. In Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 50-57 (1999).
[22] Blei, D.M., Ng, A.Y. and Jordan, M.I.: Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, pp. 993-1022 (2003).
[23] Cohen, T. and Widdows, D.: Empirical distributional semantics: methods and biomedical applications. Journal of Biomedical Informatics, 42(2), pp. 390-405 (2009).
[24] Jonnalagadda, S., Leaman, R., Cohen, T. and Gonzalez, G.: A distributional semantics approach to simultaneous recognition of multiple classes of named entities. In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, pp. 224-235 (2010).
[25] Neelakantan, A., Shankar, J., Passos, A. and McCallum, A.: Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proc. of EMNLP 2014, pp. 1059-1069 (2014).
[26] Hu, B., Tang, B., Chen, Q. and Kang, L.: A novel word embedding learning model using the dissociation between nouns and verbs. Neurocomputing, 171, pp.1108-1117 (2016).
[27] Pyysalo, S., Ginter, F., Moen, H., Salakoski, T. and Ananiadou, S.: Distributional semantics resources for biomedical text processing. In Proc. of Languages in Biology and Medicine (2013).
[28] Minarro-Giménez, J.A., Marín-Alonso, O. and Samwald, M.: Exploring the application of deep learning techniques on medical text corpora. Studies in Health Technology and Informatics, 205, pp. 584-588 (2013).
[29] Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. and Martin, M.J.: UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1), pp.D115-D119 (2004).
[30] Griffiths-Jones, S., Grocock, R.J., Van Dongen, S., Bateman, A. and Enright, A.J.: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1), pp.D140-D144 (2006).
[31] Wishart, D.S., Tzur, D., Knox, C., Eisner, R., Guo, A.C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S. and Fung, C.: HMDB: the Human Metabolome Database. Nucleic Acids Research, 35(suppl 1), pp. D521-D526 (2007).
[32] OBI, http://www.obofoundry.org/ontology/obi.html
[33] PRO, http://www.obofoundry.org/ontology/pr.html
[34] SO, http://www.obofoundry.org/ontology/so.html
[35] GO, http://www.obofoundry.org/ontology/go.html
[36] ChEBI, http://www.obofoundry.org/ontology/chebi.html
[37] NCBI, http://www.obofoundry.org/ontology/ncbitaxon.html
[38] CL, http://www.obofoundry.org/ontology/cl.html
[39] UBERON, http://www.obofoundry.org/ontology/uberon.html
[40] PATO, http://www.obofoundry.org/ontology/pato.html
[41] RO, http://www.obofoundry.org/ontology/ro.html
[42] Arguello Casteleiro, M., Klein, J. and Stevens, R.: The Proteasix Ontology. Journal of Biomedical Semantics, 7(1) (2016).
[43] Klyne, G. and Carroll, J.J.: Resource Description Framework (RDF): Concepts and abstract syntax (2006).
[44] MEDLINE/PubMed XML Data Elements, https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html
[45] DCMI, http://dublincore.org/schemas/rdfs/
[46] Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C.: A neural probabilistic language model. The Journal of Machine Learning Research, 3, pp. 1137-1155 (2003).
[47] word2vec, http://code.google.com/p/word2vec/
[48] Rehurek, R. and Sojka, P.: Software framework for topic modelling with large corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010).
[49] Miles, A. and Bechhofer, S.: SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 18, W3C (2009).
[50] Nenadic, G., Ananiadou, S. and McNaught, J.: Enhancing automatic term recognition through recognition of variation. In Proc. of Computational Linguistics (p. 604). Association for Computational Linguistics (2004).
[51] Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, pp. 37-46 (1960).
[52] Harris, S., Seaborne, A. and Prud'hommeaux, E.: SPARQL 1.1 Query Language. W3C Recommendation, 21 (2013).
[53] Gomez-Cabrero, D., Abugessaisa, I., Maier, D., Teschendorff, A., Merkenschlager, M., Gisel, A., Ballestar, E., Bongcam-Rudloff, E., Conesa, A. and Tegnér, J.: Data integration in the era of omics: current and future challenges. BMC Systems Biology, 8(2), p.1 (2014).