=Paper= {{Paper |id=Vol-2275/paper6 |storemode=property |title=Ontology-driven metadata enrichment for genomic datasets |pdfUrl=https://ceur-ws.org/Vol-2275/paper6.pdf |volume=Vol-2275 |authors=Anna Bernasconi,Arif Canakoglu,Andrea Colombo,Stefano Ceri |dblpUrl=https://dblp.org/rec/conf/swat4ls/0002CCC18 }} ==Ontology-driven metadata enrichment for genomic datasets== https://ceur-ws.org/Vol-2275/paper6.pdf
         Ontology-Driven Metadata Enrichment
                 for Genomic Datasets

 Anna Bernasconi*[0000-0001-8016-5750], Arif Canakoglu*[0000-0003-4528-6586],
          Andrea Colombo, and Stefano Ceri[0000-0003-0671-2415]

              Politecnico di Milano, Via Ponzio 34/5, 20133, Milano, Italy
            {anna.bernasconi,arif.canakoglu,stefano.ceri}@polimi.it
                          andrea55.colombo@mail.polimi.it



        Abstract. Data-driven genomic research requires accessing several repos-
        itories of genomic datasets, produced by international consortia, which
        provide open access to extremely valuable and well curated biological
        content. The associated metadata, describing experimental and biologi-
        cal conditions, are highly heterogeneous; consequently, dataset collection
        and integration is difficult: it requires data conversions and term
        matching, which need to be done by humans with biological expertise.
        In this paper, we present a method and tools for ontology-driven
        metadata enrichment. We select a few relevant features which are provided
        by most repositories, and then we comparatively evaluate several search
        services providing ontological access, eventually associating each feature
        with the specific ontologies which are most suited to describe it. We
        also provide an expert validation of the approach. The method and tools
        are deployed in a large repository of open data, which will soon be
        available to the research community.

        Keywords: Data Integration · Genomic Datasets · Metadata Annota-
        tion · Open Data · Bioinformatics.


1     Introduction
With the growth of diversity and complexity of scientific databases, the role of
metadata – describing their content and data production process – is becom-
ing more relevant. In particular, genomic computing often requires collecting
datasets from multiple heterogeneous sources; unfortunately, metadata describing
datasets across such sources are structured differently and are often incompatible
or incomplete. This raises a huge problem of data integration, which
can be solved through ontological mediation, bridging the sources and enabling
metadata interoperability.
   In this paper, we describe metadata enrichment, which is the process of
annotating existing structured metadata with ontological terms, their definitions,
synonyms, ancestors, and descendants, to instrument a semantically enriched
search of datasets linked to such metadata. Metadata enrichment is performed
at the end of a data integration procedure for data loading, cleaning and mapping
that is outside of the scope of this paper.

*   Anna Bernasconi and Arif Canakoglu contributed equally to this work.

    Metadata are converted to fit the format of a Genomic Conceptual Model
(GCM, [2]), gathering the most important properties shared between heteroge-
neous sources. GCM is centered on the item entity (representing an experimen-
tal unit stored as a file of genomic regions) and organized as a four-pointed-star
whose parts describe connected aspects about biology, technology, extraction,
and management of the item.
    Among all GCM attributes, we call “ontological” the ones that are present
in all sources and require ontological agreement, thus are worthy of enrichment.
These are: the Platform of items, i.e., the NGS platform used for sequencing,
the Ethnicity and Species of donors, i.e., the individual of the organism from
which the biological material is derived; the Disease, storing information about
the pathology investigated with the sample; the Tissue and CellLine of sam-
ples, which distinguish the kind of biological material used for the experiment;
the Technique or assay used to produce the genomic experiment (e.g., “ChIP-seq”,
“miRNA-seq”, “Genotyping Array”); the specific Feature or aspect described
by the experiment (e.g., “Copy Number Variation”, “Histone Modification”,
“Transcription Factor”); and the Target gene or protein of experiments
(e.g., “CTCF”, “MYC”).
    In this paper, we propose a metadata enrichment system, specific for genomic
datasets, with a four-fold contribution: 1. description of the existing services to
search ontologies related to biomedical content; 2. scoring and selection of the
service and ontologies most relevant for our data; 3. organization of ontological
knowledge in a well-structured taxonomy; 4. production of a tool for ontological
annotations extraction. As a first integration effort, we include three important
data sources used in the genomic research community, namely: Genomic Data
Commons (GDC, [9]), with over 310,000 files covering aspects of cancer genomics;
the Encyclopedia of DNA Elements (ENCODE, [6]), with almost 420,000 files
related to functional DNA sequences and regulatory elements controlling gene
expression; Roadmap Epigenomics Project (REP, [13]) containing around 2,000
datasets related to genetic variation.
    The paper is structured as follows. Section 2 presents our solution to the prob-
lem of selecting appropriate search services and ontologies to annotate metadata.
Section 3 describes how the enrichment procedure works and how we validated
the process. Section 4 overviews related work, and finally Section 5 concludes
the paper.


2   Search Service and Ontology Selection
First, we present the four most used and well-known ontology search services in
the literature (see Section 2.1), and how we score them (see Section 2.2) in order
to select the most appropriate search service for our project (see Section 2.3).
Next, we compare the ontologies provided by that search service, and select the
specific ontology that is most suitable to annotate values for each ontological
attribute (see Section 2.4).

2.1    Ontology Search Services

Ontological access to genomic data is well supported by several search services,
each of which integrates a large number of ontologies. Therefore,
we are initially concerned with choosing the best search service, which will then be
used within our system as a broker to the underlying ontologies. We consider four
different search services, which appear suitable for our purpose.
    BioPortal [19] is a repository of biomedical ontologies and terminologies
whose access is provided through a Web portal and Web services. We exploit its
term search service, an endpoint which takes a free text input and provides a
result in JSON format, listing a (configurable) number of annotations to
ontological terms, showing different degrees of matching with the free text. These can
be considered as possible annotations for the input text. Each term is identified
by the pair ⟨ontology, id⟩, describing the code which references the ontology
inside the BioPortal system and an identification number which references the
term inside the ontology. A term also contains a single preferred label and its
synonyms. An annotation is composed of a term and a match type: “PREF” if
the match with the term is established with the preferred label or “SYN” if the
match is with one of the term synonyms.
    Ontology Recommender [15] is a BioPortal service that receives a free
text or a list of keywords and suggests a set of ontologies appropriate for anno-
tating the indicated terms, considered all together. The structure of annotations
is identical to BioPortal’s. Additionally, Recommender provides four scores that
reflect how well the ontology (set) annotates the input data: Coverage measures
the extent to which the ontology represents the input; Acceptance indicates how
well-known and trusted the ontology is by the biomedical community; Detail
shows the level of specification provided by the ontology for the input data;
Specialization indicates how specialized the ontology is with respect to the
input data domain.
    Ontology Lookup Service (OLS, [12]) provides ontology search, visu-
alization, and ontology-based services. The accepted input is a keyword, the
provided result is a list of annotations, similar to the other services but not
including a match type. In the API request, a fieldList parameter can be used
to select the elements to be included in the output, along with other
formatting preferences.
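
A request of this kind can be sketched as follows. This is only an illustration under stated assumptions: the base URL, the parameter names other than fieldList (q, ontology, rows), and the chosen field names reflect our understanding of the public OLS search API and may differ between OLS releases; the helper build_ols_search_url is hypothetical and only composes the URL, without contacting the service.

```python
from urllib.parse import urlencode

# Assumed base URL of the OLS search endpoint; it may differ between releases.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols/api/search"


def build_ols_search_url(keyword, ontologies, fields, rows=15):
    """Compose a search URL restricted to some ontologies and output fields."""
    params = {
        "q": keyword,                      # the keyword to look up
        "ontology": ",".join(ontologies),  # restrict the search, e.g. "ncit,efo"
        "fieldList": ",".join(fields),     # elements to include in the output
        "rows": rows,                      # maximum number of returned results
    }
    return OLS_SEARCH_URL + "?" + urlencode(params)


url = build_ols_search_url(
    "cervical adenocarcinoma",
    ontologies=["ncit", "efo"],
    fields=["label", "short_form", "iri", "ontology_prefix", "description"],
)
print(url)
```

Restricting fieldList keeps the response small, which matters when the same request is repeated for every raw value.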
    Zooma (footnote 1) is a service from OLS which provides mappings between textual
input and a manually curated repository of text-to-ontology-term mappings. If
no mappings are found, it uses the basic OLS search. In addition to the usual
annotation information, Zooma also returns a confidence label associated with the
annotation, ranging from HIGH to LOW.
    We exclude other important ontology search portals such as HeTOP [8] and
UMLS [3], as they are more focused on multilingual support and medical
terminologies, and therefore do not include many ontologies that are important to
annotate our values. The NCBO Annotator [10] is also not considered, since its
functionalities are completely covered by the Ontology Recommender.
1   https://www.ebi.ac.uk/spot/zooma/

2.2     Scoring
Every search service provides a search API, which is repeatedly used for the
score evaluation. For each API call we store: the used service; the attribute from
GCM characterizing the values (the “type” of the values); the original raw value
deriving from the GCM, imported through the mapping phase; possible parsed
values deriving from a simple syntactic pre-processing of raw values (e.g., removal
of punctuation, split of long expressions, ...); the ⟨ontology, ontology_id⟩ pair,
uniquely identifying an ontological term in a service; pref_label and synonym,
respectively the textual expression primarily used for the term and its alternative
versions; score, textual information regarding the goodness of a match, directly
retrieved from the services, if available.
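
The stored record can be pictured with the sketch below. The class name ServiceMatch and the function parse_raw_value are hypothetical; in particular, the concrete pre-processing rules (splitting on ";", "," and "/", then dropping the remaining punctuation) are an assumption made for illustration, since the text only names the kind of operations involved. The sample input is made up.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceMatch:
    """One stored result of a search API call (field names follow the text)."""
    service: str              # e.g. "OLS"
    attribute: str            # GCM attribute, i.e. the "type" of the value
    raw_value: str            # original value imported from the GCM
    parsed_value: str         # value after syntactic pre-processing
    ontology: str             # first half of the identifying pair
    ontology_id: str          # second half of the identifying pair
    pref_label: str
    synonym: Optional[str]    # matched synonym, if the match is not on pref_label
    score: Optional[str]      # textual hint from the service, if available


def parse_raw_value(raw: str) -> List[str]:
    """Assumed pre-processing: split long expressions, then strip punctuation."""
    parsed = []
    for part in re.split(r"[;,/]", raw):
        # Replace punctuation (except hyphens) with spaces, then squeeze spaces.
        cleaned = " ".join(re.sub(r"[^\w\s-]", " ", part).split())
        if cleaned:
            parsed.append(cleaned)
    return parsed


parsed = parse_raw_value("Illumina HiSeq 2000; Illumina (GA IIx)")
print(parsed)
```

Each parsed value then becomes a separate API call, which is why the number of calls can exceed the number of original values.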
    In total, we performed 1,783 API calls to each of the four services,
corresponding to 1,299 original values to be enriched; some of these were split during
a pre-processing phase. As a result, we retrieved 1,783 interesting matches from
BioPortal, 885 from Recommender, 1,782 from OLS, and 1,779 from ZOOMA,
all of which were used for the following processing after calculating our scores.
    Starting from the retrieved information, we calculate the match_score as a
measure of how well a term matches a value, by using a scoring system that is
specifically designed for the task and is described next. The general formula
returning the match_score value, shown in Eq. 1, subtracts from an initial
maximum number (10 when there is a perfect match with a pref_label, 9 with a
synonym) a penalty measuring how the raw value differs from the label retrieved
from the services:

              match_score(raw, label) = {10, 9} − distance(raw, label)          (1)
    To compute the distance, we use a modified version of the Needleman-Wunsch
algorithm [16], a protein and nucleotide sequence alignment algorithm which is
widely used in bioinformatics. In the original algorithm, the input is represented
by two strings whose letters need to be aligned. The letters may have a “match”,
a “mismatch” or an “indel” (i.e., adding a gap in one of the strings). In our
modified version, we define each word as a distinct letter of the original algorithm
and we add another type of mismatch, i.e., the swap. All in all, the total distance
is calculated as a sum of distances between words:
    – Match: two words are the same; their distance is 0
    – Mismatch: two words are different; their distance is 2.5
    – Swap: two consecutive words traded places; their distance is 0.5
    – Delete: one word of the raw is missing from the label; the distance is 2
    – Insert: the label adds a word that is not in the raw; the distance is 1
The indicated distance values are chosen in such a way that the number of
deletions is minimized (i.e., we penalize a label which does not include a word
present in raw ) and the swap is preferred to indel and mismatch. For example,
for the raw “breast invasive carcinoma”, the label “invasive breast carcinoma”
(i.e., one swap) is considered better than “breast carcinoma” (i.e., one deletion).
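
The word-level distance and Eq. 1 can be sketched as a minimum-cost alignment computed by dynamic programming. This is a minimal sketch, not the authors' implementation: the function names, the lower-casing and whitespace tokenization, and the handling of a swap as an adjacent transposition are assumptions; only the five cost values and the base scores 10 and 9 come from the text.

```python
from typing import List

# Word-level costs taken from the list above.
MATCH, MISMATCH, SWAP, DELETE, INSERT = 0.0, 2.5, 0.5, 2.0, 1.0


def word_distance(raw: str, label: str) -> float:
    """Minimum total cost of aligning the words of raw with those of label."""
    a: List[str] = raw.lower().split()    # assumption: case-insensitive words
    b: List[str] = label.lower().split()
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * DELETE              # words of raw missing from label
    for j in range(1, len(b) + 1):
        d[0][j] = j * INSERT              # words of label not present in raw
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            best = min(d[i - 1][j - 1] + sub,
                       d[i - 1][j] + DELETE,
                       d[i][j - 1] + INSERT)
            # Swap: two consecutive words traded places.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + SWAP)
            d[i][j] = best
    return d[-1][-1]


def match_score(raw: str, label: str, match_type: str = "PREF") -> float:
    """Eq. 1: start from 10 (preferred label) or 9 (synonym), subtract distance."""
    base = 10.0 if match_type == "PREF" else 9.0
    return base - word_distance(raw, label)


print(word_distance("breast invasive carcinoma", "invasive breast carcinoma"))
print(word_distance("breast invasive carcinoma", "breast carcinoma"))
```

With these costs the swapped label is penalized less than the label with a missing word, in line with the example in the text.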
    Additional calculated scores are: onto_suitability, a measure of how
much an ontology is adequate for a given attribute, calculated as the average
match_score over all raw values for that attribute; onto_acceptance, a measure
of how well-known and trusted the ontology is by the biomedical community,
computed through Recommender Web Services [15] (footnote 2); the overall_score,
obtained by multiplying each raw value match_score by a weighted average of the
two measures typical of the ontology.
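
How these scores combine can be illustrated as follows. The text does not state the weights or the scales of the two ontology-level measures, so the equal weights and the values in [0, 1] used here are assumptions chosen only to make the arithmetic concrete; the function names mirror the score names.

```python
from statistics import mean
from typing import Iterable


def onto_suitability(match_scores: Iterable[float]) -> float:
    """Average match_score over all raw values of one attribute."""
    return mean(match_scores)


def overall_score(match_score: float, suitability: float, acceptance: float,
                  w_suitability: float = 0.5, w_acceptance: float = 0.5) -> float:
    """match_score multiplied by a weighted average of the two ontology measures.

    The default weights are hypothetical; the source does not report them.
    """
    weighted = (w_suitability * suitability + w_acceptance * acceptance) / (
        w_suitability + w_acceptance)
    return match_score * weighted


# Toy numbers: an ontology with suitability 0.9 and acceptance 0.7 (assumed scale).
print(overall_score(9.5, suitability=0.9, acceptance=0.7))
```

The effect is that two terms with the same match_score are ranked by the quality of the ontology they come from.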

2.3    Service Evaluation
Table 1 describes the obtained results. The “Service Properties” part contains an
overview of service properties. BioPortal and Recommender provide a match type
(MT) in their API responses, which means that they specify whether the input text
matches the preferred label or one of the synonyms associated
with a term. Recommender offers the additional function of searching for multiple
keywords at the same time (MK) and consequently suggests a minimal set of
ontologies suitable for annotating the maximum possible number of keywords.
This function is also offered by ZOOMA which, however, in practice just
performs multiple single-keyword requests and lists all results at the same time.
Only Recommender makes a good attempt at annotating free text (FT).
BioPortal’s set of ontologies is much broader than that of OLS, since minor efforts are also
included. ZOOMA exploits search results from OLS but also provides results
coming from previous manual curation works as an additional service to the
user.

         Table 1. Summary of Ontology Search Services as of October 1, 2018

                                BioPortal     Recommender   OLS            ZOOMA
Service properties
  Search properties             MT            MT,MK,FT      -              MK
  Num. of ontologies            728           728           214            214
  Previous curation             no            no            no             yes
Example scoring of “cervical adenocarcinoma”
  1st best match                ncit_c4029    ncit_c4029    ncit_c4029     efo_0001416
  2nd best match                efo_0001416   None          efo_0001416    None
  3rd best match                doid_3702     None          ncit_c136651   None
  Occurrence score              1             0.5           1              0.5
  Coverage score                1             1             1              1
Aggregated scores
  Occurrence                    83.17%        46.97%        90.54%         75.96%
  Coverage                      100.00%       49.88%        99.94%         99.78%




    The “Example scoring” part contains an example of how services are
rewarded based on the matching terms they find. To evaluate the match, we use
the overall_score described in Section 2.2. When the disease-related text
“cervical adenocarcinoma” is searched, BioPortal suggests, among others, the three
terms “ncit_c4029”, “efo_0001416”, and “doid_3702”, while Recommender just
provides one result, “ncit_c4029”. Our algorithm for Occurrence computes the
set of terms which occur the highest number of times in the top three matches
of the services (in this case [“ncit_c4029”, “efo_0001416”]) and assigns, for each
of these terms, a weighted reward (1 if the set only contains one entry, 0.5 if it
contains 2, and so on) to the
services which include that term in the top results. Indeed BioPortal scores 1
since it contains both top results, while Recommender scores 0.5 since it contains
just one. Coverage is 1 when the service provides at least one result, 0 otherwise.

2   The onto_acceptance score (Section 2.2) is derived from the number of visits to
    the ontology page in BioPortal and the presence or absence of the ontology in
    UMLS [3].

    We use as scores for service selection the average Occurrence and Coverage
over all the searched raw values. On this basis, OLS is selected as the best suited
search service to pursue the enrichment annotations in our system.
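
The Occurrence and Coverage scores of a single searched value can be sketched as below. This is one reading of the description that reproduces the example row of Table 1, not necessarily the exact procedure used: the function name is hypothetical, term identifiers are written with underscores, and each most frequent term is assumed to contribute a reward of 1 divided by the number of most frequent terms.

```python
from collections import Counter
from typing import Dict, List, Tuple


def occurrence_and_coverage(
        top_matches: Dict[str, List[str]]) -> Dict[str, Tuple[float, int]]:
    """Map each service to (occurrence, coverage) for one searched raw value.

    top_matches holds, per service, its (up to) three best matching terms.
    """
    # Count in how many services each term appears among the top matches.
    counts = Counter(t for terms in top_matches.values() for t in set(terms))
    if not counts:
        return {service: (0.0, 0) for service in top_matches}
    highest = max(counts.values())
    winners = {term for term, c in counts.items() if c == highest}
    reward = 1.0 / len(winners)  # 1 for one entry, 0.5 for two, and so on
    return {
        service: (reward * len(winners & set(terms)), 1 if terms else 0)
        for service, terms in top_matches.items()
    }


example = {
    "BioPortal": ["ncit_c4029", "efo_0001416", "doid_3702"],
    "Recommender": ["ncit_c4029"],
    "OLS": ["ncit_c4029", "efo_0001416", "ncit_c136651"],
    "ZOOMA": ["efo_0001416"],
}
scores = occurrence_and_coverage(example)
for service, (occurrence, coverage) in scores.items():
    print(service, occurrence, coverage)
```

Averaging these per-value pairs over all searched raw values gives the aggregated percentages used for service selection.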

2.4   Ontology Selection
Based on the overall_score described in Section 2.2, we also aggregate results
over specific attributes and ontologies. This calculation produces, as a result,
one top ontology for each attribute. Since a single ontology often
does not provide acceptable coverage for all the values belonging to an
attribute, we use an algorithm to compute a small set of ontologies to annotate
the values of an attribute. The algorithm first tries to match values only with
the first ontology, then tries to match only the ones left unmatched with the
following ontologies, until a fixed point for coverage is found. If the computational
costs become too high, the algorithm can be stopped at a predefined coverage
threshold that is considered acceptable. In our case we set the threshold to 95%.
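
The selection of a small ontology set can be sketched as a greedy pass over the ranked ontologies. The function name, the input format (a precomputed mapping from each ontology to the values it can annotate) and the toy data are assumptions made for illustration; only the procedure and the 95% threshold come from the text.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_ontology_set(ranked_ontologies: Iterable[str],
                        matched_values: Dict[str, Set[str]],
                        all_values: Set[str],
                        threshold: float = 0.95) -> Tuple[List[str], float]:
    """Pick ontologies in rank order until coverage stops growing or is enough."""
    unmatched = set(all_values)
    total = len(unmatched)
    selected: List[str] = []
    for ontology in ranked_ontologies:
        if total == 0 or 1 - len(unmatched) / total >= threshold:
            break  # acceptable coverage reached, stop early
        # Only values left unmatched by the previous ontologies are considered.
        newly_matched = unmatched & matched_values.get(ontology, set())
        if newly_matched:
            selected.append(ontology)
            unmatched -= newly_matched
    coverage = 1 - len(unmatched) / total if total else 1.0
    return selected, coverage


# Toy data: four hypothetical raw values of one attribute.
values = {"v1", "v2", "v3", "v4"}
matches = {"OBI": {"v1", "v2"}, "EFO": {"v2", "v3"}, "NCIT": {"v1"}}
selected, coverage = select_ontology_set(["OBI", "EFO", "NCIT"], matches, values)
print(selected, coverage)
```

An ontology that adds no new value is skipped, so once the ranked list is exhausted the coverage cannot grow further, which corresponds to the fixed point mentioned above.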
    The resulting choice of ontology sets is: OBI for Platform, NCIT for Ethnicity,
NCBITaxon for Species, NCIT for Disease, UBERON for Tissue, {EFO,CL} for
CellLine, NCIT for Feature, and OGG for Target. All the above choices meet the
set threshold. Our best choice for the attribute Technique is the set {OBI,EFO},
but for this attribute we are not able to achieve the coverage threshold, as we
reach a best coverage of 85.7%.


3     Metadata Enrichment
After selecting such sets, we proceed with the enrichment of the values contained
in the ontological attributes of the GCM. Section 3.1 presents the relational
schema which supports this phase. Section 3.2 describes the enrichment
process and Section 3.3 shows how the automatic annotation is aided by curator
intervention. Finally, Section 3.4 overviews the expert validation.

3.1   Relational Schema
Figure 1 describes the logical schema of the relational database. The Genomic
Conceptual Model frame contains the tables from the GCM (of which we only
show in detail the ones which have ontological attributes). The Local Knowledge
Base (LKB) frame stores all the information retrieved from OLS services
and relevant to annotate our values. The main tables are: vocabulary (storing
the reference term ids), synonym (containing synonyms of the preferred label in
the vocabulary), reference (identifiers of equivalent terms in alternative
ontologies), ontology (dimension table for used ontologies), and relationship
(representing links between terms in the ontology). The Expert Support frame
contains the tables storing information for expert users. Each GCM ontological
attribute X is equipped with a companion attribute X_tid, which references the
ontological term in the vocabulary table (e.g., Platform with value “Illumina
Human Methylation 450” is associated with Platform_tid = 10, representing the
vocabulary object OBI_0001870, taken from the Ontology of Biomedical
Investigations [1]). The vocabulary table is the central entity of the LKB schema. The
tid column is the primary key which is referenced by all other tables in LKB
and from the tables in GCM. The tables from the LKB and from the Expert Support
frame are also linked using tids.


   (PK: primary key; FK: foreign key; N: nullable; “...” marks columns or tables not shown in detail)

   Local Knowledge Base
     vocabulary            tid int PK; source text FK; code text; pref_label text;
                           description text; iri text
     synonym               tid int PK; label text PK; type text PK
     reference             tid int PK; source text PK; code text PK
     relationship          tid_parent int PK; tid_child int PK; rel_type text PK
     ontology              source text PK; title text N; description text N; url text N

   Expert Support
     choices_for_curator   id int PK; resolved bool; table text; column text; tid int FK;
                           raw_value text; parsed_value text N; label text N; source text N;
                           code text N; iri text N; provenance text
     curator_preference    id int PK; table text; column text; raw_value text; source text;
                           code text
     expert_feedback       username text PK; raw_value text PK; table text PK; column text PK;
                           tid int FK; rating int

   Genomic Conceptual Model
     donor                 donor_id int PK; species text N; species_tid int N;
                           ethnicity text N; ethnicity_tid int N; ...
     biosample             biosample_id int PK; donor_id int FK; tissue text N; tissue_tid int N;
                           cell_line text N; cell_line_tid int N; disease text N;
                           disease_tid int N; ...
     item                  item_id int PK; experiment_type_id int FK; dataset_id int FK;
                           platform text N; platform_tid int N; ...
     experiment_type       experiment_type_id int PK; technique text N; technique_tid int N;
                           feature text N; feature_tid int N; target text N; target_tid int N; ...
     replicate, replicate2item, case2item, dataset, project, case_study: ...




     Fig. 1. Relational schema for tables of the GCM, LKB and user feedback routines.
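
The link between a GCM attribute and the LKB can be illustrated with a small in-memory database. This is a reduced sketch: it uses SQLite through Python's sqlite3 module for self-containment (the schema in Fig. 1 targets PostgreSQL), it creates only three of the tables with a subset of their columns, and the inserted rows reuse the Platform example from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology (
    source      TEXT PRIMARY KEY,
    title       TEXT
);
CREATE TABLE vocabulary (
    tid         INTEGER PRIMARY KEY,
    source      TEXT REFERENCES ontology(source),
    code        TEXT,
    pref_label  TEXT
);
CREATE TABLE item (
    item_id       INTEGER PRIMARY KEY,
    platform      TEXT,
    platform_tid  INTEGER REFERENCES vocabulary(tid)  -- companion attribute
);
""")

conn.execute("INSERT INTO ontology (source, title) VALUES (?, ?)",
             ("OBI", "Ontology of Biomedical Investigations"))
conn.execute("INSERT INTO vocabulary (tid, source, code) VALUES (?, ?, ?)",
             (10, "OBI", "OBI_0001870"))
conn.execute("INSERT INTO item (item_id, platform, platform_tid) VALUES (?, ?, ?)",
             (1, "Illumina Human Methylation 450", 10))

# Resolve the ontological term that annotates the platform of an item.
row = conn.execute("""
    SELECT i.platform, v.source, v.code
    FROM item AS i
    JOIN vocabulary AS v ON v.tid = i.platform_tid
""").fetchone()
print(row)
```

Keeping the raw text value next to its tid lets a search use either the original wording or the ontological term and its relatives.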




3.2   Enrichment Process

To enrich the values contained in the ontological attributes of the GCM, we
iterate over all values associated with a tid column. For each value we call OLS services
with the ontology sets indicated in Section 2.4. When a best match_score,
calculated as in Eq. 1, is found and is above the threshold 5.0, we select the
corresponding term and proceed with the annotation; otherwise the decision is
delegated to data curators (see Section 3.3).
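
The decision rule for a single value can be sketched as follows. The function name and the representation of candidates as (term, score) pairs are assumptions; the threshold 5.0 comes from the text, and "above" is read here as strictly greater.

```python
from typing import List, Optional, Tuple

THRESHOLD = 5.0  # minimum match_score for an automatic annotation


def annotate_or_delegate(candidates: List[Tuple[str, float]],
                         threshold: float = THRESHOLD) -> Tuple[str, Optional[str]]:
    """Return ("annotate", term) for a good best match, else ("curate", None)."""
    if candidates:
        term, score = max(candidates, key=lambda c: c[1])
        if score > threshold:
            return ("annotate", term)
    # Low-score matches or no result: the decision goes to data curators.
    return ("curate", None)


print(annotate_or_delegate([("ncit_c4029", 9.5), ("efo_0001416", 7.0)]))
print(annotate_or_delegate([("doid_3702", 4.0)]))
```

In the second call the low-score candidate is not discarded in practice, since such matches can still be offered to curators as suggestions.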
    Once the term has been selected, we populate the tables of the LKB with all
the information derived from OLS regarding the term: description, iri, synonyms,
xrefs, hypernyms and hyponyms (both of is_a and part_of kinds). The depths
of ancestors and descendants retrieved from the ontology are configurable
through constants. The automatic enrichment process currently annotates
about 83% of the total raw values, while the remaining ones are handled using a
manual curation procedure.

3.3   Biologists Support
We propose two procedures which allow expert curators to support the
annotation algorithm; we assume them to be knowledgeable about biological data
management and to be expert in genomic data curation.
    In the first procedure, a curator can examine all cases in which the algorithm
is not able to provide a high quality match (i.e., the service provides either partial
matches with low score or no result). The low-score matches are proposed as
suggestions so that the curator may select one of them. In any case, a manual
annotation can always be provided. The procedure can be configured so that it
also shows the cases with the same score.
    The second procedure is started when a pre-existing annotation is not ade-
quate (i.e., a tid column has been filled with a wrong vocabulary term). In this
case, the curator can invalidate the annotation and provide an alternative.

3.4   Expert Validation
We conducted a validation by engaging six experts with good biological
knowledge. For each considered attribute, we presented to them a set of annotations
(i.e., matches between a raw value and an ontological term, equipped with its
description) automatically produced by the enrichment procedure. We asked them to
rate the associations according to how accurate they are with respect to their knowledge.


                         Table 2. Expert Validation results

Attribute      Platform  Ethnicity  Species  Disease  Tissue  CellLine  Technique  Feature  Target
#annot/total   3/4       20/33      4/4      76/97    82/121  191/282   10/14      9/22     738/787
exact          5.56%     67.50%     100.00%  64.17%   88.33%  84.17%    73.33%     61.11%   100.00%
good           27.78%    6.67%      -        6.67%    6.67%   4.17%     8.33%      9.26%    -
acceptable     50.00%    19.17%     -        10.83%   2.50%   4.17%     6.67%      12.96%   -
wrong          16.67%    5.83%      -        17.50%   1.67%   6.67%     6.67%      14.81%   -
do not know    -         0.83%      -        0.83%    0.83%   0.83%     5.00%      1.85%    -



    The questionnaire contains up to 20 matches for each attribute (or fewer in
the case of Platform, Species, Technique, and Feature, for which fewer
matches were found), selected randomly from their value pools and therefore
considered representative of the sets. The test allows five choices: 1. exact,
2. good, 3. acceptable, 4. wrong, 5. do not know.
    In Table 2, in the first row we indicate, for each attribute, the ratio between
the number of automatically annotated values and the number of their total
distinct values. Then, we show in detail the results from the attributes presented
to experts. The averaged results highlight that in 83.06% of cases the experts
marked the examined matches as exact or good, in 8.81% they rated them as
acceptable, and only 7.05% were marked as wrong. In the remaining 1.08%
of cases the experts declared they were not able to evaluate the match.

4     Related Works
Many works in the literature consider the problem of recognizing ontological
concepts to perform semantic annotation of data. For example: Bodenreider [4]
proposes a (dated) survey on the use of ontologies in biomedical data manage-
ment and integration; the works [11, 18] debate solutions devoted to data inte-
gration; Giles et al. [7] focus on concept extraction from datasets of a specific
source; the works [14, 5] consider the problem of metadata authoring by using
BioPortal ontology-based recommendations, with a focus on metadata manual
creation and preparation. A number of articles have addressed the problem of
choosing ontologies for semantic enrichment. Among these: Wilkinson et al. [20]
present the FAIR principles, which define a set of characteristics that data
resources and infrastructures should exhibit; Oliveira et al. [17] identify key
search factors for
biomedical ontologies to help biomedical experts in selecting the best-suited
ones in their search cases. In Section 2.1 we presented BioPortal [19], Ontology
Recommender [15], Ontology Lookup Service [12] and Zooma, since we believe
UMLS [3], HeTOP [8], and Annotator [10] were not suited for our purpose.

5     Conclusion and Future Work
Annotating metadata with terms from ontologies and providing an expansion
to hypernyms and hyponyms allows for easier and semantically flexible dataset
search. We provide selection criteria for choosing among search services and
ontologies, and a user-friendly process for assisting biologists in checking that
suggested terms are indeed acceptable. We also provided an internal validation
of annotations produced by our process. As future work, we intend to improve
the matching algorithm by exploiting the ontology structures and information.
We plan to integrate more sources and test our method on a comprehensive
database.
    The implementation of the metadata enrichment system described in this
paper is available at: https://github.com/DEIB-GECO/Metadata-Enricher.
It is used in the broader context of a genomic repository, developed within the
GeCo Project (footnote 3), which will be available for use in the near future.
Enriched metadata help users in locating datasets for genomic data extraction and
analysis, either on their original sources or within our repository.

Acknowledgment
This research is funded by the ERC Advanced Grant 693174 GeCo (data-driven
Genomic Computing).
3   http://www.bioinformatics.deib.polimi.it/geco/

References
 1. Bandrowski, A., et al.: The ontology for biomedical investigations. PloS one 11(4),
    e0154556 (2016)
 2. Bernasconi, A., et al.: Conceptual modeling for genomics: Building an integrated
    repository of open data. In: Mayr, H.C., Guizzardi, G., Ma, H., Pastor, O.
    (eds.) Conceptual Modeling. pp. 325–339. Springer International Publishing, Cham
    (2017)
 3. Bodenreider, O.: The unified medical language system (UMLS): integrating
    biomedical terminology. Nucleic acids research 32(suppl 1), D267–D270 (2004)
 4. Bodenreider, O.: Biomedical ontologies in action: role in knowledge management,
    data integration and decision support. Yearbook of Medical Informatics p. 67
    (2008)
 5. Egyedi, A.L., et al.: Embracing semantic technology for better metadata authoring
    in biomedicine. In: Proceedings of SWAT4LS International Conference 2017 (2017)
 6. Consortium ENCODE: An integrated encyclopedia of DNA elements in the human
    genome. Nature 489(7414), 57–74 (2012)
 7. Giles, C.B., et al.: Ale: automated label extraction from GEO metadata. BMC
    Bioinformatics 18(14), 509 (2017)
 8. Grosjean, J., et al.: Health multi-terminology portal: a semantic added-value for
    patient safety. Studies in health technology and informatics 166, 129 (2011)
 9. Jensen, M.A., et al.: The NCI Genomic Data Commons as an engine for precision
    medicine. Blood 130(4), 453–459 (2017)
10. Jonquet, C., Shah, N., Musen, M.: The open biomedical annotator. In: AMIA
    Summit on Translational Bioinformatics. pp. 56–60 (2009)
11. Jonquet, C., et al.: A system for ontology-based annotation of biomedical data.
    In: International Workshop on Data Integration in The Life Sciences. pp. 144–152.
    Springer (2008)
12. Jupp, S., et al.: A new Ontology Lookup Service at EMBL-EBI. In: Malone, J.,
    et al. (eds.) Proceedings of SWAT4LS International Conference 2015 (2015)
13. Kundaje, A., et al.: Integrative analysis of 111 reference human epigenomes. Nature
    518(7539), 317–330 (2015)
14. Martínez-Romero, M., et al.: Fast and accurate metadata authoring using ontology-
    based recommendations. In: AMIA Annual Symposium Proceedings. vol. 2017,
    p. 1272. American Medical Informatics Association (2017)
15. Martínez-Romero, M., et al.: NCBO Ontology Recommender 2.0: an enhanced
    approach for biomedical ontology recommendation. Journal of Biomedical Semantics
    8(1), 21 (2017)
16. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for
    similarities in the amino acid sequence of two proteins. Journal of molecular biology
    48(3), 443–453 (1970)
17. Oliveira, D., et al.: Where to search top-k biomedical ontologies? Briefings in Bioin-
    formatics p. bby015 (2018)
18. Shah, N.H., et al.: Ontology-driven indexing of public datasets for translational
    bioinformatics. BMC Bioinformatics 10(2), S1 (2009)
19. Whetzel, P.L., et al.: Bioportal: enhanced functionality via new web services from
    the national center for biomedical ontology to access and use ontologies in software
    applications. Nucleic Acids Research 39(suppl 2), W541–W545 (2011)
20. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data manage-
    ment and stewardship. Scientific Data 3 (2016)