=Paper=
{{Paper
|id=Vol-2137/paper_23.pdf
|storemode=property
|title=Supporting Ontology-Based Standardization of Biomedical Metadata in the CEDAR Workbench
|pdfUrl=https://ceur-ws.org/Vol-2137/paper_23.pdf
|volume=Vol-2137
|authors=Marcos Martínez-Romero,Martin J. O'Connor,Michael Dorf,Jennifer Vendetti,Debra Willrett,Attila L. Egyedi,John Graybeal,Mark A. Musen
|dblpUrl=https://dblp.org/rec/conf/icbo/RomeroODVWEGM17
}}
==Supporting Ontology-Based Standardization of Biomedical Metadata in the CEDAR Workbench==
<pdf width="1500px">https://ceur-ws.org/Vol-2137/paper_23.pdf</pdf>
<pre>
         Supporting Ontology-Based Standardization of Biomedical
                    Metadata in the CEDAR Workbench
        Marcos Martínez-Romero*, Martin J. O’Connor, Michael Dorf, Jennifer Vendetti,
             Debra Willrett, Attila L. Egyedi, John Graybeal, and Mark A. Musen
            Center for Biomedical Informatics Research, Stanford University, 1265 Welch Rd, Stanford, CA 94305, USA


ABSTRACT
The availability of associated descriptive metadata for scientific datasets is important for
discovering and reproducing scientific experiments. The use of ontologies has become a key focus
for increasing the quality of these metadata. Despite the wide availability of biomedical
ontologies, scientists wishing to use these ontologies when developing metadata descriptions face
a number of practical difficulties. A core difficulty is the lack of tools for developing
ontology-linked metadata specifications that can be published and shared. Additional difficulties
include the lack of support for defining new terms in cases when no existing terms are found and
for creating custom term collections to meet domain-specific needs. To address these problems, we
developed tools that allow scientists to find terms in ontologies for annotating their data and to
dynamically create new terms and value sets. This work has been incorporated into a Web-based
platform called the CEDAR Workbench. The resulting integrated environment presents a set of highly
interactive interfaces for creating and publishing ontology-rich metadata specifications.

1    INTRODUCTION
In biomedicine, high-quality, standardized metadata are crucial for facilitating the discovery of
scientific datasets and reproducibility of the corresponding experiments. In the last few years,
the biomedical community has driven the development of metadata standards and guidelines for a
variety of experiment types. Scientists use these specifications to inform their annotation of
experimental results (Tenenbaum, Sansone, & Haendel, 2014). One of the earliest examples is the
MIAME standard (Brazma et al., 2001), which is used to describe metadata about microarray
experiments. These standards and guidelines underpin metadata submissions to many public metadata
repositories (Edgar, Domrachev, & Lash, 2002). The BioSharing resource (McQuilton et al., 2016)
catalogs hundreds of these standardization efforts.
   Despite the growing use of standards for defining metadata and the wide availability of
biomedical ontologies, metadata submitted to public repositories rarely use standard terms
(Bui & Park, 2006). As a result, finding or reusing the metadata is a challenge and understanding
the underlying experiments can be extremely hard, often requiring significant post-processing of
metadata to extract useful content.
   A key problem is that scientists face considerable practical barriers when attempting to link
their metadata to ontology terms. Submission mechanisms for biomedical repositories are typically
based on spreadsheets, with a variety of ad hoc formats that rarely support inclusion of
ontology-based annotations. Even in cases where such annotations can be entered, scientists have
no easy way to find and use terms from ontologies to include in their metadata submissions. Other
difficulties include poor support for on-the-fly term creation when the necessary terms are not
found and for creating custom lists of terms to meet domain-specific needs.
   A variety of tools have been developed to address the challenge of metadata quality. Foremost
among these are the ISA Tools (Rocca-Serra et al., 2010), which allow curators to create
spreadsheet-based submissions for metadata repositories. LinkedISA provides a means to
interoperate with Linked Open Data, effectively adding controlled term linkage to templates
(González-Beltrán, Maguire, Sansone, & Rocca-Serra, 2014). A similar spreadsheet-based tool called
RightField (Wolstencroft et al., 2011) provides a mechanism for embedding ontology annotation
capabilities in Excel or Open Office spreadsheets using ontologies from the BioPortal repository
(Noy et al., 2009). Annotare (Shankar et al., 2010), which is used to submit experimental data to
the ArrayExpress metadata repository (Parkinson et al., 2005), also supports ontology-based
suggestions. These tools address specific issues of metadata quality but they do not provide an
integrated environment that can support the entire metadata specification and submission process
for widely used biomedical repositories.
   The Center for Expanded Data Annotation and Retrieval (CEDAR)1 is developing a computational
ecosystem to overcome the barriers to creating high-quality metadata in biomedicine (Musen et al.,
2015). CEDAR provides a suite of highly sophisticated tools designed to make the authoring of
metadata as natural as possible, while also using ontologies to enrich the generated descriptions
with standard terms.
   In this paper, we describe the main features CEDAR developed to make it possible to easily
construct Web-based metadata-acquisition forms, enrich those forms with ontology concepts, and
then fill out the forms to create ontology-annotated descriptions of scientific experiments.

1 https://metadatacenter.org/


Martínez-Romero et al.


     Fig. 1. An overview of CEDAR’s metadata authoring workflow. Template authors use the Template Designer tool to create metadata
     templates. The Metadata Editor uses these templates to generate a graphical interface to acquire metadata from scientists. Acquired
     metadata are saved in CEDAR’s Metadata Repository.
2    BACKGROUND
The CEDAR Workbench2 is a suite of Web-based tools and REST APIs centered on the use of highly
modular metadata-acquisition forms called metadata templates (or simply templates). These
templates define the data attributes (termed template fields or fields) needed to describe
biomedical experiments. For example, an experiment template may have an organism field containing
the name of the organism being studied by the experiment (e.g., Homo sapiens). The templates may
specify lists of permissible values for template fields. The central goal when designing a
template is to enable the capture of sufficiently precise and complete metadata about experimental
data to facilitate data discovery, interpretation, and reuse.
   The CEDAR Workbench provides three core components that form a metadata construction pipeline
(Fig. 1): (1) a Template Designer, which supports interactive template creation; (2) a Metadata
Editor, which allows end-users to fill in templates with metadata; and (3) a Metadata Repository
for storing both templates and the metadata created using those templates. The CEDAR Workbench
also allows scientists to upload the metadata created to public biomedical repositories.

2.1    Template Designer and Metadata Editor
In the Template Designer, template authors assemble templates from one or more input fields. There
are numerous field types available to template authors (e.g., text, paragraph, e-mail, numeric,
and date). Users can also define reusable groups of fields, called elements. For example, the
fields that describe a publication (e.g., authors, title, year, publication type) could be grouped
together to form a publication element, which can then be reused in multiple templates. After a
template is created, the Metadata Editor can be used to automatically generate a forms-based
acquisition interface for entering metadata for that template. Scientists entering metadata using
the Metadata Editor are prompted in real time with drop-down lists, auto-completion suggestions,
and verification hints, significantly reducing their error rate while speeding metadata entry and
repair. These prompts are driven by the value constraints specified in templates.

2.2    Metadata Repository
Templates and metadata produced by the Workbench are stored in CEDAR’s metadata repository. CEDAR
incorporates a standardized model of templates and metadata, together with Web-based services to
store, search, and share these resources (O’Connor et al., 2016). This model is based on the JSON
Schema and JSON-LD specifications. It allows users to publish their metadata as both JSON-LD and
RDF, thus facilitating interoperation with Linked Open Data.

2.3    Support for ontology-based metadata
The CEDAR tools provide mechanisms for structurally describing templates and publishing metadata
created using those templates in an open format. To increase the metadata quality further, we
offer the ability to enrich these descriptions with controlled terms from ontologies. We extended
the Template Designer and Metadata Editor to let users specify semantic content for templates and
to easily enter semantically precise terms in their metadata. These extensions can help to improve
metadata adherence to the FAIR data principles (Wilkinson et al., 2016) and interoperability with
Linked Open Data.

2 https://cedar.metadatacenter.net
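
The JSON-LD basis of the repository model (Section 2.2) can be illustrated with a small sketch.
It is not CEDAR's actual schema: the field names, the example.org IRIs, and the to_triples helper
are assumptions made for illustration. Only the NCBITaxon identifier for Homo sapiens is a real
ontology IRI. The sketch shows how a context that maps fields to property IRIs lets a metadata
instance be read as RDF-style triples.

```python
from typing import Iterator, Tuple

# Hypothetical, heavily simplified metadata instance (not CEDAR's real model).
instance = {
    "@context": {
        # Hypothetical property IRIs for two template fields.
        "organism": "http://example.org/props/organism",
        "title": "http://example.org/props/title",
    },
    "@id": "http://example.org/instances/study-1",
    # A field whose value is an ontology term: the value is an IRI plus a label.
    "organism": {
        "@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
        "rdfs:label": "Homo sapiens",
    },
    # A field whose value is a plain literal.
    "title": {"@value": "Example study"},
}


def to_triples(doc: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) for every field named in the context."""
    subject = doc["@id"]
    for field_name, predicate in doc["@context"].items():
        value = doc.get(field_name)
        if value is None:
            continue
        # Prefer the term IRI; fall back to the literal value.
        yield (subject, predicate, value.get("@id", value.get("@value")))


triples = list(to_triples(instance))
```

Because the organism value is an IRI rather than free text, two instances that use the same term
can be matched without any string normalization.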




    Fig. 2. Screenshot of CEDAR Template Designer’s ontology lookup interface. Here, the user entered the search term publication and
    selected the class Publication from the National Cancer Institute Thesaurus (NCIT). The location of the selected class in the class tree is
    presented, as well as class and ontology details.
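
    The lookup shown in Fig. 2 can be approximated programmatically. The sketch below is an
    illustration only: it assumes BioPortal's public search endpoint with its q, ontologies, and
    apikey parameters, it does not reproduce the CEDAR terminology service, and the response
    fragment with its example.org IRIs is hand-written rather than real NCIT data.

```python
from urllib.parse import urlencode

# Assumption: BioPortal's public search endpoint and its query parameters.
SEARCH_ENDPOINT = "https://data.bioontology.org/search"


def build_search_url(term, ontologies, api_key):
    """Build a search URL restricted to the given ontology acronyms."""
    query = {"q": term, "ontologies": ",".join(ontologies), "apikey": api_key}
    return SEARCH_ENDPOINT + "?" + urlencode(query)


def exact_matches(response, term):
    """Keep results whose preferred label equals the search term, ignoring case."""
    wanted = term.casefold()
    return [
        result
        for result in response.get("collection", [])
        if result.get("prefLabel", "").casefold() == wanted
    ]


# Hand-written response fragment; the IRIs are placeholders, not NCIT identifiers.
sample_response = {
    "collection": [
        {"prefLabel": "Publication", "@id": "http://example.org/ncit/Publication"},
        {"prefLabel": "Publication Date", "@id": "http://example.org/ncit/PublicationDate"},
    ]
}

url = build_search_url("publication", ["NCIT"], "YOUR_API_KEY")
hits = exact_matches(sample_response, "publication")
```

    An interactive interface would present all results and let the author choose. The exact-match
    filter here only stands in for that selection step.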


3    IMPLEMENTATION
We have enhanced the CEDAR Workbench to provide the ability to link ontology terms selected from
BioPortal to biomedical metadata. BioPortal, developed by the National Center for Biomedical
Ontology (NCBO) (Musen et al., 2012), is a popular platform for hosting and sharing biomedical
ontologies. It provides access to more than 550 ontologies, and contains over 8 million classes
and 64,000 properties. The BioPortal API provides a rich set of operations to access and use
ontologies. We extended this API to provide the fine-grained, highly interactive class and
property lookup features needed by CEDAR’s term search and selection features. To facilitate
general use of these features, we encapsulated BioPortal’s API as a CEDAR service and made it
available as a public REST endpoint.3 We now describe these extensions.

3.1    Class and Property Search
CEDAR allows template authors to search for ontology terms to annotate their templates, that is,
to add type and property assertions to template elements and fields using ontology classes and
properties. Classes and object, data, and annotation properties for performing these annotations
can be selected from terms supplied by BioPortal. Fig. 2 shows a screenshot of the ontology lookup
user interface of the CEDAR Workbench. In the example shown, the template author entered the
search term publication and then selected the Publication class from the National Cancer Institute
Thesaurus (NCIT). The interface shows detailed information both for the selected class and the
associated ontology, as well as for the position of the class in the class tree of NCIT.

3.2    Value Set creation
A value set is a list of possible values for a specific purpose. In the CEDAR Workbench, value
sets are a useful mechanism to define pick lists of permissible values for template fields. CEDAR
works in conjunction with BioPortal to allow template authors to dynamically create value sets
containing the terms in these pick lists. Value sets can contain

3 The REST endpoints that provide ontology-based services to the CEDAR Workbench are documented at
  https://terminology.metadatacenter.net/api.




                                                                   rowMatch, relatedMatch). Upon creation, a class is imme-
                                                                   diately assigned a unique, provisional IRI.
                                                                      For example, suppose that a user needs to use the ana-
                                                                   tomical term adductor dorsalis. This term is not available in
                                                                   any BioPortal ontology, though the adductor muscle class in
                                                                   the UBERON ontology is a close conceptual match. In this
                                                                   case, the user decides to create an adductor dorsalis class
                                                                   via the CEDAR Workbench and indicate that the new term
                                                                   is a subclass of the adductor muscle UBERON class. Fig. 4
                                                                   shows the class creation interface for this example. The ad-
                                                                   ductor dorsalis class is stored in BioPortal as a CEDAR
                                                                   provisional class and is immediately available to all
                                                                   CEDAR users. Eventually, maintainers of UBERON may
                                                                   decide to incorporate the adductor dorsalis class into the
                                                                   ontology or may decide to reject it. If the class is added to
                                                                   UBERON, the permanent identifier for the class will be
                                                                   stored as part of the information of the provisional class. If
                                                                   adductor dorsalis is not included in the next version of the
                                                                   ontology, the subclassOf link will be removed, but the class
                                                                   will still be valid in CEDAR.
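
The life cycle of a provisional class described above can be sketched as follows. This is an
illustration under stated assumptions, not CEDAR's implementation: the ProvisionalClass type, its
accept and reject methods, the provisional namespace, and the UBERON placeholder IRI are all
hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical namespace; the text does not describe the real provisional IRI scheme.
PROVISIONAL_NS = "http://example.org/provisional_classes/"
# Placeholder standing in for the IRI of the UBERON adductor muscle class.
ADDUCTOR_MUSCLE = "http://example.org/uberon/adductor_muscle"


@dataclass
class ProvisionalClass:
    label: str
    subclass_of: Optional[str] = None   # IRI of the parent class, if any
    permanent_id: Optional[str] = None  # recorded if the ontology adopts the term
    # A unique provisional IRI is assigned as soon as the class is created.
    iri: str = field(default_factory=lambda: PROVISIONAL_NS + uuid.uuid4().hex)

    def accept(self, permanent_iri: str) -> None:
        """The ontology adopted the term: store its permanent identifier."""
        self.permanent_id = permanent_iri

    def reject(self) -> None:
        """The ontology declined the term: drop the subclassOf link, keep the class."""
        self.subclass_of = None


term = ProvisionalClass("adductor dorsalis", subclass_of=ADDUCTOR_MUSCLE)
provisional_iri = term.iri
term.reject()  # the class stays usable under its provisional IRI
```

Keeping the provisional IRI stable in both outcomes means that metadata already annotated with the
term never needs to be rewritten.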
                                                                   3.4    Value Constraints
                                                                   With the above functionality, the system can limit the possible
                                                                   values of a template field to a predefined set of ontology
                                                                   terms or value sets. Some template authors may need to
                                                                   define value constraints that go beyond predefined term

Fig. 3. Screenshot showing an example of value set creation. The
user is building a Longitudinal study types value set with terms
from the Clinical Trials Ontology (shown as CTO).

classes from any combination of BioPortal ontologies. Upon
creation, a value set is immediately assigned a unique, pro-
visional IRI. The CEDAR Workbench supports the creation,
retrieval, update, and deletion of these value sets.
   For example, suppose that the template author wishes to
constrain the values of a Study type field to three specific
types of longitudinal studies (prospective study, retrospec-
tive study, and hybrid study). The Clinical Trials Ontology
(CTO) is a good source of these types since it contains 375
study type classes (represented as descendants of the Study
type class). Instead of selecting all these types, the template
author can create a value set containing only the desired
types. Fig. 3 shows a screenshot of value set creation fea-
tures for this example presented in the Template Designer.
Here, the user creates a value set named Longitudinal study
types with three terms selected from CTO.
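
The example above can be sketched in code. The ValueSet type, its methods, and every example.org
IRI below are assumptions made for illustration; they are not the CEDAR or BioPortal API, and the
three term IRIs are placeholders rather than real CTO identifiers.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ValueSet:
    name: str
    terms: dict = field(default_factory=dict)  # term IRI -> label
    # A unique provisional IRI is assigned on creation (hypothetical namespace).
    iri: str = field(
        default_factory=lambda: "http://example.org/value_sets/" + uuid.uuid4().hex
    )

    def add(self, term_iri: str, label: str) -> None:
        """Add a term, or update its label if it is already present."""
        self.terms[term_iri] = label

    def remove(self, term_iri: str) -> None:
        """Remove a term; removing an absent term is a no-op."""
        self.terms.pop(term_iri, None)

    def labels(self) -> list:
        """Return the labels in the order a pick list might show them."""
        return sorted(self.terms.values())


study_types = ValueSet("Longitudinal study types")
# Placeholder IRIs; the real terms would come from the Clinical Trials Ontology.
study_types.add("http://example.org/cto/prospective_study", "prospective study")
study_types.add("http://example.org/cto/retrospective_study", "retrospective study")
study_types.add("http://example.org/cto/hybrid_study", "hybrid study")
pick_list = study_types.labels()
```

A Study type field constrained to this value set would offer three choices instead of all 375
study type classes.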
3.3    Class creation
Despite the vast number of classes and properties available
in biomedical repositories, ontologies often do not contain
the exact term a user requires. To address this problem,
CEDAR allows users to dynamically define new classes and
use them immediately. When generating a new class,
users can optionally link it to one or several existing classes    Fig. 4. Example of class creation in the CEDAR Workbench. The
by means of the RDFS subclassOf relationship and SKOS              user creates the adductor dorsalis class and links it to the adductor
relationships (closeMatch, exactMatch, broadMatch, nar-            muscle class in the UBERON ontology via the subclassOf relation.




lists. For example, a user may wish to constrain the values of a disorder template field to all
subclasses in three specific branches of the DOID ontology rooted at the terms cognitive disorder,
sleep disorder, and dissociative disorder.
   To deal with use cases such as this one, the system effectively allows template designers to
constrain field values to any combination of (1) all classes in an ontology branch, (2) all
classes from a specific ontology, (3) new or existing classes, and (4) new or existing value sets.
Multiple constraint types can be specified for the same field.
   Users populating templates using the Metadata Editor are presented in real time with a list of
choices driven by these value constraints. Fig. 5 shows an example of choices presented for a
disorder field that has had its values constrained to come from the three DOID ontology disease
branches described in the earlier example. All terms from these three branches are combined in
real time and presented as a single list.

Fig. 5. Screenshot of the Metadata Editor that shows the possible values of a Disorder field in a
Study template. This field has been constrained to accept values from the branches of the DOID
ontology with roots cognitive disorder, sleep disorder, and dissociative disorder.

4    EVALUATION
CEDAR is working with several biomedical communities to perform an initial evaluation of our
ontology-based annotation functionality. This evaluation is being carried out in the context of
using CEDAR to develop metadata submission pipelines for three biomedical groups. These groups are
(1) the LINCS Consortium,4 which is developing a catalog of cellular signatures; (2) ImmPort,5 a
portal for immunology-related datasets; and (3) the AIRR Community,6 which is developing standards
for describing datasets acquired using advanced sequencing technologies. In all three cases, the
workflow is: (1) design metadata templates for each group’s relevant datasets; (2) enhance these
templates with ontology-based annotations; (3) have scientists populate the templates with
metadata describing their experiments; and (4) submit the generated metadata to the appropriate
repositories.
   Working together with the LINCS, ImmPort, and AIRR teams, we first used the Template Designer
tool to develop a basic version of the templates required by each group. We then annotated those
templates using ontologies. Each project required a slightly different annotation workflow.
   To annotate ImmPort data, members of the Human Immunology Project Consortium (HIPC)7 performed
an analysis of all fields and value constraints in the ImmPort system to identify appropriate
controlled-term linkages. They used the Template Designer to comprehensively annotate the ImmPort
templates with the controlled terms identified. They also specified value constraints for
controlled-value fields to ensure that the generated acquisition interfaces restricted the
acquisition of metadata to appropriate terms. In the cases where custom value sets were required
for fields, CEDAR used BioPortal’s value set features to let users define these resources. The
process for the AIRR community was slightly different, since that community already incorporated
ontology-based annotations as an integral part of their metadata-specification process. All these
annotations were available in spreadsheet format and the only required step was to formalize them
using the Template Designer. Finally, the LINCS team identified and encoded controlled term
linkage for an initial subset of their templates.
   The system successfully represented all required controlled-term annotations for the three
groups. We are now completing the metadata submission pipeline for each group. For the LINCS and
ImmPort projects, we are submitting the generated metadata into their community domain
repositories. The AIRR submission process involves submitting the generated metadata to the public
NCBI BioSample repository.8 We have completed prototype LINCS and NCBI pipeline submissions and
will evaluate the speed, reliability, and completeness of the submission process before releasing
each submission pipeline for public use.

5    DISCUSSION
Despite the growing number of ontologies in biomedicine, scientists rarely select standard terms
for describing their experiments. Consequently, finding scientific datasets and understanding the
corresponding experiments can be extremely hard and time-consuming, and often requires
considerable post-processing of metadata to extract relevant content. A fundamental problem is the
lack of convenient and openly available tools for linking metadata to ontologies. It takes time
and effort to create well-specified metadata and scientists often view the task of metadata
authoring as a burden that does not bring them any direct benefit.

4 http://www.lincsproject.org
5 http://www.immport.org
6 http://airr-community.org
7 https://www.immuneprofiling.org/hipc
8 https://www.ncbi.nlm.nih.gov/biosample
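
The way value constraints (Section 3.4) turn several sources into a single list of choices can be
sketched as follows. Everything here is an assumption for illustration: the toy hierarchy is not
actual DOID content, labels stand in for IRIs, and the FieldConstraints type is hypothetical.
Constraints that admit a whole ontology are omitted for brevity.

```python
from dataclasses import dataclass, field

# Toy child -> parent hierarchy standing in for an ontology (not real DOID content).
PARENTS = {
    "amnesia": "cognitive disorder",
    "dementia": "cognitive disorder",
    "insomnia": "sleep disorder",
    "dissociative amnesia": "dissociative disorder",
    "influenza": "viral infectious disease",
}


def branch_terms(root: str) -> set:
    """Return every descendant of root in the toy hierarchy (root excluded)."""
    found, frontier = set(), [root]
    while frontier:
        current = frontier.pop()
        for child, parent in PARENTS.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


@dataclass
class FieldConstraints:
    branches: list = field(default_factory=list)    # roots of ontology branches
    classes: list = field(default_factory=list)     # individually chosen classes
    value_sets: list = field(default_factory=list)  # each value set is a list of terms

    def allowed_values(self) -> list:
        """Merge all constraint sources into one sorted list of choices."""
        allowed = set(self.classes)
        for root in self.branches:
            allowed |= branch_terms(root)
        for value_set in self.value_sets:
            allowed |= set(value_set)
        return sorted(allowed)


disorder = FieldConstraints(
    branches=["cognitive disorder", "sleep disorder", "dissociative disorder"]
)
choices = disorder.allowed_values()
```

Terms outside the three branches (influenza in the toy data) never reach the list, which is what
lets an acquisition form restrict entry to appropriate terms.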




   The CEDAR Workbench allows template authors to make extensive use of ontologies from BioPortal
to add type and property assertions to template fields and to constrain the values of fields to
ontology terms. Once those templates are created, metadata authors can easily use them to generate
rich metadata without needing any understanding of ontology structures. The features described in
this paper represent a major step toward overcoming the barriers to the creation of high-quality
metadata in biomedicine. Through our approach, we hope to make it easier, and even fun, for
scientists to annotate their experimental data in ways that ensure their value to the scientific
community.
   We are studying a variety of technologies to further ease the work of entering metadata. We
developed a recommendation service that identifies common patterns in the metadata repository and
that generates real-time suggestions for filling out templates (Martínez-Romero et al., 2017).
This service is the first of a planned set of intelligent authoring components that will also
include the extraction and semantic annotation of templates and metadata from semi-structured
sources, such as spreadsheets, scientific articles, and Web pages.
   We also plan to develop an ontology enrichment pipeline in which ontology owners receive term
requests based on the new classes created from CEDAR, which could be used to refine and extend
their ontologies. The TermGenie (Dietze et al., 2014) tool for requesting new Gene Ontology
classes provides a model for the planned functionality.

ACKNOWLEDGMENTS
CEDAR is supported by the National Institutes of Health through an NIH Big Data to Knowledge
program under grant 1U54AI117925. NCBO is supported by the NIH Common Fund under grant
U54HG004028. The CEDAR Workbench is available at https://cedar.metadatacenter.net, and on GitHub
(https://github.com/metadatacenter).

REFERENCES
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J.,
       et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards
       for microarray data. Nat Genet, 29, 365–371.
Bui, Y., & Park, J.-R. (2006). An assessment of metadata quality: A case study of the National
       Science Digital Library Metadata Repository. Proceedings of CAIS/ACSI 2006.
Dietze, H., Berardini, T. Z., Foulger, R. E., Hill, D. P., Lomax, J., Osumi-Sutherland, D.,
       Roncaglia, P., et al. (2014). TermGenie – a web-application for pattern-based ontology
       class generation. Journal of Biomedical Semantics, 5(1), 48.
Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and
       hybridization array data repository. Nucleic Acids Res, 30(1), 207–210.
González-Beltrán, A., Maguire, E., Sansone, S.-A., & Rocca-Serra, P. (2014). linkedISA: semantic
       representation of ISA-Tab experimental metadata. BMC bioinformatics, 15 Suppl 1, S4.
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L.,
       Gevaert, O., et al. (2017). Fast and accurate metadata authoring using ontology-based
       recommendations. Proceedings of AMIA 2017 Annual Symposium (to appear).
McQuilton, P., Gonzalez-Beltran, A., Rocca-Serra, P., Thurston, M., Lister, A., Maguire, E., &
       Sansone, S.-A. (2016). BioSharing: curated and crowd-sourced metadata standards, databases
       and data policies in the life sciences. Database: the journal of biological databases and
       curation, 2016.
Musen, M. A., Bean, C. A., Cheung, K. H., Dumontier, M., Durante, K. A., Gevaert, O.,
       Gonzalez-Beltran, A., et al. (2015). The Center for Expanded Data Annotation and Retrieval.
       Journal of the American Medical Informatics Association, 22(6), 1148–1152.
Musen, M. A., Noy, N. F., Shah, N. H., Whetzel, P. L., Chute, C. G., Story, M.-A., & Smith, B.
       (2012). The National Center for Biomedical Ontology. Journal of the American Medical
       Informatics Association.
Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., et al.
       (2009). BioPortal: ontologies and integrated data resources at the click of a mouse.
       Nucleic acids research, 37(Web Server issue), W170-3.
O’Connor, M. J., Martinez-Romero, M., Egyedi, A. L., Willrett, D., Graybeal, J., & Musen, M. A.
       (2016). An open repository model for acquiring knowledge about scientific experiments.
       Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge
       Management (EKAW2016). (Vol. 10024, pp. 762–777).
Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R.,
       Farne, A., et al. (2005). ArrayExpress--a public repository for microarray gene expression
       data at the EBI. Nucleic acids research, 33(Database issue), D553–D555.
Rocca-Serra, P., Brandizi, M., Maguire, E., Sklyar, N., Taylor, C., Begley, K., Field, D., et al.
       (2010). ISA software suite: Supporting standards-compliant experimental annotation and
       enabling curation at the community level. Bioinformatics, 26(18), 2354.
Shankar, R., Parkinson, H., Burdett, T., Hastings, E., Liu, J., Miller, M., Srinivasa, R., et al.
       (2010). Annotare-a tool for annotating high-throughput biomedical investigations and
       resulting data. Bioinformatics, 26(19), 2470–2471.
Tenenbaum, J. D., Sansone, S.-A., & Haendel, M. (2014). A sea of standards for omics data: sink or
       swim? Journal of the American Medical Informatics Association, 21(2), 200–203.
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A.,
       Blomberg, N., et al. (2016). The FAIR Guiding Principles for scientific data management and
       stewardship. Scientific Data, 3, 160018.
Wolstencroft, K., Owen, S., Horridge, M., Krebs, O., Mueller, W., Snoep, J. L., du Preez, F.,
       et al. (2011). RightField: Embedding ontology annotation in spreadsheets. Bioinformatics,
       27(14), 2021–2022.



</pre>