=Paper=
{{Paper
|id=Vol-2137/paper_23.pdf
|storemode=property
|title=Supporting Ontology-Based Standardization of Biomedical Metadata in the CEDAR Workbench
|pdfUrl=https://ceur-ws.org/Vol-2137/paper_23.pdf
|volume=Vol-2137
|authors=Marcos Martínez-Romero,Martin J. O'Connor,Michael Dorf,Jennifer Vendetti,Debra Willrett,Attila L. Egyedi,John Graybeal,Mark A. Musen
|dblpUrl=https://dblp.org/rec/conf/icbo/RomeroODVWEGM17
}}
==Supporting Ontology-Based Standardization of Biomedical Metadata in the CEDAR Workbench==
Marcos Martínez-Romero*, Martin J. O’Connor, Michael Dorf, Jennifer Vendetti,
Debra Willrett, Attila L. Egyedi, John Graybeal, and Mark A. Musen
Center for Biomedical Informatics Research, Stanford University, 1265 Welch Rd, Stanford, CA 94305, USA
ABSTRACT
The availability of associated descriptive metadata for scientific datasets is important for discovering and reproducing scientific experiments. The use of ontologies has become a key focus for increasing the quality of these metadata. Despite the wide availability of biomedical ontologies, scientists wishing to use these ontologies when developing metadata descriptions face a number of practical difficulties. A core difficulty is the lack of tools for developing ontology-linked metadata specifications that can be published and shared. Additional difficulties include the lack of support for defining new terms in cases when no existing terms are found and for creating custom term collections to meet domain-specific needs. To address these problems, we developed tools that allow scientists to find terms in ontologies for annotating their data and to dynamically create new terms and value sets. This work has been incorporated into a Web-based platform called the CEDAR Workbench. The resulting integrated environment presents a set of highly interactive interfaces for creating and publishing ontology-rich metadata specifications.
1 INTRODUCTION
In biomedicine, high-quality, standardized metadata are crucial for facilitating the discovery of scientific datasets and reproducibility of the corresponding experiments. In the last few years, the biomedical community has driven the development of metadata standards and guidelines for a variety of experiment types. Scientists use these specifications to inform their annotation of experimental results (Tenenbaum, Sansone, & Haendel, 2014). One of the earliest examples is the MIAME standard (Brazma et al., 2001), which is used to describe metadata about microarray experiments. These standards and guidelines underpin metadata submissions to many public metadata repositories (Edgar, Domrachev, & Lash, 2002). The BioSharing resource (McQuilton et al., 2016) catalogs hundreds of these standardization efforts.
Despite the growing use of standards for defining metadata and the wide availability of biomedical ontologies, metadata submitted to public repositories rarely use standard terms (Bui & Park, 2006). As a result, finding or reusing the metadata is a challenge and understanding the underlying experiments can be extremely hard, often requiring significant post-processing of metadata to extract useful content.
A key problem is that scientists face considerable practical barriers when attempting to link their metadata to ontology terms. Submission mechanisms for biomedical repositories are typically based on spreadsheets, with a variety of ad hoc formats that rarely support inclusion of ontology-based annotations. Even in cases where such annotations can be entered, scientists have no easy way to find and use terms from ontologies to include in their metadata submissions. Other difficulties include poor support for on-the-fly term creation when the necessary terms are not found and for creating custom lists of terms to meet domain-specific needs.
A variety of tools have been developed to address the challenge of metadata quality. Foremost among these are the ISA Tools (Rocca-Serra et al., 2010), which allow curators to create spreadsheet-based submissions for metadata repositories. LinkedISA provides a means to interoperate with Linked Open Data, effectively adding controlled term linkage to templates (González-Beltrán, Maguire, Sansone, & Rocca-Serra, 2014). A similar spreadsheet-based tool called RightField (Wolstencroft et al., 2011) provides a mechanism for embedding ontology annotation capabilities in Excel or Open Office spreadsheets using ontologies from the BioPortal repository (Noy et al., 2009). Annotare (Shankar et al., 2010), which is used to submit experimental data to the ArrayExpress metadata repository (Parkinson et al., 2005), also supports ontology-based suggestions. These tools address specific issues of metadata quality but they do not provide an integrated environment that can support the entire metadata specification and submission process for widely used biomedical repositories.
The Center for Expanded Data Annotation and Retrieval (CEDAR, https://metadatacenter.org/) is developing a computational ecosystem to overcome the barriers to creating high-quality metadata in biomedicine (Musen et al., 2015). CEDAR provides a suite of highly sophisticated tools designed to make the authoring of metadata as natural as possible, while also using ontologies to enrich the generated descriptions with standard terms.
In this paper, we describe the main features CEDAR developed to make it possible to easily construct Web-based metadata-acquisition forms, enrich those forms with ontology concepts, and then fill out the forms to create ontology-annotated descriptions of scientific experiments.
2 BACKGROUND
The CEDAR Workbench (https://cedar.metadatacenter.net) is a suite of Web-based tools and REST APIs centered on the use of highly modular metadata-acquisition forms called metadata templates (or simply templates). These templates define the data attributes (termed template fields or fields) needed to describe biomedical experiments. For example, an experiment template may have an organism field containing the name of the organism being studied by the experiment (e.g., Homo sapiens). The templates may specify lists of permissible values for template fields. The central goal when designing a template is to enable the capture of sufficiently precise and complete metadata about experimental data to facilitate data discovery, interpretation, and reuse.
The CEDAR Workbench provides three core components that form a metadata construction pipeline (Fig. 1): (1) a Template Designer, which supports interactive template creation; (2) a Metadata Editor, which allows end-users to fill in templates with metadata; and (3) a Metadata Repository for storing both templates and the metadata created using those templates. The CEDAR Workbench also allows scientists to upload the metadata created to public biomedical repositories.
Fig. 1. An overview of CEDAR’s metadata authoring workflow. Template authors use the Template Designer tool to create metadata templates. The Metadata Editor uses these templates to generate a graphical interface to acquire metadata from scientists. Acquired metadata are saved in CEDAR’s Metadata Repository.
2.1 Template Designer and Metadata Editor
In the Template Designer, template authors assemble templates from one or more input fields. There are numerous field types available to template authors (e.g., text, paragraph, e-mail, numeric, and date). Users can also define reusable groups of fields, called elements. For example, the fields that describe a publication (e.g., authors, title, year, publication type, etc.) could be grouped together to form a publication element, which can then be reused in multiple templates. After a template is created, the Metadata Editor can be used to automatically generate a forms-based acquisition interface for entering metadata for that template. Scientists entering metadata using the Metadata Editor are prompted in real time with drop-down lists, auto-completion suggestions, and verification hints, significantly reducing their error rate while speeding metadata entry and repair. These prompts are driven by the value constraints specified in templates.
2.2 Metadata Repository
Templates and metadata produced by the Workbench are stored in CEDAR’s metadata repository. CEDAR incorporates a standardized model of templates and metadata, together with Web-based services to store, search, and share these resources (O’Connor et al., 2016). This model is based on the JSON Schema and JSON-LD specifications. It allows users to publish their metadata as both JSON-LD and RDF, thus facilitating interoperation with Linked Open Data.
2.3 Support for ontology-based metadata
The CEDAR tools provide mechanisms for structurally describing templates and publishing metadata created using those templates in an open format. To increase the metadata quality further, we offer the ability to enrich these descriptions with controlled terms from ontologies. We extended the Template Designer and Metadata Editor to let users specify semantic content for templates and to easily enter semantically precise terms in their metadata. These extensions can help to improve metadata adherence to the FAIR data principles (Wilkinson et al., 2016) and interoperability with Linked Open Data.
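As a rough illustration of how an ontology-annotated field value could be published as JSON-LD and read as RDF, the following Python sketch builds a small metadata instance for the organism example above and flattens it into subject-predicate-object triples. The property IRI under example.org, the instance identifier, the function names, and the overall document layout are assumptions made for illustration; they are not CEDAR's actual template model. Only the NCBI Taxonomy IRI for Homo sapiens and the RDFS label IRI are real identifiers.

```python
import json

# Hypothetical property IRI; CEDAR's real model defines its own vocabulary.
ORGANISM_PROPERTY = "https://example.org/schema/organism"
# Real OBO identifier for Homo sapiens in the NCBI Taxonomy ontology.
HOMO_SAPIENS_IRI = "http://purl.obolibrary.org/obo/NCBITaxon_9606"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def build_instance(instance_id, organism_iri, organism_label):
    """Return a JSON-LD document whose organism value is an ontology term."""
    return {
        "@context": {
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "organism": {"@id": ORGANISM_PROPERTY},
        },
        "@id": instance_id,
        "organism": {"@id": organism_iri, "rdfs:label": organism_label},
    }


def to_triples(doc):
    """Flatten the document into (subject, predicate, object) triples.

    Handles only the simple shape produced by build_instance; a real
    application would rely on a full JSON-LD processor instead.
    """
    context = doc["@context"]
    subject = doc["@id"]
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        predicate = context[key]["@id"]
        triples.append((subject, predicate, value["@id"]))
        triples.append((value["@id"], RDFS_LABEL, value["rdfs:label"]))
    return triples


if __name__ == "__main__":
    instance = build_instance(
        "https://example.org/instances/1", HOMO_SAPIENS_IRI, "Homo sapiens"
    )
    print(json.dumps(instance, indent=2))
    for triple in to_triples(instance):
        print(triple)
```

Because the field value is a term IRI rather than a free-text string, two instances that use different spellings of the organism name would still resolve to the same RDF node, which is the kind of interoperability with Linked Open Data that this section describes.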
3 IMPLEMENTATION
We have enhanced the CEDAR Workbench to provide the ability to link ontology terms selected from BioPortal to biomedical metadata. BioPortal, developed by the National Center for Biomedical Ontology (NCBO) (Musen et al., 2012), is a popular platform for hosting and sharing biomedical ontologies. It provides access to more than 550 ontologies, and contains over 8 million classes and 64,000 properties. The BioPortal API provides a rich set of operations to access and use ontologies. We extended this API to provide the fine-grained, highly interactive class and property lookup features needed by CEDAR’s term search and selection features. To facilitate general use of these features, we encapsulated BioPortal’s API as a CEDAR service and made it available as a public REST endpoint. The REST endpoints that provide ontology-based services to the CEDAR Workbench are documented at https://terminology.metadatacenter.net/api. We now describe these extensions.
3.1 Class and Property Search
CEDAR allows template authors to search for ontology terms to annotate their templates, that is, to add type and property assertions to template elements and fields using ontology classes and properties. Classes and object, data, and annotation properties for performing these annotations can be selected from terms supplied by BioPortal. Fig. 2 shows a screenshot of the ontology lookup user interface of the CEDAR Workbench. In the example shown, the template author entered the search term publication and then selected the Publication class from the National Cancer Institute Thesaurus (NCIT). The interface shows detailed information both for the selected class and the associated ontology, as well as for the position of the class in the class tree of NCIT.
Fig. 2. Screenshot of CEDAR Template Designer’s ontology lookup interface. Here, the user entered the search term publication and selected the class Publication from the National Cancer Institute Thesaurus (NCIT). The location of the selected class in the class tree is presented, as well as class and ontology details.
3.2 Value Set creation
A value set is a list of possible values for a specific purpose. In the CEDAR Workbench, value sets are a useful mechanism to define pick lists of permissible values for template fields. CEDAR works in conjunction with BioPortal to allow template authors to dynamically create value sets containing the terms in these pick lists. Value sets can contain
classes from any combination of BioPortal ontologies. Upon creation, a value set is immediately assigned a unique, provisional IRI. The CEDAR Workbench supports the creation, retrieval, update, and deletion of these value sets.
For example, suppose that the template author wishes to constrain the values of a Study type field to three specific types of longitudinal studies (prospective study, retrospective study, and hybrid study). The Clinical Trials Ontology (CTO) is a good source of these types since it contains 375 study type classes (represented as descendants of the Study type class). Instead of selecting all these types, the template author can create a value set containing only the desired types. Fig. 3 shows a screenshot of value set creation features for this example presented in the Template Designer. Here, the user creates a value set named Longitudinal study types with three terms selected from CTO.
Fig. 3. Screenshot showing an example of value set creation. The user is building a Longitudinal study types value set with terms from the Clinical Trials Ontology (shown as CTO).
3.3 Class creation
Despite the vast number of classes and properties available in biomedical repositories, ontologies often do not contain the exact term a user requires. To address this problem, CEDAR allows users to dynamically define new classes and to use them immediately. When generating a new class, users can optionally link it to one or several existing classes by means of the RDFS subclassOf relationship and SKOS relationships (closeMatch, exactMatch, broadMatch, narrowMatch, relatedMatch). Upon creation, a class is immediately assigned a unique, provisional IRI.
For example, suppose that a user needs to use the anatomical term adductor dorsalis. This term is not available in any BioPortal ontology, though the adductor muscle class in the UBERON ontology is a close conceptual match. In this case, the user decides to create an adductor dorsalis class via the CEDAR Workbench and indicate that the new term is a subclass of the adductor muscle UBERON class. Fig. 4 shows the class creation interface for this example. The adductor dorsalis class is stored in BioPortal as a CEDAR provisional class and is immediately available to all CEDAR users. Eventually, maintainers of UBERON may decide to incorporate the adductor dorsalis class into the ontology or may decide to reject it. If the class is added to UBERON, the permanent identifier for the class will be stored as part of the information of the provisional class. If adductor dorsalis is not included in the next version of the ontology, the subclassOf link will be removed, but the class will still be valid in CEDAR.
Fig. 4. Example of class creation in the CEDAR Workbench. The user creates the adductor dorsalis class and links it to the adductor muscle class in the UBERON ontology via the subclassOf relation.
3.4 Value Constraints
With the above functionality, the system can limit the possible values of a template field to predefined sets of ontology terms or value sets. Some template authors may need to define value constraints that go beyond predefined term
lists. For example, a user may wish to constrain the values of a disorder template field to all subclasses in three specific branches of the DOID ontology rooted at the terms cognitive disorder, sleep disorder, and dissociative disorder.
To deal with use cases such as this one, the system effectively allows template designers to constrain field values to any combination of (1) all classes in an ontology branch, (2) all classes from a specific ontology, (3) new or existing classes, and (4) new or existing value sets. Multiple constraint types can be specified for the same field.
Users populating templates using the Metadata Editor are presented in real time with a list of choices driven by these value constraints. Fig. 5 shows an example of choices presented for a disorder field that has had its values constrained to come from the three DOID ontology disease branches described in the earlier example. All terms from these three branches are combined in real time and presented as a single list.
Fig. 5. Screenshot of the Metadata Editor that shows the possible values of a Disorder field in a Study template. This field has been constrained to accept values from the branches of the DOID ontology with roots cognitive disorder, sleep disorder, and dissociative disorder.
4 EVALUATION
CEDAR is working with several biomedical communities to perform an initial evaluation of our ontology-based annotation functionality. This evaluation is being carried out in the context of using CEDAR to develop metadata submission pipelines for three biomedical groups. These groups are (1) the LINCS Consortium (http://www.lincsproject.org), which is developing a catalog of cellular signatures; (2) ImmPort (http://www.immport.org), a portal for immunology-related datasets; and (3) the AIRR Community (http://airr-community.org), which is developing standards for describing datasets acquired using advanced sequencing technologies. In all three cases, the workflow is: (1) design metadata templates for each group’s relevant datasets; (2) enhance these templates with ontology-based annotations; (3) have scientists populate the templates with metadata describing their experiments; and (4) submit the generated metadata to the appropriate repositories.
Working together with the LINCS, ImmPort, and AIRR teams, we first used the Template Designer tool to develop a basic version of the templates required by each group. We then annotated those templates using ontologies. Each project required a slightly different annotation workflow.
To annotate ImmPort data, members of the Human Immunology Project Consortium (HIPC, https://www.immuneprofiling.org/hipc) performed an analysis of all fields and value constraints in the ImmPort system to identify appropriate controlled-term linkages. They used the Template Designer to comprehensively annotate the ImmPort templates with the controlled terms identified. They also specified value constraints for controlled-value fields to ensure that the generated acquisition interfaces restricted the acquisition of metadata to appropriate terms. In the cases where custom value sets were required for fields, CEDAR used BioPortal’s value set features to let users define these resources. The process for the AIRR community was slightly different, since that community already incorporated ontology-based annotations as an integral part of their metadata-specification process. All these annotations were available in spreadsheet format and the only required step was to formalize them using the Template Designer. Finally, the LINCS team identified and encoded controlled term linkage for an initial subset of their templates.
The system successfully represented all required controlled-term annotations for the three groups. We are now completing the metadata submission pipeline for each group. For the LINCS and ImmPort projects, we are submitting the generated metadata into their community domain repositories. The AIRR submission process involves submitting the generated metadata to the public NCBI BioSample repository (https://www.ncbi.nlm.nih.gov/biosample). We have completed prototype LINCS and NCBI pipeline submissions and will evaluate the speed, reliability, and completeness of the submission process before releasing each submission pipeline for public use.
5 DISCUSSION
Despite the growing number of ontologies in biomedicine, scientists rarely select standard terms for describing their experiments. Consequently, finding scientific datasets and understanding the corresponding experiments can be extremely hard and time-consuming, and often requires considerable post-processing of metadata to extract relevant content. A fundamental problem is the lack of convenient and openly available tools for linking metadata to ontologies. It takes time and effort to create well-specified metadata and scientists often view the task of metadata authoring as a burden that does not bring them any direct benefit.
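To show how little a metadata author needs to know about ontology structure, the combination of value constraints described in Section 3.4 can be sketched in a few lines of Python. The sketch below merges several branch, class, and value-set constraints into one pick list and filters it by a typed prefix. Everything in it is an assumption made for illustration: the hierarchy is a toy example and not a faithful excerpt of DOID, the constraint tuples and function names are hypothetical, and treating the branch root as a permissible value is a design choice of this sketch rather than documented CEDAR behavior.

```python
from collections import deque

# Toy parent -> children hierarchy; NOT a faithful excerpt of DOID.
TOY_ONTOLOGY = {
    "disease of mental health": [
        "cognitive disorder",
        "sleep disorder",
        "dissociative disorder",
        "mood disorder",
    ],
    "cognitive disorder": ["dementia", "amnestic disorder"],
    "dementia": ["vascular dementia"],
    "sleep disorder": ["insomnia", "narcolepsy"],
    "dissociative disorder": ["dissociative amnesia"],
    "mood disorder": ["bipolar disorder"],
}


def branch_classes(root, ontology):
    """Return the root and all of its descendants (breadth-first)."""
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        if term in seen:
            continue
        seen.add(term)
        queue.extend(ontology.get(term, []))
    return seen


def permissible_values(constraints, ontology, value_sets):
    """Combine several constraints into one de-duplicated, sorted list."""
    values = set()
    for kind, target in constraints:
        if kind == "branch":
            values |= branch_classes(target, ontology)
        elif kind == "class":
            values.add(target)
        elif kind == "value_set":
            values |= set(value_sets[target])
        else:
            raise ValueError(f"unknown constraint type: {kind}")
    return sorted(values)


def suggest(prefix, choices):
    """Case-insensitive prefix filter standing in for auto-completion."""
    prefix = prefix.lower()
    return [choice for choice in choices if choice.lower().startswith(prefix)]


if __name__ == "__main__":
    disorder_constraints = [
        ("branch", "cognitive disorder"),
        ("branch", "sleep disorder"),
        ("branch", "dissociative disorder"),
    ]
    choices = permissible_values(disorder_constraints, TOY_ONTOLOGY, {})
    print(choices)
    print(suggest("dis", choices))
```

Representing each constraint as a small tagged tuple keeps the merge step independent of where the terms come from, so a single field can mix branches, individual classes, and value sets in the way Section 3.4 describes.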
The CEDAR Workbench allows template authors to make extensive use of ontologies from BioPortal to add type and property assertions to template fields and to constrain the values of fields to ontology terms. Once those templates are created, metadata authors can easily use them to generate rich metadata without needing any understanding of ontology structures. The features described in this paper represent a major step toward overcoming the barriers to the creation of high-quality metadata in biomedicine. Through our approach, we hope to make it easier, and even fun, for scientists to annotate their experimental data in ways that ensure their value to the scientific community.
We are studying a variety of technologies to further ease the work of entering metadata. We developed a recommendation service that identifies common patterns in the metadata repository and that generates real-time suggestions for filling out templates (Martínez-Romero et al., 2017). This service is the first of a planned set of intelligent authoring components that will also include the extraction and semantic annotation of templates and metadata from semi-structured sources, such as spreadsheets, scientific articles, and Web pages.
We also plan to develop an ontology enrichment pipeline in which ontology owners receive term requests based on the new classes created from CEDAR, which could be used to refine and extend their ontologies. The TermGenie (Dietze et al., 2014) tool for requesting new Gene Ontology classes provides a model for the planned functionality.
ACKNOWLEDGMENTS
CEDAR is supported by the National Institutes of Health through an NIH Big Data to Knowledge program under grant 1U54AI117925. NCBO is supported by the NIH Common Fund under grant U54HG004028. The CEDAR Workbench is available at https://cedar.metadatacenter.net, and on GitHub (https://github.com/metadatacenter).
REFERENCES
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics, 29, 365–371.
Bui, Y., & Park, J.-R. (2006). An assessment of metadata quality: A case study of the National Science Digital Library Metadata Repository. Proceedings of CAIS/ACSI 2006.
Dietze, H., Berardini, T. Z., Foulger, R. E., Hill, D. P., Lomax, J., Osumi-Sutherland, D., Roncaglia, P., et al. (2014). TermGenie – a web-application for pattern-based ontology class generation. Journal of Biomedical Semantics, 5(1), 48.
Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210.
González-Beltrán, A., Maguire, E., Sansone, S.-A., & Rocca-Serra, P. (2014). linkedISA: semantic representation of ISA-Tab experimental metadata. BMC Bioinformatics, 15 Suppl 1, S4.
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., et al. (2017). Fast and accurate metadata authoring using ontology-based recommendations. Proceedings of AMIA 2017 Annual Symposium (to appear).
McQuilton, P., Gonzalez-Beltran, A., Rocca-Serra, P., Thurston, M., Lister, A., Maguire, E., & Sansone, S.-A. (2016). BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences. Database: The Journal of Biological Databases and Curation, 2016.
Musen, M. A., Bean, C. A., Cheung, K. H., Dumontier, M., Durante, K. A., Gevaert, O., Gonzalez-Beltran, A., et al. (2015). The Center for Expanded Data Annotation and Retrieval. Journal of the American Medical Informatics Association, 22(6), 1148–1152.
Musen, M. A., Noy, N. F., Shah, N. H., Whetzel, P. L., Chute, C. G., Story, M.-A., & Smith, B. (2012). The National Center for Biomedical Ontology. Journal of the American Medical Informatics Association.
Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., et al. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37(Web Server issue), W170-3.
O’Connor, M. J., Martinez-Romero, M., Egyedi, A. L., Willrett, D., Graybeal, J., & Musen, M. A. (2016). An open repository model for acquiring knowledge about scientific experiments. Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW2016) (Vol. 10024, pp. 762–777).
Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R., Farne, A., et al. (2005). ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 33(Database issue), D553–D555.
Rocca-Serra, P., Brandizi, M., Maguire, E., Sklyar, N., Taylor, C., Begley, K., Field, D., et al. (2010). ISA software suite: Supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26(18), 2354.
Shankar, R., Parkinson, H., Burdett, T., Hastings, E., Liu, J., Miller, M., Srinivasa, R., et al. (2010). Annotare-a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics, 26(19), 2470–2471.
Tenenbaum, J. D., Sansone, S.-A., & Haendel, M. (2014). A sea of standards for omics data: sink or swim? Journal of the American Medical Informatics Association, 21(2), 200–203.
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.
Wolstencroft, K., Owen, S., Horridge, M., Krebs, O., Mueller, W., Snoep, J. L., du Preez, F., et al. (2011). RightField: Embedding ontology annotation in spreadsheets. Bioinformatics, 27(14), 2021–2022.