=Paper= {{Paper |id=Vol-2029/ks2 |storemode=property |title=Making Online Datasets More Searchable and Accessible: The CEDAR project |pdfUrl=https://ceur-ws.org/Vol-2029/ks2.pdf |volume=Vol-2029 |authors=Mark A. Musen |dblpUrl=https://dblp.org/rec/conf/simbig/Musen17 }} ==Making Online Datasets More Searchable and Accessible: The CEDAR project== https://ceur-ws.org/Vol-2029/ks2.pdf
    Making Online Datasets More Searchable and Accessible: The CEDAR
                                 project

                                       Mark A. Musen
                     Stanford Center for Biomedical Informatics Research
             1265 Welch Road, Room X-215, Stanford, California 94305-5479, USA
                                 musen@stanford.edu


Abstract

Scientists increasingly are archiving their data in online repositories to promote open science and data reuse. The ability to find and access datasets that are stored in these repositories depends on the quality of the associated metadata. There is a growing set of community-developed standards for defining such metadata, often in the form of metadata templates. The practical difficulties of working with these templates are tremendous, however. The Center for Expanded Data Annotation and Retrieval (CEDAR) is developing technologies to assist in the management of biomedical metadata. By discovering patterns in existing metadata and by linking templates to biomedical ontologies, CEDAR is assisting the authoring of new, high-quality metadata. The availability of comprehensive and expressive metadata will facilitate data discovery, interoperability, and reuse.

1    Introduction

The past few years have seen an increasing call for “open science,” where investigators make their data available for public access and reuse (Nosek, B.A., et al., 2015). There are obvious opportunities to make new discoveries by examining, integrating, and analyzing data provided by other scientists. Funding organizations and journal editors are increasingly insisting that investigators place their experimental data in public repositories for the benefit of the scientific community. The problem, however, is that submitting data to a public repository can be an onerous task that most investigators would like to avoid. Online datasets need to be supplemented by metadata (data about the data) that describe the subjects of the experiment, the conditions under which the data were collected, and the major steps that the investigators followed to perform their study. Good metadata are needed for other scientists to be able to search for relevant datasets, to make sense of the data, and to know how to reanalyze the data. The problem is that most datasets are annotated with very poor metadata (Gonçalves, R.S., et al., 2017). Metadata authors are burdened by cumbersome requirements and receive too little guidance; as a result, metadata are often riddled with typographical errors and often fail to incorporate standard ontological terms when required. There is a clear need for methods to make it easier for scientists to author high-quality metadata and to archive their datasets in a manner that will assure that the data will be findable, accessible, interoperable, and reusable (FAIR (Wilkinson, M.D., et al., 2016)). We believe that the fundamental challenge of the open-science movement is effective annotation of datasets with metadata that are complete and comprehensive. CEDAR is committed to the development of tools that make it easy for scientists to create high-quality metadata (Musen, M.A., et al., 2015).

2   The CEDAR Workbench

CEDAR is building a suite of tools, known as the CEDAR Workbench, that form a pipeline for authoring experimental metadata (O’Connor, M.J., et al., 2016). We are working in the area of biomedical science, where there is already a trend for different scientific communities to specify standardized templates that capture the minimal requirements for metadata related to different classes of experiments (Taylor, C.F., Field, D., and Sansone, S.A., 2008).

Metadata Template Repository: We have developed a standardized representation of metadata templates together with Web-based services to store, search, and share these templates. Templates created using CEDAR technology are stored
in our openly accessible community repository. Researchers access the repository to search for appropriate templates to annotate their studies. Web-based interfaces and REST APIs enable access to all metadata templates, as well as to all the metadata collected using those templates (O’Connor, M.J., et al., 2016).

Metadata Template Creator and Template Editor: Two highly interactive Web-based tools simplify the process of authoring metadata templates. The Template Creator allows users to create, search, and author metadata templates. Using interactive look-up services linked to the NCBO BioPortal, template authors can find terms in ontologies to annotate their templates and to restrict the values of template fields. The Template Creator automatically produces a user interface specification as it builds a template. The Metadata Editor uses this specification to generate a forms-based acquisition interface for acquiring individual metadata components.

Intelligent Authoring: To ease the burden of authoring high-quality metadata, a recommender framework learns associations between data elements and suggests to the user context-sensitive metadata values (Martínez-Romero, M., et al., 2017). The system can recommend possible values for metadata elements during the submission process as each blank is selected and the user begins to type. The template editor also sorts possible selections in drop-down windows so that the terms that occur in the database with the greatest frequency, in the context of the other entries that have already been made in the template, appear at the top of the drop-down list. The goal is to make it as simple as possible for metadata authors to fill in the templates, using as many entries from standard ontologies as they can, and to allow the authors to do so as quickly and as accurately as possible.

3   Deployment and Evaluation

The CEDAR team includes several community-based groups who are helping to develop and evaluate our current system. These collaborators include (1) the BioSharing initiative, which catalogs metadata standards for describing biomedical experiments (McQuilton, P., et al., 2016), (2) ImmPort, a data warehouse of immunology-related datasets (Bhattacharya, S., et al., 2014), and (3) the Human Immunology Project Consortium Standards Working Group, which designs new metadata templates and channels experimental datasets to the ImmPort repository. We successfully have represented metadata from several hundred studies provided by these groups within the CEDAR Workbench. We also are working with the LINCS project to develop a more robust metadata management pipeline that supports the authoring of metadata for a wide range of studies (Vempati, U.D., et al., 2014). Collaborations with other scientific consortia are in the planning stage, with the long-term goal of making all scientific data easier to find, access, integrate, and reuse.

Acknowledgments

CEDAR is supported by NIAID grant U54 AI117925 through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative. CEDAR includes participation from groups at Stanford University, Yale University, the University of Oxford, and Northrop Grumman Corporation. Martin J. O’Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, and John Graybeal have contributed to the development of the CEDAR Workbench. Additional information about CEDAR is available from the Center’s Web site: http://metadatacenter.org.

References

Bhattacharya, S., Andorf, S., Gomes, L., et al. 2014. ImmPort: disseminating data to the public for the future of immunology. Immunologic Research 58(2-3):234-239.

Gonçalves, R.S., O’Connor, M.J., Martínez-Romero, M., et al. 2017. Metadata in the BioSample online repository are impaired by numerous anomalies. Proceedings of SemSci: Enabling Open Semantic Science. International Semantic Web Conference. Vienna, Austria.

Martínez-Romero, M., O’Connor, M.J., Shankar, R., et al. 2017. Fast and accurate metadata authoring using ontology-based recommendations. Proceedings of the American Medical Informatics Association Annual Symposium. Washington, DC.

McQuilton, P., Gonzalez-Beltran, A., Rocca-Serra, P., et al. 2016. BioSharing: curated and crowd-sourced metadata standards, databases, and data policies in the life sciences. Database 2016, doi: 10.1093/database/baw075.

Musen, M.A., Bean, C.A., Cheung, K.-H., et al. 2015. The Center for Expanded Data Annotation and Retrieval. Journal of the American Medical Informatics Association 22(6):1148-1152.

Nosek, B.A., Alter, G., Banks, G.C., et al. 2015. Promoting an open research culture. Science 348(6242):1422-1424.

O’Connor, M.J., Martínez-Romero, M., Egyedi, A.L., et al. 2016. An open repository model for acquiring knowledge about scientific experiments. Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge Management. Bologna, Italy.

Taylor, C.F., Field, D., and Sansone, S.A. 2008. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology 26:889-896.

Vempati, U.D., Chung, C., Mader, C., et al. 2014. Specifications to describe, model, and integrate complex and diverse high-throughput screening data from the Library of Integrated Network-based Cellular Signatures (LINCS). Journal of Biomolecular Screening 19(5):803-816.

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al. 2016. The FAIR guiding principles for scientific data management and stewardship. Nature Scientific Data 3:160018.
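
As an illustration of the field restrictions described in Section 2 (Metadata Template Creator and Template Editor), the following minimal sketch shows one way a template field could carry a set of permitted ontology terms and how a metadata instance might be checked against it. It is not CEDAR's template representation: the names TemplateField and validate, the field layout, and the example.org IRIs are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateField:
    """One field of a metadata template (hypothetical structure)."""
    name: str
    required: bool = False
    # Permitted ontology terms (IRI -> label); empty means free text.
    allowed_terms: dict = field(default_factory=dict)


def validate(template, instance):
    """Return a list of human-readable problems in a metadata instance."""
    problems = []
    for fld in template:
        value = instance.get(fld.name)
        if value is None or value == "":
            if fld.required:
                problems.append(f"{fld.name}: required value is missing")
            continue
        if fld.allowed_terms and value not in fld.allowed_terms:
            problems.append(f"{fld.name}: {value!r} is not a permitted term")
    return problems


# Placeholder IRIs; a real template would use terms found through BioPortal.
template = [
    TemplateField("tissue", required=True, allowed_terms={
        "http://example.org/onto/LIVER": "liver",
        "http://example.org/onto/LUNG": "lung",
    }),
    TemplateField("notes"),
]

good = {"tissue": "http://example.org/onto/LIVER", "notes": "pilot run"}
bad = {"tissue": "livr"}  # a typographical error instead of an ontology term

print(validate(template, good))
print(validate(template, bad))
```

Restricting a field to term identifiers rather than free text is what lets this kind of check catch typographical errors at authoring time rather than after submission.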




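
The drop-down ordering described under Intelligent Authoring in Section 2 can be approximated with a simple co-occurrence count. The sketch below is an assumption-laden stand-in, not the recommender of Martínez-Romero et al.: it ranks candidate values for one field by how often they appear in earlier records that agree with the entries already made, and it filters by whatever the user has typed so far. The function name rank_values, its parameters, and the sample records are hypothetical.

```python
from collections import Counter


def rank_values(records, target, context, prefix=""):
    """Rank values of `target` by frequency among records matching `context`.

    records: previously submitted metadata records (list of dicts)
    target:  name of the field being filled in
    context: field -> value entries already made in the template
    prefix:  text typed so far; only values starting with it are kept
    """
    counts = Counter(
        r[target]
        for r in records
        if target in r
        and all(r.get(k) == v for k, v in context.items())
        and r[target].lower().startswith(prefix.lower())
    )
    # Most frequent first; alphabetical order breaks ties deterministically.
    return sorted(counts, key=lambda v: (-counts[v], v))


records = [
    {"organism": "human", "tissue": "liver"},
    {"organism": "human", "tissue": "liver"},
    {"organism": "human", "tissue": "lung"},
    {"organism": "mouse", "tissue": "lung"},
    {"organism": "mouse", "tissue": "lung"},
    {"organism": "mouse", "tissue": "liver"},
]

print(rank_values(records, "tissue", {"organism": "human"}))
print(rank_values(records, "tissue", {"organism": "mouse"}))
print(rank_values(records, "tissue", {}, prefix="lu"))
```

With these sample records the suggested order for tissue changes with the organism already entered, which is the context-sensitive behaviour the section describes. A production recommender would need to handle partial context matches and much larger repositories, which this sketch does not attempt.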