=Paper=
{{Paper
|id=Vol-2029/ks2
|storemode=property
|title=Making Online Datasets More Searchable and Accessible: The CEDAR project
|pdfUrl=https://ceur-ws.org/Vol-2029/ks2.pdf
|volume=Vol-2029
|authors=Mark A. Musen
|dblpUrl=https://dblp.org/rec/conf/simbig/Musen17
}}
==Making Online Datasets More Searchable and Accessible: The CEDAR project==
Mark A. Musen, Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Room X-215, Stanford, California 94305-5479, USA, musen@stanford.edu

===Abstract===
Scientists increasingly are archiving their data in online repositories to promote open science and data reuse. The ability to find and access datasets that are stored in these repositories depends on the quality of the associated metadata. There is a growing set of community-developed standards for defining such metadata, often in the form of metadata templates. The practical difficulties of working with these templates are tremendous, however. The Center for Expanded Data Annotation and Retrieval (CEDAR) is developing technologies to assist in the management of biomedical metadata. By discovering patterns in existing metadata and by linking templates to biomedical ontologies, CEDAR is assisting the authoring of new, high-quality metadata. The availability of comprehensive and expressive metadata will facilitate data discovery, interoperability, and reuse.

===1 Introduction===
The past few years have seen an increasing call for “open science,” where investigators make their data available for public access and reuse (Nosek, B.A., et al., 2015). There are obvious opportunities to make new discoveries by examining, integrating, and analyzing data provided by other scientists. Funding organizations and journal editors are increasingly insisting that investigators place their experimental data in public repositories for the benefit of the scientific community. The problem, however, is that submitting data to a public repository can be an onerous task that most investigators would like to avoid.

Online datasets need to be supplemented by metadata (data about the data) that describe the subjects of the experiment, the conditions under which the data were collected, and the major steps that the investigators followed to perform their study. Good metadata are needed for other scientists to be able to search for relevant datasets, to make sense of the data, and to know how to reanalyze the data. The problem is that most datasets are annotated with very poor metadata (Gonçalves, R.S., et al., 2017). Metadata authors are burdened by cumbersome requirements and receive too little guidance. As a result, metadata are often riddled with typographical errors and often fail to incorporate standard ontological terms when required.

There is a clear need for methods to make it easier for scientists to author high-quality metadata and to archive their datasets in a manner that will assure that the data will be findable, accessible, interoperable, and reusable (FAIR; Wilkinson, M.D., et al., 2016). We believe that the fundamental challenge of the open-science movement is effective annotation of datasets with metadata that are complete and comprehensive. CEDAR is committed to the development of tools that make it easy for scientists to create high-quality metadata (Musen, M.A., et al., 2015).

===2 The CEDAR Workbench===
CEDAR is building a suite of tools, known as the CEDAR Workbench, that form a pipeline for authoring experimental metadata (O’Connor, M.J., et al., 2016). We are working in the area of biomedical science, where there is already a trend for different scientific communities to specify standardized templates that capture the minimal requirements for metadata related to different classes of experiments (Taylor, C.F., Field, D., and Sansone, S.A., 2008).

Metadata Template Repository: We have developed a standardized representation of metadata templates together with Web-based services to store, search, and share these templates. Templates created using CEDAR technology are stored in our openly accessible community repository. Researchers access the repository to search for appropriate templates to annotate their studies. Web-based interfaces and REST APIs enable access to all metadata templates, as well as to all the metadata collected using those templates (O’Connor, M.J., et al., 2016).

Metadata Template Creator and Template Editor: Two highly interactive Web-based tools simplify the process of authoring metadata templates. The Template Creator allows users to create, search, and author metadata templates. Using interactive look-up services linked to the NCBO BioPortal, template authors can find terms in ontologies to annotate their templates and to restrict the values of template fields. The Template Creator automatically produces a user interface specification as it builds a template. The Metadata Editor uses this specification to generate a forms-based acquisition interface for acquiring individual metadata components.

Intelligent Authoring: To ease the burden of authoring high-quality metadata, a recommender framework learns associations between data elements and suggests context-sensitive metadata values to the user (Martínez-Romero, M., et al., 2017). The system can recommend possible values for metadata elements during the submission process as each blank is selected and the user begins to type. The template editor also sorts possible selections in drop-down windows so that the patterns that occur in the database with the greatest frequency, in the context of the other entries that have already been made into the template, appear at the top of the drop-down list. The goal is to make it as simple as possible for metadata authors to fill in the templates, using as many entries from standard ontologies as they can, and to allow the authors to do so as quickly and as accurately as possible.

===3 Deployment and Evaluation===
The CEDAR team includes several community-based groups who are helping to develop and evaluate our current system. These collaborators include (1) the BioSharing initiative, which catalogs metadata standards for describing biomedical experiments (McQuilton, P., et al., 2016), (2) ImmPort, a data warehouse of immunology-related datasets (Bhattacharya, S., et al., 2014), and (3) the Human Immunology Project Consortium Standards Working Group, which designs new metadata templates and channels experimental datasets to the ImmPort repository. We have successfully represented metadata from several hundred studies provided by these groups within the CEDAR Workbench. We are also working with the LINCS project to develop a more robust metadata management pipeline that supports the authoring of metadata for a wide range of studies (Vempati, U.D., et al., 2014). Collaborations with other scientific consortia are in the planning stage, with the long-term goal of making all scientific data easier to find, access, integrate, and reuse.

===Acknowledgments===
CEDAR is supported by NIAID grant U54 AI117925 through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative. CEDAR includes participation from groups at Stanford University, Yale University, the University of Oxford, and Northrop Grumman Corporation. Martin J. O’Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, and John Graybeal have contributed to the development of the CEDAR Workbench. Additional information about CEDAR is available from the Center’s Web site: http://metadatacenter.org.

===References===
Bhattacharya, S., Andorf, S., Gomes, L., et al. 2014. ImmPort: disseminating data to the public for the future of immunology. Immunologic Research 58(2-3):234-239.

Gonçalves, R.S., O’Connor, M.J., Martínez-Romero, M., et al. 2017. Metadata in the BioSample online repository are impaired by numerous anomalies. Proceedings of SemSci: Enabling Open Semantic Science. International Semantic Web Conference. Vienna, Austria.

Martínez-Romero, M., O’Connor, M.J., Shankar, R., et al. 2017. Fast and accurate metadata authoring using ontology-based recommendations. Proceedings of the American Medical Informatics Association Annual Symposium. Washington, DC.

McQuilton, P., Gonzalez-Beltran, A., Rocca-Serra, P., et al. 2016. BioSharing: curated and crowd-sourced metadata standards, databases, and data policies in the life sciences. Database 2016, doi: 10.1093/database/baw075.

Musen, M.A., Bean, C.A., Cheung, K.-H., et al. 2015. The Center for Expanded Data Annotation and Retrieval. Journal of the American Medical Informatics Association 22(6):1148-1152.

Nosek, B.A., Alter, G., Banks, G.C., et al. 2015. Promoting an open research culture. Science 348(6242):1422-1424.

O’Connor, M.J., Martínez-Romero, M., Egyedi, A.L., et al. 2016. An open repository model for acquiring knowledge about scientific experiments. Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge Management. Bologna, Italy.

Taylor, C.F., Field, D., and Sansone, S.A. 2008. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology 26:889-896.

Vempati, U.D., Chung, C., Mader, C., et al. 2014. Specifications to describe, model, and integrate complex and diverse high-throughput screening data from the Library of Integrated Network-based Cellular Signatures (LINCS). Journal of Biomolecular Screening 19(5):803-816.

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al. 2016. The FAIR guiding principles for scientific data management and stewardship. Nature Scientific Data 3:160018.