<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Making Online Datasets More Searchable and Accessible: The CEDAR project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark A. Musen</string-name>
          <email>musen@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford Center for Biomedical Informatics Research</institution>
          <addr-line>1265 Welch Road, Room X-215, Stanford, California 94305-5479</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>20</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Scientists are increasingly archiving their data in online repositories to promote open science and data reuse. The ability to find and access datasets that are stored in these repositories depends on the quality of the associated metadata. There is a growing set of community-developed standards for defining such metadata, often in the form of metadata templates. The practical difficulties of working with these templates are tremendous, however. The Center for Expanded Data Annotation and Retrieval (CEDAR) is developing technologies to assist in the management of biomedical metadata. By discovering patterns in existing metadata and by linking templates to biomedical ontologies, CEDAR is assisting the authoring of new, high-quality metadata. The availability of comprehensive and expressive metadata will facilitate data discovery, interoperability, and reuse.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The past few years have seen an increasing call
for “open science,” where investigators make their
data available for public access and reuse
        <xref ref-type="bibr" rid="ref6">(Nosek, B.A., et al., 2015)</xref>
        . There are obvious
opportunities to make new discoveries by examining,
integrating, and analyzing data provided by other
scientists. Funding organizations and journal editors
are increasingly insisting that investigators place
their experimental data in public repositories for
the benefit of the scientific community. The
problem, however, is that submitting data to a public
repository can be an onerous task that most
investigators would like to avoid. Online datasets
need to be supplemented by metadata (data about
the data) that describe the subjects of the
experiment, the conditions under which the data were
collected, and the major steps that the
investigators followed to perform their study. Good
metadata are needed for other scientists to be able to
search for relevant datasets, to make sense of the
data, and to know how to reanalyze the data. Yet
most datasets are annotated with
very poor metadata
        <xref ref-type="bibr" rid="ref2">(Gonçalves, R.S., et al., 2017)</xref>.
Metadata authors are burdened by cumbersome
requirements and receive too little guidance.
As a result, metadata are often riddled with
typographical errors and often fail to
incorporate standard ontological terms when required.
There is a clear need for methods to make it easier
for scientists to author high-quality metadata and
to archive their datasets in a manner that will
ensure that the data will be findable, accessible,
interoperable, and reusable (FAIR; <xref ref-type="bibr" rid="ref10">Wilkinson, M.D., et al., 2016</xref>). We believe that the fundamental
challenge of the open-science movement is
effective annotation of datasets with metadata that are
complete and comprehensive. CEDAR is
committed to the development of tools that make
it easy for scientists to create high-quality
metadata
        <xref ref-type="bibr" rid="ref5">(Musen, M.A., et al., 2015)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 The CEDAR Workbench</title>
      <p>
        CEDAR is building a suite of tools, known as
the CEDAR Workbench, that form a pipeline
for authoring experimental metadata
        <xref ref-type="bibr" rid="ref7">(O’Connor, M.J., et al., 2016)</xref>
        . We are working in the area
of biomedical science, where there is already a
trend for different scientific communities to
specify standardized templates that capture the
minimal requirements for metadata related to different
classes of experiments
        <xref ref-type="bibr" rid="ref8">(Taylor, C.F., Field, D., and
Sansone, S.A., 2008)</xref>
        .
      </p>
      <sec id="sec-2-1">
        <title>Metadata Template Repository</title>
        <p>
          We have developed a standardized representation of
metadata templates together with Web-based services
to store, search, and share these templates.
Templates created using CEDAR technology are stored
in our openly accessible community repository.
Researchers access the repository to search for
appropriate templates to annotate their studies.
Web-based interfaces and REST APIs enable access to
all metadata templates, as well as to all the
metadata collected using those templates
          <xref ref-type="bibr" rid="ref7">(O’Connor, M.J., et al., 2016)</xref>
          .
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Metadata Template Creator and Template Editor</title>
        <p>Two highly interactive Web-based tools
simplify the process of authoring metadata
templates. The Template Creator allows users to
create, search, and author metadata templates. Using
interactive look-up services linked to the NCBO
BioPortal, template authors can find terms in
ontologies to annotate their templates and to restrict
the values of template fields. The Template
Creator automatically produces a user interface
specification as it builds a template. The Metadata
Editor uses this specification to generate a
forms-based interface for acquiring
individual metadata components.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Intelligent Authoring</title>
        <p>
          To ease the burden of
authoring high-quality metadata, a recommender
framework learns associations between data
elements and suggests to the user context-sensitive
metadata values
          <xref ref-type="bibr" rid="ref3">(Martínez-Romero, M., et al., 2017)</xref>
          . The system can recommend possible
values for metadata elements during the submission
process as each blank is selected and the user
begins to type. The template editor also sorts
possible selections in drop-down lists so that the
terms that occur most frequently in the database,
in the context of the entries that
have already been made in the template, appear
at the top of the list. The goal is to
make it as simple as possible for metadata authors
to fill in the templates, using as many entries from
standard ontologies as they can, and to allow
the authors to do so as quickly and as accurately
as possible.
        </p>
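        <p>The frequency-based ordering described above can be illustrated with a minimal Python sketch. It is not CEDAR's implementation: the function name rank_values, the record layout, and the sample values are hypothetical, and exact matching on the fields entered so far is a simplifying assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def rank_values(records, target_field, context):
    """Rank candidate values for target_field by how often they co-occur
    with the already-entered context values in existing records."""
    counts = Counter()
    for record in records:
        value = record.get(target_field)
        if value is None:
            continue
        # Keep only records that agree with every value entered so far.
        if all(record.get(field) == entered for field, entered in context.items()):
            counts[value] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    # Most frequent first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(value, count / total) for value, count in ranked]


# Hypothetical, hand-made records; real metadata would come from a repository.
records = [
    {"organism": "Homo sapiens", "tissue": "blood", "disease": "influenza"},
    {"organism": "Homo sapiens", "tissue": "blood", "disease": "influenza"},
    {"organism": "Homo sapiens", "tissue": "blood", "disease": "asthma"},
    {"organism": "Homo sapiens", "tissue": "lung", "disease": "asthma"},
    {"organism": "Mus musculus", "tissue": "lung", "disease": "asthma"},
]

# Values already entered in the template act as the context.
suggestions = rank_values(
    records, "disease", {"organism": "Homo sapiens", "tissue": "blood"}
)
for value, score in suggestions:
    print(f"{value}: {score:.2f}")
```
]]></preformat>
        <p>With these sample records, entering Homo sapiens and blood ranks influenza above asthma for the disease field, because two of the three matching records carry that value. A production recommender would also need to cope with sparse contexts and to map free-text values to ontology terms.</p>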
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Deployment and Evaluation</title>
      <p>
        The CEDAR team includes several community-based
groups who are helping to develop and
evaluate our current system. These collaborators
include (1) the BioSharing initiative, which catalogs
metadata standards for describing biomedical
experiments
        <xref ref-type="bibr" rid="ref4">(McQuilton, P., et al., 2016)</xref>
        , (2)
ImmPort, a data warehouse of immunology-related
datasets
        <xref ref-type="bibr" rid="ref1">(Bhattacharya, S., et al., 2014)</xref>
        , and (3) the
Human Immunology Project Consortium
Standards Working Group, which designs new
metadata templates and channels experimental datasets
to the ImmPort repository. We have successfully
represented metadata from several hundred
studies provided by these groups within the CEDAR
Workbench. We are also working with the LINCS
project to develop a more robust metadata
management pipeline that supports the authoring of
metadata for a wide range of studies
        <xref ref-type="bibr" rid="ref9">(Vempati,
U.D., et al., 2014)</xref>
        . Collaborations with other
scientific consortia are in the planning stage, with the
long-term goal of making all scientific data easier
to find, access, integrate, and reuse.
      </p>
      <sec id="sec-3-1">
        <title>Acknowledgments</title>
        <p>CEDAR is supported by NIAID grant U54
AI117925 through funds provided by the
trans-NIH Big Data to Knowledge (BD2K)
initiative. CEDAR includes participation from
groups at Stanford University, Yale University,
the University of Oxford, and Northrop
Grumman Corporation. Martin J. O’Connor,
Marcos Martínez-Romero, Attila L. Egyedi, Debra
Willrett, and John Graybeal have contributed
to the development of the CEDAR Workbench.
Additional information about CEDAR is
available from the Center’s Web site:
http://metadatacenter.org.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andorf</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.
          <year>2014</year>
          .
          <article-title>ImmPort: disseminating data to the public for the future of immunology</article-title>
          .
          <source>Immunologic Research</source>
          <volume>58</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>234</fpage>-<lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Romero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Metadata in the BioSample online repository are impaired by numerous anomalies</article-title>
          .
          <source>Proceedings of SemSci: Enabling Open Semantic Science, International Semantic Web Conference</source>
          . Vienna, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Martínez-Romero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Fast and accurate metadata authoring using ontology-based recommendations</article-title>
          .
          <source>Proceedings of the American Medical Informatics Association Annual Symposium</source>
          . Washington, DC.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>McQuilton</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Beltran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocca-Serra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>BioSharing: curated and crowdsourced metadata standards, databases, and data policies in the life sciences</article-title>
          .
          <source>Database</source>
          <volume>2016</volume>
          , doi: 10.1093/database/baw075.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bean</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>K.-H.</given-names>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>The Center for Expanded Data Annotation and Retrieval</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>22</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1148</fpage>-<lpage>1152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Nosek</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banks</surname>
            ,
            <given-names>G.C.</given-names>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>Promoting an open research culture</article-title>
          .
          <source>Science</source>
          <volume>348</volume>
          (
          <issue>6242</issue>
          ):
          <fpage>1422</fpage>-<lpage>1424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Romero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egyedi</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>An open repository model for acquiring knowledge about scientific experiments</article-title>
          .
          <source>Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge Management</source>
          . Bologna, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>C.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Field</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sansone</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project</article-title>
          .
          <source>Nature Biotechnology</source>
          <volume>26</volume>
          :
          <fpage>889</fpage>-<lpage>896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Vempati</surname>
            ,
            <given-names>U.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mader</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.
          <year>2014</year>
          .
          <article-title>Specifications to describe, model, and integrate complex and diverse high-throughput screening data from the Library of Integrated Network-based Cellular Signatures (LINCS)</article-title>
          .
          <source>Journal of Biomolecular Screening</source>
          <volume>19</volume>
          (
          <issue>5</issue>
          ):
          <fpage>803</fpage>-<lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>The FAIR guiding principles for scientific data management and stewardship</article-title>
          .
          <source>Nature Scientific Data</source>
          <volume>3</volume>
          :
          <fpage>160018</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>