<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Harmonizing bioCADDIE Metadata Schemas for Indexing Clinical Research Datasets Using Semantic Web Technologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harold R. Solbrig</string-name>
          <email>solbrig.harold@mayo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoqian Jiang</string-name>
          <email>jiang.guoqian@mayo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mayo Clinic College of Medicine</institution>
          ,
          <addr-line>Rochester, MN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An important role of the NIH Big Data to Knowledge (BD2K) biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) is to promote data integration through the adoption of content standards and alignment to common data elements and high-level schema. The objective of this study was to investigate how a combination of Semantic Web technologies and the ISO/IEC 11179 data element model could be used in the alignment of a biomedical study database and the bioCADDIE indexing schema. Using the database of Genotypes and Phenotypes (dbGaP) as a representative example, we were able to demonstrate the viability of the general approach and propose a number of promising next steps.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Part 3 of ISO/IEC 11179 defines a formal model of a data element registry and
its basic attributes. The model provides a structure for representing data elements, their
types, units of measure, possible values, etc. It also specifies how each of these
components can be associated with its intended meaning: the real-world objects and
properties that the data elements represent. In this study, we transform the dbGaP and
bioCADDIE models from their native XML Schema and JSON Schema
representations into their corresponding OWL equivalents. We then align the results with an
OWL <xref ref-type="bibr" rid="ref8">[8]</xref> representation of the ISO/IEC 11179-3 model, which serves as an
Upper Level Ontology (ULO). We demonstrate that the result of this process, when
used in combination with a description logic (DL) reasoner, can be used to discover
and validate possible alignments between dbGaP and bioCADDIE model components,
and to uncover issues with them.</p>
    </sec>
    <sec id="sec-2">
      <title>Materials and Methods</title>
      <p>This project used three resources: the XML Schema representation of the dbGaP
data dictionary, the JSON Schema representation of the bioCADDIE metadata
schema files, and an OWL representation of the ISO/IEC 11179 Edition 3 Part 3 model <xref ref-type="bibr" rid="ref9">[9]</xref>.</p>
      <p>dbGaP</p>
      <p>dbGaP <xref ref-type="bibr" rid="ref4">[4]</xref> is an NIH pilot data commons charged with archiving, curating and distributing
information produced by studies investigating the interaction of genotype and
phenotype. The dbGaP database structure is defined in XML Schema. Figure 1 shows a
diagram illustrating a portion of the dbGaP schema for the Study resource.</p>
      <p>bioCADDIE</p>
      <p>bioCADDIE <xref ref-type="bibr" rid="ref1">[1]</xref> is a data discovery index (DDI) being developed to index resources
such as dbGaP. The bioCADDIE metadata schema files represent a collection of
descriptive metadata and structure for datasets being developed by the bioCADDIE
Metadata Specification Working Group 3 <xref ref-type="bibr" rid="ref10">[10]</xref>. Figure 2 shows an overview of the
structure of the bioCADDIE metadata schemas.</p>
      <p>ISO/IEC 11179 Edition 3 Part 3 <xref ref-type="bibr" rid="ref7">[7]</xref> defines a conceptual model of a metadata registry, that is, a
registry of the contents and semantics of data elements such as those defined in
dbGaP and bioCADDIE. An OWL representation of this conceptual model was
developed by the eXtended MetaData Repository (XMDR) group <xref ref-type="bibr" rid="ref11">[11]</xref> for use with RDF-based
metadata repositories. Figure 3 shows a high-level view of the ISO/IEC 11179
metamodel.</p>
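      <p>As a rough illustration of the core idea behind the ISO/IEC 11179 metamodel, the sketch below pairs a data element's meaning (a data element concept, made up of an object class and a property) with its representation (a value domain). The class and field names are simplified assumptions for illustration and do not reproduce the XMDR OWL vocabulary; the example values are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataElementConcept:
    """Meaning of a data element: an object class plus one of its properties."""
    object_class: str    # real-world thing, e.g. "Participant"
    property_name: str   # characteristic of that thing, e.g. "age"


@dataclass(frozen=True)
class ValueDomain:
    """Representation of a data element: datatype, unit and allowed values."""
    datatype: str
    unit_of_measure: Optional[str] = None
    permissible_values: Tuple[str, ...] = ()  # non-empty means enumerated

    @property
    def is_enumerated(self) -> bool:
        # An enumerated value domain lists its values explicitly;
        # otherwise the domain is only described (by datatype, unit, rule).
        return len(self.permissible_values) > 0


@dataclass(frozen=True)
class DataElement:
    """A data element pairs a meaning with a representation."""
    name: str
    concept: DataElementConcept
    value_domain: ValueDomain


# Hypothetical example values, chosen only to show the structure.
sex = DataElement(
    name="GAP Sex",
    concept=DataElementConcept("Participant", "sex"),
    value_domain=ValueDomain(
        "string", permissible_values=("female", "male", "unknown")
    ),
)
age = DataElement(
    name="GAP Age",
    concept=DataElementConcept("Participant", "age"),
    value_domain=ValueDomain("integer", unit_of_measure="years"),
)
```
]]></preformat>
      <p>Two data elements from different schemas can share a concept while differing in value domain, which is the kind of distinction the alignment work has to surface.</p>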
      <p>Methods</p>
      <p>We converted the dbGaP and bioCADDIE schemas into their equivalent OWL
representations, using the ISO/IEC 11179 model as a common Upper Level Ontology
(ULO), and demonstrated how an OWL DL reasoner could be used to evaluate
proposed similarities between elements in the two models. We then prototyped
approaches to using a reference ontology such as the Ontology of Clinical Research
(OCRe; https://bioportal.bioontology.org/ontologies/OCRE) to discover common or similar elements between the two models.</p>
      <p>The dbGaP XML Schema was transformed into its OWL equivalent by representing
complex types as OWL Classes, properties referencing complex types as Object
Properties, and properties referencing simple types and their restrictions as Data
Properties. Data Properties were classified as subproperties of the ISO 11179 Data Element
and their target types as 11179 Value Domain, with the Boolean and Enumeration types classified as
Enumerated Value Domain and everything else as Described Domain. RDF domains
and ranges were defined for the properties. A prefix, "GAP ", was added to all labels
to allow dbGaP elements to be easily distinguished in the combined environment.</p>
      <p>We repeated this transformation process with the bioCADDIE JSON Schema
(source: https://github.com/biocaddie/WG3-MetadataSpecifications),
mapping objects to OWL Classes, containments and associations to OWL Object
Properties, and data types to OWL Data Elements. We then anchored this transformation to
the ISO 11179 model as described above, prepending the labels with "DDI".</p>
      <p>We then created a mapping ontology that imported the dbGaP, bioCADDIE and ISO
11179 ontologies. This mapping ontology allowed us to (attempt to) make assertions
about potential alignments between dbGaP and bioCADDIE data elements and
properties and to understand the ramifications of these decisions.</p>
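      <p>The mapping rules described above can be sketched for a single JSON Schema object as follows. This is a minimal illustration under stated assumptions, not the conversion code used in the study: the schema fragment, the iso11179: names and the ValueDomain naming scheme are hypothetical, and the output is a list of triple-like tuples rather than a valid OWL serialization.</p>
      <preformat><![CDATA[
```python
# Hypothetical JSON Schema fragment; the property names are illustrative
# and are not copied from the bioCADDIE specification.
SCHEMA = {
    "title": "Dataset",
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "isPublic": {"type": "boolean"},
        "availability": {"enum": ["open", "restricted"]},
        "creator": {"$ref": "#/definitions/Person"},
    },
}


def schema_to_triples(schema, prefix="DDI"):
    """Map one JSON Schema object to (subject, predicate, object) tuples."""
    cls = f"{prefix}:{schema['title']}"
    triples = [
        (cls, "rdf:type", "owl:Class"),
        (cls, "rdfs:label", f"{prefix} {schema['title']}"),
    ]
    for name, spec in schema.get("properties", {}).items():
        prop = f"{prefix}:{name}"
        triples.append((prop, "rdfs:label", f"{prefix} {name}"))
        triples.append((prop, "rdfs:domain", cls))
        if "$ref" in spec:
            # Reference to another object: an association between classes.
            target = spec["$ref"].rsplit("/", 1)[-1]
            triples.append((prop, "rdf:type", "owl:ObjectProperty"))
            triples.append((prop, "rdfs:range", f"{prefix}:{target}"))
        else:
            # Primitive value: a data property anchored to the 11179 model.
            # Booleans and enumerations get an enumerated value domain,
            # everything else a described one (hypothetical iso11179: names).
            value_domain = f"{prefix}:{name}ValueDomain"
            enumerated = "enum" in spec or spec.get("type") == "boolean"
            kind = (
                "iso11179:EnumeratedValueDomain"
                if enumerated
                else "iso11179:DescribedValueDomain"
            )
            triples.append((prop, "rdf:type", "owl:DatatypeProperty"))
            triples.append((prop, "rdfs:subPropertyOf", "iso11179:DataElement"))
            triples.append((prop, "rdfs:range", value_domain))
            triples.append((value_domain, "rdfs:subClassOf", kind))
    return triples


triples = schema_to_triples(SCHEMA)
```
]]></preformat>
      <p>Calling the same function with prefix="GAP" on a structure derived from the XML Schema would keep the two sources distinguishable by label, in the spirit of the prefixing step described above.</p>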
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>We successfully transformed the dbGaP and bioCADDIE metadata schemas into a
common OWL / ISO 11179 syntax and demonstrated that a DL reasoner can
be used to determine and validate equivalence assertions between the resources. As
an example, the assertion that the dbGaP Dataset is equivalent to the bioCADDIE
Dataset entails that both classes include createDates, modDates, original names, data
types, etc. (see Figure 4A). This process also uncovers underspecified types such as
the dbGaP “DisplayName” element in Dataset. An assertion of equivalence between
the dbGaP Dataset “Accession ID” and the bioCADDIE Dataset “identifierInfo”
is flagged as a typing error (see Figure 4B).</p>
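      <p>The kind of typing problem a reasoner reports for a proposed equivalence can be approximated with a much simpler check, sketched below. This is a stand-in for illustration only and not a DL reasoner. The property kinds and ranges are assumptions made up for the example (the source does not state them); the check relies only on the fact that OWL 2 DL keeps object properties and data properties disjoint.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDecl:
    name: str
    kind: str    # "data" for literal-valued, "object" for class-valued
    range: str   # datatype or class the property points to


def equivalence_problems(a, b):
    """Return reasons why a proposed equivalence between a and b needs review."""
    if a.kind != b.kind:
        # A property cannot be both an object and a data property in OWL 2 DL.
        return [
            f"{a.name} is declared as kind '{a.kind}' "
            f"but {b.name} as kind '{b.kind}'"
        ]
    if a.range != b.range:
        # Differing ranges are not always an error, so only flag them.
        return [f"{a.name} has range {a.range} but {b.name} has range {b.range}"]
    return []


# Hypothetical declarations; kinds and ranges are illustrative assumptions.
accession = PropertyDecl("GAP AccessionID", "data", "xsd:string")
identifier = PropertyDecl("DDI identifierInfo", "object", "DDI:IdentifierInfo")
gap_created = PropertyDecl("GAP createDate", "data", "xsd:dateTime")
ddi_created = PropertyDecl("DDI createDate", "data", "xsd:dateTime")

clean = equivalence_problems(gap_created, ddi_created)
flagged = equivalence_problems(accession, identifier)
```
]]></preformat>
      <p>A real reasoner goes further than this pairwise comparison, since it also follows the consequences of an assertion through class hierarchies and property restrictions.</p>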
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Our next step will be to propose conceptual definitions (meanings) for both sets of
data elements, which should enable richer validation. We anticipate that this platform
and approach, once completed, will be equally applicable at both the metamodel and
model instance levels and, in combination with other Natural Language Processing
(NLP) and table-based alignment tools, will provide a framework for both validation
and eventual dissemination through the CDISC PhUSE as well as other 11179-based
tooling. The artifacts of the project are accessible at
https://github.com/crDDI/ontologies.</p>
      <p>Acknowledgement: This study was supported in part by funding from NIH Big
Data to Knowledge, Grant 1U24AI117966-01 and NCI U01 CA180940.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>BioCADDIE Project</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://biocaddie.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <source>BioCADDIE Pilot Project</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://biocaddie.org/pilot-project-harvester-announcement.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>HL7 FHIR DSTU 2</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: http://www.hl7.org/fhir/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tryka</surname>
            <given-names>KA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturcke</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>ZY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ziyabari</surname>
            <given-names>L</given-names>
          </string-name>
          , et al.
          <article-title>NCBI's Database of Genotypes and Phenotypes: dbGaP</article-title>.
          <source>Nucleic Acids Research</source>
          .
          <year>2014</year>
          ;
          <volume>42</volume>
          (Database issue):
          <fpage>D975</fpage>
          -
          <lpage>9</lpage>
          . Epub 2013/12/04.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <source>The College of American Pathologists (CAP) eCC (electronic Cancer Checklists)</source>.
          <year>2016</year>
          [August 7, 2016]; Available from: http://www.cap.org/capecc.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <source>TCGA Data Portal</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://tcgadata.nci.nih.gov/tcga/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <source>ISO/IEC 11179 Metadata Standard</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: http://metadata-standards.org/11179/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <source>Web Ontology Language (OWL)</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: http://www.w3.org/2001/sw/wiki/OWL.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <source>XMDR OWL Schema</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://wiki.nci.nih.gov/pages/viewpageattachments.action?pageId=10854184.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <source>bioCADDIE WG3 Metadata Specifications</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://github.com/biocaddie/WG3-MetadataSpecifications.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <source>XMDR</source>.
          <year>2016</year>
          [March 10, 2016]; Available from: https://en.wikipedia.org/wiki/XMDR.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>