<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Demonstration of Layered Schema Architecture as a Semantic Harmonization Tool</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Burak Serdar</string-name>
          <email>bserdar@cloudprivacylabs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cloud Privacy Labs</institution>
          ,
          <addr-line>Colorado</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Eighth Joint Ontology Workshops</institution>
          ,
          <addr-line>JOWO'22</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We will demonstrate Layered Schema Architecture (LSA) as a novel semantic harmonization tool to make health-related records research and AI-ready. LSA is designed to capture data, contextual metadata, and semantics to preserve the correct meaning of data through transformations in an ecosystem where multiple data standards and conventions exist. The specific examples in this demonstration will deal with the harmonization of clinical data and social needs screening survey data collected from disparate systems using multiple data standards and ontologies.</p>
      </abstract>
      <kwd-group>
        <kwd>Layered schemas</kwd>
        <kwd>semantic interoperability</kwd>
        <kwd>semantic harmonization</kwd>
        <kwd>data warehousing</kwd>
        <kwd>FHIR</kwd>
        <kwd>OMOP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Longitudinal health data from large, diverse populations with varying social, economic,
geographic, and environmental conditions are a highly valuable resource for medical and public
health researchers, particularly when made available through data commons where disparate data are
structured and harmonized to expand research options. Many challenges hinder the complete
and efficient capture and exchange of health data, including: 1) a lack of semantic interoperability
across systems; 2) the varying adoption of data standards within and between systems; 3) a lack
of standardized metadata; and 4) the poor integration of electronic health records (EHR) data
with data from other relevant sources such as social services, environmental measurements,
patient-entered data, and data collected from wearable devices.</p>
      <p>
        A challenge to semantic harmonization is the ingestion of data from increasingly diverse
sources where vendor-specific variations and non-standard representations are common. Some
examples include data from wearable devices, different social needs screening tools, and public
datasets. Another challenge is the difficulty in interpreting data based on context. Semantic
harmonization has to take into account the context in which data are captured as well as the
context in which data will be used after transformation. A common data model (CDM) such as
Observational Medical Outcomes Partnership (OMOP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) helps to integrate data sets coming
from multiple sources and transform them for use by researchers; however, most source data
are structured to fit the purposes of the specific organizations that collect them.
      </p>
      <p>A common approach for the semantic harmonization of such data involves developing
extract-transform-load (ETL) scripts for each data source. This approach hardwires source-specific
variations in data (such as measurement units, coding systems, organizational conventions,
or extensions to standards) while losing valuable contextual information (e.g. measurement
variations for laboratory results, patient-reported vs. physician-entered data). This approach
also lacks a standardized and reproducible way to capture and distribute relevant metadata.</p>
      <p>
        The “Layered Schema Architecture” (LSA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an open-source technology developed by
Cloud Privacy Labs to enable semantic interoperability in a data ecosystem where multiple
overlapping data standards exist. In a partnership with the DARTNet Institute, we are exploring
the use of LSA to semantically harmonize and translate health and health-related data captured
from disparate sources in different formats into OMOP for research and AI purposes. The specific
examples of this project use clinical and social determinants of health (SDoH) related data, but
the framework is domain- and ontology-agnostic and can be applied to various use cases.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Layered Schemas and Labeled Property Graphs</title>
      <p>
        A schema is a machine-readable document that describes the structure of data. JSON and XML
schemas are widely used to generate executable code from specifications and to check structural
validity of data. LSA extends schemas (such as FHIR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or OMOP schemas) with layers (overlays)
to add semantic annotations. The semantic annotations add ontology mappings, contextual
metadata, tags, and processing instructions that control data ingestion and normalization. A
schema variant is composed of a schema and a set of overlays, and contains the combination of
annotations given in the schema and the overlays.
      </p>
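      <p>The composition of a schema variant from a schema and its overlays can be sketched as a merge of per-field annotations. The following is a minimal illustration only: the dictionary layout, the field names, the lookupTable tag, the choice of birthDate as an identifying field, and the rule that later overlays win are all assumptions made for this example and do not reproduce the LSA file format or its composition rules.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from copy import deepcopy

# Hypothetical base schema: field id -> annotations (not the LSA file format).
base_schema = {
    "Patient": {"type": "Object"},
    "Patient.birthDate": {"type": "Value", "valueType": "date"},
    "Observation.code": {"type": "Value"},
}

# Assumed overlay 1: tags that request a terminology lookup for coded fields.
terminology_overlay = {
    "Observation.code": {"lookupTable": "omop_concepts"},
}

# Assumed overlay 2: privacy classifications for selected fields.
privacy_overlay = {
    "Patient": {"privacyClass": "dpv:Patient"},
    "Patient.birthDate": {"privacyClass": "dpv-pd:Identifying"},
}


def compose_variant(schema, *overlays):
    """Return a new schema variant: the schema plus every overlay's annotations.

    Assumption: when two layers set the same annotation, the later overlay wins.
    """
    variant = deepcopy(schema)
    for overlay in overlays:
        for field_id, annotations in overlay.items():
            variant.setdefault(field_id, {}).update(annotations)
    return variant


variant = compose_variant(base_schema, terminology_overlay, privacy_overlay)
for field_id, annotations in variant.items():
    print(field_id, annotations)
```
]]></preformat>
      <p>Because the base schema is copied rather than modified, the same schema can be combined with a different set of overlays to produce another variant for another data source.</p>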
      <p>A schema variant is represented using a labeled property graph (LPG) that has a node for
each data field. An LPG is a directed graph where every node and edge contains a set of labels
describing its type or class, and a set of properties that represent named values. An LPG allows
assigning tags that represent different types of metadata to fields. A field may be a simple value,
a structured object (e.g. a JSON object, array, polymorphic object), or a reference to another
schema.</p>
      <p>Different schema variants can be used to ingest data that vary by data
source. Data variations can be structural (e.g. additional data fields, extensions) or semantic
(e.g. measurements in different units, different ontologies or coding systems), and can be due to
different vendor implementations, local conventions, or regulations. Ingesting structured data
using a schema variant creates an LPG whose nodes combine the annotations from the schema
variant and data values from the input.</p>
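      <p>As a rough sketch of this ingestion step, the example below builds a small in-memory LPG in which each field node carries both the annotations of a schema variant and the value read from the input. The Node and Graph classes, the ingest function, the "has" edge label, and the two weight variants are hypothetical and greatly simplified (edges carry a single label and no properties); they are not the LSA data model.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set        # types or classes of the node
    properties: dict   # named values: annotations plus ingested data


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (from_index, label, to_index)

    def add_node(self, labels, properties):
        self.nodes.append(Node(set(labels), dict(properties)))
        return len(self.nodes) - 1


def ingest(record, variant, root_label):
    """Create a root node and one node per input field.

    Each field node combines the variant's annotations with the input value.
    """
    graph = Graph()
    root = graph.add_node({root_label}, variant.get(root_label, {}))
    for name, value in record.items():
        annotations = variant.get(f"{root_label}.{name}", {})
        properties = {**annotations, "fieldName": name, "value": value}
        child = graph.add_node({"Field"}, properties)
        graph.edges.append((root, "has", child))
    return graph


# Two hypothetical variants of one schema: the sources report weight in different units.
variant_a = {"Measurement": {"source": "clinic A"}, "Measurement.weight": {"unit": "kg"}}
variant_b = {"Measurement": {"source": "clinic B"}, "Measurement.weight": {"unit": "lb"}}

g_a = ingest({"weight": 70}, variant_a, "Measurement")
g_b = ingest({"weight": 154}, variant_b, "Measurement")
print(g_a.nodes[1].properties)
print(g_b.nodes[1].properties)
```
]]></preformat>
      <p>The unit annotation travels with each value, so a later normalization step can convert measurements without source-specific code.</p>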
    </sec>
    <sec id="sec-3">
      <title>3. Data Processing Pipeline with LSA</title>
      <p>
        A high-level overview of the data processing pipeline to ingest FHIR messages and SDoH surveys
is illustrated in Figure 1. FHIR ingestion uses a schema variant that enriches standard FHIR
schemas with tags for terminology database lookups and data privacy information. The first
overlay contains tags that specify lookups in a terminology database. These tags are used by the data
ingestion logic to add OMOP concept ids for codes given in different coding systems. The second
overlay assigns data privacy vocabulary [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] terms: it adds privacyClass: "dpv:Patient"
to Patient, and privacyClass: "dpv-pd:Identifying" to all personally identifying
information fields.
      </p>
      <p>
        The survey data used in our example are collected using the Protocol for Responding to &amp;
Assessing Patients’ Assets, Risks &amp; Experiences (PRAPARE) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in spreadsheet form as shown in
Table 1. The ingestion schema describes the columns of this spreadsheet. An overlay is used
to specify a valueset to map question and answer values to OMOP concept ids (Table 2). This
valueset is different for each data source as each organization codes these questions and answers
differently.
      </p>
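      <p>The valueset lookup described above can be sketched as a table keyed by question and answer. In the example below, the column names, the question and answer strings, and the concept ids are placeholders invented for illustration; they are not taken from the PRAPARE instrument or from the OMOP vocabulary, and the function is not part of the LSA toolset.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical valueset for one data source: (question, answer) -> concept ids.
# The ids are placeholders, not real OMOP concept ids.
VALUESET = {
    ("housing_status", "I have housing"): (9000001, 9000101),
    ("housing_status", "I do not have housing"): (9000001, 9000102),
}

# Hypothetical spreadsheet export with one row per answered question.
SAMPLE_CSV = """patient_id,question,answer
p1,housing_status,I have housing
p2,housing_status,I do not have housing
p3,housing_status,unknown
"""


def map_survey_rows(text, valueset):
    """Yield one observation per spreadsheet row.

    Rows without a valueset entry keep None concept ids so they can be reviewed.
    """
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["question"].strip(), row["answer"].strip())
        concept_id, value_concept_id = valueset.get(key, (None, None))
        yield {
            "person": row["patient_id"],
            "concept_id": concept_id,
            "value_as_concept_id": value_concept_id,
        }


observations = list(map_survey_rows(SAMPLE_CSV, VALUESET))
for observation in observations:
    print(observation)
```
]]></preformat>
      <p>Supporting a new organization then amounts to supplying a different valueset, while the ingestion logic stays unchanged.</p>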
      <p>
        The second stage converts the ingested data to the database graph model that organizes
conditions, measurements, observations, etc. as clusters of nodes linked to a person object.
This database graph model provides a convenient representation to perform searches, or to
perform further normalizations on data such as de-duplication or imputation. Conversion of
ingested data to the database graph model is again done using schema variants. The overlays
for the “graph shaping” stage assign graph queries (using the openCypher [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] language) to target
schema fields. The graph reshaping operation creates an instance of the target schema using the
nodes selected by the assigned graph queries. The ingested data are stored in a graph database
(Neo4j [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>The PRAPARE survey data are ingested as Observations using an overlay that
assigns OMOP concept ids for questions and answers to the corresponding concept_id and
value_as_concept_id fields. The overlay also adds the PRAPARE and Survey tags to the
Observation labels. This allows us to differentiate between observations coming from
clinical data and observations coming from surveys. Such metadata are useful in determining the
situations under which data are captured, and in assigning “confidence levels” to data elements.</p>
      <p>OMOP uses a relational data model, so data stored in the graph database must be
translated into tabular form. Once the research population is identified, each
Person graph is exported from the database, shaped to fit the OMOP schemas using another
“graph shaping” stage, and exported in tabular format. A sample output for the OMOP observations
table is given in Figure 2.</p>
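      <p>The final export step can be illustrated by flattening a person-centred graph into rows. The sketch below assumes a simplified dictionary representation of a Person graph and writes only three columns of the OMOP observation table; the graph layout, the helper functions, and the placeholder concept ids are assumptions made for this example and do not show the actual graph shaping implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical Person graph: one person linked to observation nodes.
# The concept ids are placeholders, not real OMOP concept ids.
person_graph = {
    "person": {"person_id": 1},
    "observations": [
        {"labels": ["Observation", "PRAPARE", "Survey"],
         "concept_id": 9000001, "value_as_concept_id": 9000101},
        {"labels": ["Observation"],
         "concept_id": 9000002, "value_as_concept_id": None},
    ],
}

# A small subset of the columns of the OMOP observation table.
COLUMNS = ["person_id", "observation_concept_id", "value_as_concept_id"]


def to_observation_rows(graph):
    """Flatten the observation nodes of one Person graph into table rows."""
    person_id = graph["person"]["person_id"]
    return [
        {
            "person_id": person_id,
            "observation_concept_id": node["concept_id"],
            "value_as_concept_id": node["value_as_concept_id"],
        }
        for node in graph["observations"]
    ]


def write_csv(rows):
    """Serialize rows as CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


table = write_csv(to_observation_rows(person_graph))
print(table)
```
]]></preformat>
      <p>In this sketch the node labels are dropped during flattening; a fuller export would need a place in the relational model for metadata such as the PRAPARE and Survey tags.</p>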
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our goal is to develop a reusable and scalable interoperability framework to ingest and
semantically harmonize health-related data and metadata from disparate sources in a research data
warehouse setting. While our focus is producing OMOP output for researchers, the architecture
is flexible enough to accommodate other CDMs. Our efforts so far suggest that incorporating
new data sources using LSA is faster than ETL scripting, and yields reusable artifacts.</p>
      <p>The LSA approach offers a method for the management of relevant metadata that can be
important in building high-quality data sets. Such metadata may include information about
the context in which data were captured (e.g. whether data were entered by a provider or
self-reported), processed (e.g. whether data were imputed, or fuzzed for de-identification purposes),
or stored (e.g. the source filename for the data point).</p>
      <p>The use of labeled property graphs as the core data model offers unique opportunities, such
as the ability to use multiple ontologies to tag data. Future work in this area will involve the
use of graph properties for data imputations, evaluating the effects of different graph models
for building study populations, and the incorporation of natural language processing tools to
tag textual data.</p>
      <p>
        LSA is an open-source project [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The toolset, schemas, overlays, and value sets we developed
during this project will be available in a GitHub repository.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>DARTNet Institute is the primary grantee of this project, and provided the sample datasets
and original ETL scripts. We thank DARTNet Institute for their partnership and their valuable
insight.</p>
      <p>This project is supported by the Office of the National Coordinator for Health Information
Technology (ONC) of the U.S. Department of Health and Human Services (HHS) under grant
number 90AX0034, Semantic Interoperability for Electronic Health Data Using the Layered
Schemas Architecture, total award $999,990 with 100% financed with federal dollars and 0%
financed with non-governmental sources. This information or content and conclusions are
those of the author and should not be construed as the official position or policy of, nor should
any endorsements be inferred by ONC, HHS, or the U.S. Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Observational Health Data Sciences and Informatics,
          <source>OMOP common data model</source>
          ,
          <year>2022</year>
          . URL: https://www.ohdsi.org/data-standardization/the-common-data-model/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Layered schemas</source>
          ,
          <year>2022</year>
          . URL: https://layeredschemas.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <source>Fast healthcare interoperability resources (FHIR)</source>
          ,
          <year>2022</year>
          . URL: https://hl7.org/fhir/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Data privacy vocabulary</source>
          ,
          <year>2022</year>
          . URL: https://w3c.github.io/dpv/dpv/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <source>Protocol for responding to &amp; assessing patients' assets, risks &amp; experiences (PRAPARE)</source>
          ,
          <year>2022</year>
          . URL: https://prapare.org/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] openCypher,
          <year>2022</year>
          . URL: https://opencypher.org/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Neo4j</source>
          ,
          <year>2022</year>
          . URL: https://neo4j.com.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <source>LSA GitHub</source>
          ,
          <year>2022</year>
          . URL: https://github.com/cloudprivacylabs/lsa.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>