A Demonstration of Layered Schema Architecture as a Semantic Harmonization Tool

Burak Serdar¹
¹ Cloud Privacy Labs, Colorado, USA

Abstract
We will demonstrate the Layered Schema Architecture (LSA) as a novel semantic harmonization tool that makes health-related records research- and AI-ready. LSA is designed to capture data, contextual metadata, and semantics in order to preserve the correct meaning of data through transformations in an ecosystem where multiple data standards and conventions coexist. The specific examples in this demonstration deal with the harmonization of clinical data and social needs screening survey data collected from disparate systems that use multiple data standards and ontologies.

Keywords
Layered schemas, semantic interoperability, semantic harmonization, data warehousing, FHIR, OMOP

1. Introduction
Longitudinal health data from large, diverse populations with varying social, economic, geographic, and environmental conditions are a highly valuable resource for medical and public health researchers. Data commons expand research options by structuring and harmonizing such disparate data. Many challenges hinder the complete and efficient capture and exchange of health data, including: 1) a lack of semantic interoperability across systems; 2) the varying adoption of data standards within and between systems; 3) a lack of standardized metadata; and 4) the poor integration of electronic health record (EHR) data with data from other relevant sources such as social services, environmental measurements, patient-entered data, and data collected from wearable devices. One challenge to semantic harmonization is the ingestion of data from increasingly diverse sources where vendor-specific variations and non-standard representations are common; examples include data from wearable devices, different social needs screening tools, and public datasets. Another challenge is the difficulty of interpreting data based on context.
Semantic harmonization has to take into account the context in which data are captured as well as the context in which data will be used after transformation. A common data model (CDM) such as the Observational Medical Outcomes Partnership (OMOP) [1] CDM helps to integrate data sets coming from multiple sources and transform them for use by researchers; however, most source data are developed for the fit and purpose of specific organizations.

The Eighth Joint Ontology Workshops (JOWO'22), August 15-19, 2022, Jönköping University, Sweden
bserdar@cloudprivacylabs.com (B. Serdar), https://cloudprivacylabs.com/ (B. Serdar)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

A common approach for the semantic harmonization of such data involves developing extract-transform-load (ETL) scripts for each data source. This approach hardwires source-specific variations in data (such as measurement units, coding systems, organizational conventions, or extensions to standards) while losing valuable contextual information (e.g., measurement variations for laboratory results, patient-reported vs. physician-entered data). It also lacks a standardized and reproducible way to capture and distribute relevant metadata. The Layered Schema Architecture (LSA) [2] is an open-source technology developed by Cloud Privacy Labs to enable semantic interoperability in a data ecosystem where multiple overlapping data standards exist. In partnership with the DARTNet Institute, we are exploring the use of LSA to semantically harmonize and translate health and health-related data captured from disparate sources in different formats into OMOP for research and AI purposes.
The specific examples of this project use clinical and social determinants of health (SDoH) related data, but the framework is domain- and ontology-agnostic and can be applied to various use cases.

2. Layered Schemas and Labeled Property Graphs
A schema is a machine-readable document that describes the structure of data. JSON and XML schemas are widely used to generate executable code from specifications and to check the structural validity of data. LSA extends schemas (such as FHIR [3] or OMOP schemas) with layers (overlays) that add semantic annotations: ontology mappings, contextual metadata, tags, and processing instructions that control data ingestion and normalization. A schema variant is composed of a schema and a set of overlays, and contains the combined annotations of the schema and its overlays. A schema variant is represented as a labeled property graph (LPG) with a node for each data field. An LPG is a directed graph in which every node and edge carries a set of labels describing its type or class, and a set of properties that represent named values. An LPG allows assigning tags that represent different types of metadata to fields. A field may be a simple value, a structured object (e.g., a JSON object, array, or polymorphic object), or a reference to another schema. Different schema variants can be used to ingest data that vary by source. Data variations can be structural (e.g., additional data fields, extensions) or semantic (e.g., measurements in different units, different ontologies or coding systems), and can be due to different vendor implementations, local conventions, or regulations. Ingesting structured data using a schema variant creates an LPG whose nodes combine the annotations from the schema variant with data values from the input.
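The composition of a schema variant and the ingestion step can be sketched as follows. This is a minimal illustration of the idea, not the actual LSA schema or overlay format; the field names, annotation keys, and function names are our own assumptions.

```python
# Minimal sketch: compose a schema variant from a base schema and
# overlays, then ingest a record into LPG-style nodes. Field names
# and annotation keys are illustrative, not the real LSA format.

def compose_variant(schema, *overlays):
    """Merge overlay annotations into the base schema, field by field."""
    variant = {field: dict(attrs) for field, attrs in schema.items()}
    for overlay in overlays:
        for field, attrs in overlay.items():
            variant.setdefault(field, {}).update(attrs)
    return variant

def ingest(variant, record):
    """Create one node per data field, combining the schema variant's
    annotations with the value taken from the input record."""
    return [
        {"field": field, "value": value, **variant.get(field, {})}
        for field, value in record.items()
    ]

# A hypothetical base schema plus a privacy overlay (cf. Section 3).
base = {"ssn": {"type": "string"}, "heart_rate": {"type": "number"}}
privacy_overlay = {"ssn": {"privacyClass": "dpv-pd:Identifying"}}

variant = compose_variant(base, privacy_overlay)
nodes = ingest(variant, {"ssn": "123-45-6789", "heart_rate": 72})
```

The key property shown here is that annotations accumulate: the same base schema can be paired with different overlay sets to produce different variants for different data sources.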
3. Data Processing Pipeline with LSA
A high-level overview of the data processing pipeline to ingest FHIR messages and SDoH surveys is illustrated in Figure 1. FHIR ingestion uses a schema variant that enriches standard FHIR schemas with tags for terminology database lookups and data privacy information. The first overlay contains tags that specify lookups in a terminology database. These tags are used by the data ingestion logic to add OMOP concept ids for codes given in different coding systems. The second overlay assigns data privacy vocabulary [4] terms: it adds privacyClass: "dpv:Patient" to Patient, and privacyClass: "dpv-pd:Identifying" to all personally identifying information fields.

Figure 1: Pipeline to ingest FHIR and SDoH survey data into a common graph model

The survey data used in our example are collected using the Protocol for Responding to & Assessing Patients' Assets, Risks & Experiences (PRAPARE) [5] in spreadsheet form, as shown in Table 1. The ingestion schema describes the columns of this spreadsheet. An overlay specifies a valueset that maps question and answer values to OMOP concept ids (Table 2). This valueset is different for each data source, as each organization codes these questions and answers differently.

Table 1: SDoH questionnaire data (PRAPARE) sample

patient id | question                 | answer
309613     | Social Integration       | More than 5 times a week
309613     | Material Security - Food | N
415637     | Material Security - Food | Y

The second stage converts the ingested data to the database graph model, which organizes conditions, measurements, observations, etc. as clusters of nodes linked to a person object. This database graph model provides a convenient representation to perform searches, or to perform further normalizations on data such as de-duplication or imputation. Conversion of ingested data to the database graph model is again done using schema variants.
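The valueset lookup described above can be sketched as a simple mapping from (question, answer) pairs to concept ids, using values from Tables 1 and 2. The function and dictionary names below are ours; the actual overlay expresses this declaratively rather than in code.

```python
# Illustrative sketch of the per-source valueset that maps PRAPARE
# question/answer strings to OMOP concept ids (values from Table 2).
# Names are hypothetical; LSA expresses this as an overlay, not code.

VALUESET = {
    ("Housing Status", "Y"): (37020172, 45877994),
    ("Housing Status", "N"): (37020172, 45878245),
    ("Material Security - Food", "Y"): (37020774, 37079482),
}

def map_row(patient_id, question, answer):
    """Translate one spreadsheet row (Table 1) into an OMOP-style
    observation with question and answer concept ids resolved."""
    q_concept, a_concept = VALUESET[(question, answer)]
    return {
        "person_id": patient_id,
        "observation_concept_id": q_concept,
        "value_as_concept_id": a_concept,
    }

obs = map_row(415637, "Material Security - Food", "Y")
```

Because each organization codes questions and answers differently, swapping data sources means swapping this valueset (i.e., the overlay) while the rest of the pipeline stays unchanged.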
The overlays for the "graph shaping" stage assign graph queries (in the openCypher [6] language) to target schema fields. The graph reshaping operation creates an instance of the target schema using the nodes selected by the assigned graph queries. The ingested data is stored in a graph database (Neo4j [7]). The PRAPARE survey data are ingested as Observations using an overlay that assigns OMOP concept ids for questions and answers to the corresponding concept_id and value_as_concept_id fields. The overlay also adds the PRAPARE and Survey tags to the Observation labels. This allows us to differentiate between observations coming from clinical data and observations coming from surveys. Such metadata is useful in determining the situations under which data is captured, and in assigning "confidence levels" to data elements.

Table 2: SDoH questionnaire OMOP concept id mappings valueset

question                     | question concept id | answer | answer concept id
Housing Status               | 37020172            | Y      | 45877994
Housing Status               | 37020172            | N      | 45878245
Material Security - Food     | 37020774            | Y      | 37079482
Material Security - Clothing | 37020774            | Y      | 37079033

OMOP uses a relational data model, so data stored in the graph database must be translated after the research population is identified. Once the research population is identified, each Person graph is exported from the database, shaped to fit the OMOP schemas using another "graph shaping" stage, and exported in tabular format. A sample output for the OMOP observations table is given in Figure 2.

Figure 2: OMOP observations output

4. Discussion
Our goal is to develop a reusable and scalable interoperability framework to ingest and semantically harmonize health-related data and metadata from disparate sources in a research data warehouse setting. While our focus is producing OMOP output for researchers, the architecture is flexible enough to accommodate other CDMs.
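The final export step above, flattening a Person graph into OMOP observation rows, can be sketched as follows. The node structure and field names are illustrative assumptions; in the pipeline itself this selection is done with openCypher queries assigned by the "graph shaping" overlays, not hand-written code.

```python
# Hedged sketch of the graph-to-tabular export: select Observation
# nodes from one person's graph and emit OMOP-style rows. The graph
# representation here is a simplified stand-in for the Neo4j data.

def to_observation_rows(person_graph):
    """Emit one OMOP observation row per node labeled 'Observation',
    keeping the extra overlay labels (PRAPARE, Survey) as provenance."""
    rows = []
    for node in person_graph["nodes"]:
        if "Observation" not in node["labels"]:
            continue  # skip Person, Condition, etc.
        rows.append({
            "person_id": person_graph["person_id"],
            "observation_concept_id": node["concept_id"],
            "value_as_concept_id": node["value_as_concept_id"],
            "source_labels": sorted(node["labels"]),
        })
    return rows

person = {
    "person_id": 309613,
    "nodes": [
        {"labels": {"Observation", "Survey", "PRAPARE"},
         "concept_id": 37020774, "value_as_concept_id": 45878245},
        {"labels": {"Person"}},
    ],
}
rows = to_observation_rows(person)
```

Retaining the overlay-assigned labels in the output is what lets downstream users distinguish survey-derived observations from clinical ones, as discussed above.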
Our efforts so far suggest that incorporating new data sources using LSA is faster than ETL scripting, and yields reusable artifacts. The LSA approach offers a method for managing relevant metadata that can be important in building high-quality data sets. Such metadata may include information about the context in which data was captured (e.g., whether data was entered by a provider or self-reported), processed (e.g., whether data was imputed, or fuzzed for de-identification purposes), or stored (e.g., the source filename for the data point). The use of labeled property graphs as the core data model offers unique opportunities, such as the ability to use multiple ontologies to tag data. Future work in this area will involve the use of graph properties for data imputations, evaluating the effects of different graph models for building study populations, and the incorporation of natural language processing tools to tag textual data. LSA is an open-source project [8]. The toolset, schemas, overlays, and value sets we developed during this project will be available on a GitHub repository.

Acknowledgments
DARTNet Institute is the primary grantee of this project, and provided the sample datasets and original ETL scripts. We thank DARTNet Institute for their partnership and their valuable insight. This project is supported by the Office of the National Coordinator for Health Information Technology (ONC) of the U.S. Department of Health and Human Services (HHS) under grant number 90AX0034, Semantic Interoperability for Electronic Health Data Using the Layered Schemas Architecture, total award $999,990, 100% financed with federal dollars and 0% financed with non-governmental sources. This information or content and conclusions are those of the author and should not be construed as the official position or policy of, nor should any endorsements be inferred by, ONC, HHS, or the U.S. Government.

References
[1] Observational Health Data Sciences and Informatics (OHDSI), OMOP Common Data Model, 2022. URL: https://www.ohdsi.org/data-standardization/the-common-data-model/.
[2] Layered Schemas, 2022. URL: https://layeredschemas.org.
[3] Fast Healthcare Interoperability Resources (FHIR), 2022. URL: https://hl7.org/fhir/.
[4] Data Privacy Vocabulary, 2022. URL: https://w3c.github.io/dpv/dpv/.
[5] Protocol for Responding to & Assessing Patients' Assets, Risks & Experiences (PRAPARE), 2022. URL: https://prapare.org/.
[6] openCypher, 2022. URL: https://opencypher.org/.
[7] Neo4j, 2022. URL: https://neo4j.com.
[8] LSA GitHub repository, 2022. URL: https://github.com/cloudprivacylabs/lsa.