<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Collaborative, Realism-Based, Electronic Healthcare Graph: Public Data, Common Data Models, and Practical Instantiation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark A. Miller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian J. Stoeckert Jr.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Genetics, Institute for Biomedical Informatics Perelman School of Medicine, University of Pennsylvania Philadelphia</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Biomedical Informatics Perelman School of Medicine, University of Pennsylvania Philadelphia</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RDF triples</institution>
          ,
          <addr-line>realism, EHR, OMOP</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is ample literature on the semantic modeling of biomedical data in general, but less has been published on realism-based, semantic instantiation of electronic health records (EHR). Reasons include difficult design choices and issues of data governance. A collaborative approach can address design and technology utilization issues, but is especially constrained by limited access to the data at hand: protected health information. Effective collaboration can be facilitated with public, EHR-like data sets, which would ideally include a large variety of datatypes mirroring actual EHRs and enough records to drive a performance assessment. Investing in reading public EHR-like data from a popular common data model (CDM) is preferable to reading each public data set's native format. In addition to identifying suitable public EHR-like data sets and CDMs, this paper addresses instantiation via relational-to-RDF mapping. The completed instantiation is available for download, and a competency question demonstrates fidelity across all discussed formats.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes the reification and semantic instantiation of
selected columns from public data sets that are largely
representative of a prototypical electronic health record system.</p>
      <p>Because the original data come from public sources, the output (a
downloadable RDF data set) is available for scrutiny by anyone
who has an interest in disciplines like healthcare informatics,
linked data, or ontological realism (1).</p>
      <p>There is evidence that, while even experienced users of
sophisticated upper ontologies like the Basic Formal Ontology
(BFO) (2) may have some difficulty in tasks such as properly
classifying individuals when working in isolation (3), those same
ontologists are likely to reach a consensus as a group after
reviewing one another’s positions. Participation in the weekly
web meeting (4) held by the Ontology for Biomedical
Investigations (OBI) community confirms the value of this kind
of collaborative approach. Even semi-anonymous resources like
Stack Overflow can be a source of useful collaboration, given the
submission of a well-written question that includes sample data.</p>
      <p>Fundamentally, the collaborative approach acknowledges that no
one individual or group is likely to be an authority on the content
and structure of an EHR, semantic web technologies in general,
and specifically, the Web Ontology Language (OWL), the BFO,
and mid-level ontologies from the Open Biological and
Biomedical Ontologies (OBO) Foundry.</p>
      <p>There is no reason to believe that these difficulties or the need for
collaboration are unique to a semantic approach. Large-scale
initiatives to harmonize, integrate, or transfer health care data
with relational technologies, including CDMs to be discussed
later, have benefited from precisely the large, diverse input that is
recommended for similar semantic initiatives.</p>
      <p>Despite a 50-year history, resulting in a plethora of commercial
product and support offerings, the relational database world still
struggles to cope with complex, heterogeneous data. Therefore,
we find this to be an ideal time for innovative application of
semantic web approaches to health care data. Specifically, we
advocate working collaboratively with synthetic healthcare data
stored as RDF triples, and using terms from OWL ontologies that
follow the ontological realism method. This can be thought of as a
chain of progressive and cumulative commitments:</p>
      <p>The use of any graph format will support assertions
about (and visualization of) chained or branching
relations like temporal precedence, or the inputs and
output of processes, without requiring self-joins that are
characteristic of relational database solutions.</p>
      <p>Use of the W3C’s RDF standard supports a linked data
approach, where statements about patients can use terms
that are defined in some external, public data set. We
believe that the flexibility of property graphs is valuable
in a standalone data integration effort, but that the
subject-predicate-object structure imposed by RDF is
more supportive of broader data interoperability, sharing
and linking. Likewise, the use of the SPARQL query
language and software libraries like RDF4J serves as a
protection against vendor lock-in (5).</p>
      <p>We limit ourselves to using class and property/predicate
terms from an ontology, which could theoretically use
the RDFS, SKOS or OWL schemas. Among other
things, this is an initial step in making the database
self-documenting, or free from dependency on an external
data dictionary. RDFS and OWL both define subclass
and subproperty relations that can be used for reasoning,
and OWL provides support for richer axioms (which
may or may not be supported by default reasoning levels
in RDF triplestores).</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>At the highest level of rigor, we limit our upper
ontologies to those that adhere to the principles of the
OBO Foundry and therefore ontological realism.</p>
      <p>The term “ontological realism” is used here to specifically mean
using the BFO as an upper ontology, and more generally
following the methodology advocated by Smith and Ceusters in
2010 (1) as closely as possible. At a minimum, this means
instantiating universal and generalizable classes of things, and
resisting the temptation to structure knowledge as topical
“concepts”. This is a top-down approach that emphasizes
computability and consistency between ontology artifacts.</p>
      <p>While similar, the terms reification and semantic instantiation are
used here to describe two different processes, both of which can
be performed in an automated fashion, after some initial
configuration by one or more people with domain knowledge and
ontological training:</p>
      <p>1. Seeing values like “M”, “3/11/1969” and “123456” in
one row from a data table; inspecting contextual
information like the table and column names and a data
dictionary; then coming to the following conclusions:</p>
      <p>a. “M” itself means ‘male gender identity datum’.</p>
      <p>b. While “3/11/1969” itself doesn’t mean
anything, it is associated with a datum about
someone’s birth. Likewise, “123456” is
associated with some thing that can denote the
person.</p>
      <p>2. Writing this knowledge, in the form of RDF triples, into
a semantic triplestore database. In part, an
approximation of this would look like</p>
      <p>a. :X a ‘male gender identity datum’ .</p>
      <p>b. :Y a ‘Homo sapiens’ .</p>
      <p>c. :X ‘is about’ :Y .</p>
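      <p>The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PennTURBO implementation: the example.org namespace, the instantiate function and the way individuals are named from the row identifier are hypothetical, while the OBO identifiers for ‘male gender identity datum’, ‘Homo sapiens’ and ‘is about’ are the ones used elsewhere in this paper. Minting the person IRI directly from “123456” is a simplification of the denotation pattern.</p>
      <preformat><![CDATA[
```python
# Hypothetical sketch: reify one table row and emit N-Triples lines.
OBO = "http://purl.obolibrary.org/obo/"
EX = "http://example.org/instance/"  # assumed namespace for minted individuals
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Step 1 (reification): what a coded value in the gender column stands for.
GENDER_CLASS = {"M": OBO + "OMRSE_00000141"}  # 'male gender identity datum'


def instantiate(row):
    """Step 2 (instantiation): return N-Triples lines for one table row."""
    datum = "<{}gid_{}>".format(EX, row["id"])
    person = "<{}person_{}>".format(EX, row["id"])
    return [
        # :X a 'male gender identity datum' .
        "{} {} <{}> .".format(datum, RDF_TYPE, GENDER_CLASS[row["gender"]]),
        # :Y a 'Homo sapiens' .
        "{} {} <{}NCBITaxon_9606> .".format(person, RDF_TYPE, OBO),
        # :X 'is about' :Y .
        "{} <{}IAO_0000136> {} .".format(datum, OBO, person),
    ]


triples = instantiate({"id": "123456", "gender": "M", "dob": "3/11/1969"})
print("\n".join(triples))
```
]]></preformat>
      <p>The date of birth is carried in the row but deliberately not instantiated here, since its modeling involves more than a single class assertion.</p>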
      <p>At this point in time, there is limited precedent for realism-based
semantic instantiations of EHRs. This paper primarily builds upon
ideas developed in the PennTURBO project (6,7). Beyond that,
one especially relevant paper describes realism-based
instantiation of electronic dental records (8), although the patient
data have not been made public. Bona, Nolan and Brochhausen
have generated realism-based RDF triples from the
non-image-related clinical data present in the Cancer Imaging Archive (9).</p>
      <p>
        Elkin and colleagues have applied natural language approaches to
electronic health records, resulting in property- and RDF graphs
that use terms from vocabularies such as SNOMED (
        <xref ref-type="bibr" rid="ref6 ref7">10,11</xref>
        ).
      </p>
      <p>
        Ceusters and colleagues have demonstrated the applicability of
their referent tracking approach to electronic health records (
        <xref ref-type="bibr" rid="ref8">12</xref>
        ),
and they have built a system that inserts referent tracking
statements into an RDF triplestore (
        <xref ref-type="bibr" rid="ref9">13</xref>
        ). Research at the
University of Murcia in Spain has resulted in several papers (
        <xref ref-type="bibr" rid="ref10">14</xref>
        )
describing a Semantic Web Integration Tool that can consume
data from XML files and relational databases and then apply
reasoning via the OWL API (
        <xref ref-type="bibr" rid="ref11">15</xref>
        ). At a minimum, they have
applied an archetype-based colorectal cancer classifier to a
500-patient subset from a 20,000-patient database (
        <xref ref-type="bibr" rid="ref12">16</xref>
        ). While
intriguing, it appears that their work doesn’t share many of the
objectives of this report: their terms were drawn broadly from the
NCBO BioPortal, without emphasizing realism; the inputs into
their classifier were instances of ‘histopathology report’, not
instances of ‘patient’ or ‘Homo sapiens’; there was little
discussion of loading statements into an RDF triplestore.
      </p>
      <p>In addition to qualitatively describing the experience of working
with various public, relational, EHR-like data sets, this paper uses
a competency question (CQ1) to ensure that the same result is
obtained after any data transformation. The question is: how
many white male patients, born between 1960 and 1980, have an
average systolic blood pressure between 110 and 130?</p>
      <sec id="sec-1-1">
        <title>DE-SynPUF data set</title>
        <p>
          The United States Centers for Medicare &amp; Medicaid Services
(CMS) provides a data set entitled “Data Entrepreneurs’ Synthetic
Public Use File (DE-SynPUF)” (
          <xref ref-type="bibr" rid="ref13">17</xref>
          ). Background information
provided by the CMS includes the following:
        </p>
        <p>“The DE-SynPUF was created with the goal of providing a
realistic set of claims data in the public domain while providing
the very highest degree of protection to the Medicare
beneficiaries’ protected health information.”</p>
        <p>The purposes of the DE-SynPUF are to:</p>
        <p>1. allow data entrepreneurs to develop and create software
and applications that may eventually be applied to
actual CMS claims data;</p>
        <p>2. train researchers on the use and complexity of
conducting analyses with CMS claims data prior to
initiating the process to obtain access to actual CMS
data; and,</p>
        <p>3. support safe data mining innovations that may reveal
unanticipated knowledge gains while preserving
beneficiary privacy.</p>
        <p>DE-SynPUF consists of five types of data, for the years 2008,
2009 and 2010:</p>
        <sec id="sec-1-1-1">
          <title>1. Beneficiary Summary</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>2. Inpatient Claims</title>
        </sec>
        <sec id="sec-1-1-3">
          <title>3. Outpatient Claims</title>
        </sec>
        <sec id="sec-1-1-4">
          <title>4. Carrier Claims</title>
        </sec>
        <sec id="sec-1-1-5">
          <title>5. Prescription Drug Events</title>
          <p>The DE-SynPUF page provides links to documentation, such as
the data dictionary. It is noted that the synthetic data generation
process may impose some limits on the usefulness of DE-SynPUF
for inferential research.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>MIMIC-III data set</title>
        <p>
          The Medical Information Mart for Intensive Care III (MIMIC-III)
data set (
          <xref ref-type="bibr" rid="ref14">18</xref>
          ) is described as “a large, freely-available database
comprising de-identified health-related data associated with over
forty thousand patients who stayed in critical care units of the
Beth Israel Deaconess Medical Center between 2001 and 2012.”
Among other information, MIMIC-III contains patient demographics,
laboratory test results, procedures, medications, diagnosis codes,
and vital sign measurements (~1 data point per hour).
        </p>
        <p>
          While the MIMIC-III website describes the data set as
“freely-available”, access is in fact limited: it requires completing
an application and providing proof of human subjects
research training. This is not surprising, given that MIMIC-III
consists of de-identified data (
          <xref ref-type="bibr" rid="ref15">19</xref>
          ) from actual patients of the Beth
Israel Deaconess Medical Center. Most significantly, MIMIC-III
licenses are issued on a per-individual basis, and derived data
must be shared via the limited-access PhysioNet website.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Synthea data set</title>
        <p>
          From the Synthea website:
“Synthea™ is an open-source, synthetic patient generator that
models the medical history of synthetic patients. Our mission is
to provide high-quality, synthetic, realistic but not real, patient
data and associated health records…. The resulting data is free
from cost, privacy, and security restrictions, enabling research
with Health IT data that is otherwise legally or practically
unavailable.” (
          <xref ref-type="bibr" rid="ref16">20</xref>
          )
A 1000-patient, pre-built Synthea data set is available for
download in multiple formats, including CSV, from the Synthea
homepage.
        </p>
        <p>
          In addition to the pre-built files, users can build their own Synthea
data sets by downloading Apache-licensed code from the Synthea
GitHub repository (
          <xref ref-type="bibr" rid="ref17">21</xref>
          ).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-2">
        <title>Loading public EHR-like data sets into a Relational Database</title>
        <p>DE-SynPUF</p>
        <p>
          The DE-SynPUF data set is split into 20 collections of
independent CSV files. All five types of data from all three years
were downloaded for this paper, but only from sample 1 of 20,
resulting in a data set with slightly more than 100,000 unique
patient identifiers. PostgreSQL schemas for the beneficiary,
inpatient claim and drug event CSV files were constructed with
the csvsql application from the csvkit (
          <xref ref-type="bibr" rid="ref18">22</xref>
          ) package, and then the
CSV files were imported into a PostgreSQL database.
        </p>
        <p>The three previously mentioned EHR-like data sets each have
distinct structures, in terms of how different kinds of data are
segmented into separate CSV files, how the columns are named,
etc. For all three to be directly instantiated into a triplestore, three
different sets of SPARQL statements would need to be developed.
Instead, we converted all three data sets into a single CDM. This
commitment to a CDM allows the use of a single set of SPARQL
statements, with minor modifications, to implement all three
instantiations.</p>
        <p>
          The Observational Medical Outcomes Partnership schema
(OMOP) was chosen based on literature evaluations (
          <xref ref-type="bibr" rid="ref20">24</xref>
          ) and
utility within the PennMedicine organization. Since the CDM
functions as a staging area between tabular data sets and the
semantic graph, limitations of OMOP from the perspective of
ontological realism (
          <xref ref-type="bibr" rid="ref21">25</xref>
          ) and concerns about accurate counting
(
          <xref ref-type="bibr" rid="ref22">26</xref>
          ) were tolerated.
        </p>
        <p>OMOP conversion tools exist for all three of the public data sets
under consideration and were used to load or migrate the data sets
into new PostgreSQL schemas in the OMOP format. At the time
this paper was written, OMOP schema version 6.0 was under
development, but the conversion tools all used a recent 5.x
schema.</p>
        <p>Full utilization of OMOP includes obtaining vocabularies from
their Athena system in order to understand the meaning of its
concept codes, like “8507” for “MALE”. While a semantic
instantiation will still need to map “8507” to a term like ‘male
gender identity datum’ (OMRSE_00000141), the use of a CDM
eliminates the need to map three different codes from three
different data sets.</p>
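        <p>The recoding step described above amounts to a lookup from OMOP concept identifiers to realism-based classes. The following minimal sketch assumes a hand-maintained Python dictionary; the function name recode is hypothetical, and the pairing of concept 8527 with OMRSE_00000184 is inferred from the race criteria used in this paper's competency question queries rather than stated as an explicit mapping.</p>
        <preformat><![CDATA[
```python
OBO = "http://purl.obolibrary.org/obo/"

# OMOP concept_id -> realism-based class IRI. Only pairs that appear in this
# paper are listed; a real mapping table would be much larger.
CONCEPT_TO_CLASS = {
    8507: OBO + "OMRSE_00000141",  # "MALE" -> 'male gender identity datum'
    8527: OBO + "OMRSE_00000184",  # race concept used in CQ1 (inferred pairing)
}


def recode(concept_id):
    """Return the class IRI for an OMOP concept_id, or None if unmapped."""
    return CONCEPT_TO_CLASS.get(concept_id)


print(recode(8507))
print(recode(0))  # unmapped codes are reported as None, not guessed
```
]]></preformat>
        <p>Because every source data set has already been converted to OMOP, one such table serves all of them.</p>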
        <p>One contributor to database population difficulties described
below could be minor mismatches between the schema created by
the Extract/Transform/Load (ETL) scripts and the most recent
OMOP vocabulary downloads.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Instantiation via Relational-to-RDF Mapping</title>
        <p>Realism-based statements about the Synthea data, from the
previously described OMOP schema, were loaded into an Ontotext
GraphDB triplestore with a relational-to-RDF (R2R) mapping
approach. Because all three EHR-like data sets evaluated in this
paper had been loaded into OMOP schemas, three different,
source-specific variable recodings were not required.</p>
        <p>The PennTURBO team uses Ontotext GraphDB as its primary
triplestore because it has multiple text indexing solutions, a
sophisticated web-based SPARQL development environment,
and a visual data exploration tool. Unfortunately, its OntoRefine
tool can only instantiate data from files, not database connections.
Because we couldn’t find any single R2R tool that met all of our
needs, the ability to instantiate triples from the contents of an
OMOP database was added to PennTURBO's existing
“Carnival/Drivetrain” data integration and harmonization
software suite. Carnival and Drivetrain have not yet been completely
released into the public domain.</p>
        <p>In order to demonstrate the general applicability of our approach,
we have also performed an instantiation with Stardog’s Virtual
Graph feature. Following the example of Carnival/Drivetrain, the
Synthea data in the OMOP PostgreSQL database were
instantiated as shallow “data models with shortcuts”, using terms
from the TURBO Ontology whose scope is limited to “data
space”, like the class ‘person data model’ (TURBO_0010161)
and the predicate ‘shortcut person data model to DOB (textual)’
(TURBO_0010085). Expansion of the shallow triples into
statements using OBO foundry terms, including the recoding of
categorical variables, was performed by writing federated</p>
        <p>Insertion of statements about the precedence of a given person’s
healthcare encounters did not require the typical comparison of
encounter dates, as OMOP populates a
“preceding_visit_occurrence_id” column in the
“visit_occurrence” table as part of the Synthea ETL.</p>
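        <p>As a rough illustration of why no date comparison is needed, the sketch below turns rows shaped like (visit_occurrence_id, preceding_visit_occurrence_id) into precedence statements. The toy rows, the ex: prefix and the ex:precedes predicate are hypothetical placeholders, not terms from the TURBO ontology or output of Drivetrain.</p>
        <preformat><![CDATA[
```python
# Toy rows shaped like (visit_occurrence_id, preceding_visit_occurrence_id).
visits = [(101, None), (102, 101), (103, 102)]


def precedence_triples(rows):
    """Emit one 'precedes' statement per row that names a preceding visit."""
    return [
        ("ex:visit_{}".format(prev), "ex:precedes", "ex:visit_{}".format(cur))
        for cur, prev in rows
        if prev is not None  # a person's first visit has no predecessor
    ]


triples = precedence_triples(visits)
for s, p, o in triples:
    print(s, p, o)
```
]]></preformat>
        <p>Each statement is derived from a single row, so the work is linear in the number of visits and independent of how the encounter dates are formatted.</p>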
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>DE-SynPUF</title>
        <p>DE-SynPUF’s “Beneficiary Summary” and “Prescription Drug
Events” files were found to have some useful overlap with the
patient demographics and medication-order tables in the
University of Pennsylvania’s clinical data warehouse.
“Beneficiary Summary” also contains some less relevant (or at
least off-topic) summary phenotype and claims/utilization data.
“Inpatient Claims”, “Outpatient Claims”, and “Carrier Claims”
all contain provider identifiers, dates, diagnosis codes, and
procedure codes. All of the claims files, especially “Carrier
Claims”, contain numerous columns for the financial aspects of
health insurance claims, which were not examined for this report.
Each of the claims tables uses multiple columns for
diagnosis and procedure codes, since the tables were normalized
to one row per claim. This is not especially appealing for input
into a semantic instantiation, in which diagnoses and codes are
first-class citizens, just like claims. “Prescription Drug Events”
refers to drugs by NDC codes, which are less desirable than
RxNorm codes, due to their higher granularity, or number of
codes per product/route/dose. DE-SynPUF provides values for
the patients’ dates of birth, genders and races, but no clinical
findings or measurements like height, weight or blood pressure.
See Table 1.</p>
        <p>Compared to the DE-SynPUF data set, MIMIC-III uses similar
vocabularies and contains essentially all of the same clinical
datatypes, with the addition of clinical observations and
measures, and free-text clinical notes. On the other hand,
MIMIC-III includes a much smaller number of patients: 1,152. Finally,
while the MIMIC-III data access policy isn’t unreasonable for an
individual investigator, it does pose a limitation for a
multi-investigator, collaborative effort.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Synthea</title>
        <p>Data generated with Synthea contain all of the clinical datatypes
present in the DE-SynPUF and MIMIC-III data sets, except for
the free-text notes that are available in MIMIC-III alone. Synthea
primarily refers to drugs with RxNorm codes, which is more
directly compatible with existing PennTURBO work than the
NDCs used in DE-SynPUF and MIMIC-III, yet it refers to
disorders with SNOMED codes, which is a minor incompatibility
with PennTURBO.</p>
        <p>The association of SNOMED codes with disorders is also a
conflation of concepts, which can be remedied with a realism
approach. According to the Ontology for General Medical
Science, a disorder is ‘A material entity which is clinically
abnormal and part of an extended organism. Disorders are the
physical basis of disease.’ In contrast, a SNOMED or ICD code
is an information entity, not a material entity, although it may
have an aboutness relationship with the patient, or some material
anatomical entity.</p>
        <p>Since Synthea offers scripted generation, records can be created
for any number of patients. The previously discussed 1000-patient
Synthea dataset “from Philadelphia” was selected over
DE-SynPUF and MIMIC-III for the rest of this report.</p>
        <p>CQ1.S, for the Synthea data in their native format:</p>
        <preformat>select count(*) from (
  select distinct p.id
  from native.patients p
  join native.observations o on o.patient = p.id
  where race = 'white'
    and gender = 'M'
    and birthdate between '1960-01-01' and '1980-01-01'
    and o.code = '8480-6'
  group by p.id
  having avg(cast(o.value as decimal(4, 1))) between 110 and 130
) as included</preformat>
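        <p>The logic of the native-format competency question can be checked on a few invented rows before it is run against a full database. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, so the native schema prefix is dropped and the cast targets real instead of decimal(4, 1); the three patients and their observations are made up for illustration and are not Synthea output.</p>
        <preformat><![CDATA[
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table patients (id text, race text, gender text, birthdate text);
create table observations (patient text, code text, value text);
insert into patients values
  ('p1', 'white', 'M', '1970-05-01'),  -- meets every criterion
  ('p2', 'white', 'M', '1950-05-01'),  -- born too early
  ('p3', 'white', 'F', '1970-05-01');  -- wrong gender
insert into observations values
  ('p1', '8480-6', '118.0'), ('p1', '8480-6', '126.0'),
  ('p2', '8480-6', '120.0'),
  ('p3', '8480-6', '120.0');
""")

# Same shape as CQ1.S: filter patients, average systolic BP per patient,
# keep averages between 110 and 130, then count the surviving patients.
query = """
select count(*) from (
  select p.id
  from patients p
  join observations o on o.patient = p.id
  where p.race = 'white' and p.gender = 'M'
    and p.birthdate between '1960-01-01' and '1980-01-01'
    and o.code = '8480-6'
  group by p.id
  having avg(cast(o.value as real)) between 110 and 130
) as included
"""
count = con.execute(query).fetchone()[0]
print(count)
```
]]></preformat>
        <p>Only the first toy patient satisfies every criterion, so the query returns a count of 1.</p>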
      </sec>
      <sec id="sec-3-3">
        <title>OMOP CDM</title>
        <p>
          The DE-SynPUF data didn’t require an ETL per se, as it can be
downloaded as CSV files (
          <xref ref-type="bibr" rid="ref23">27</xref>
          ) that are ready to be directly
imported into a relational database using the OMOP schema. The
Observational Health Data Sciences and Informatics
collaborative (OHDSI), which created the OMOP schema, also
provides scripts (
          <xref ref-type="bibr" rid="ref24">28</xref>
          ) for doing a complete load of DE-SynPUF,
as downloaded in its native CMS format, into the OMOP schema.
        </p>
        <p>
          When running the MIMIC-III OMOP ETL (
          <xref ref-type="bibr" rid="ref25">29</xref>
          ), it appeared that
a NOT NULL constraint was violated by at least one value in the
vocabulary_reference column from the vocabulary table, which
contains metadata about the vocabularies. Therefore, that NOT
NULL constraint was removed.
        </p>
        <p>OHDSI hosts a GitHub repository containing code for loading
Synthea data from CSV files into a PostgreSQL database that uses
the OMOP schema. A useful side effect of this process is loading
the same Synthea data, in its own native format, into another
PostgreSQL schema. Multiple solutions are provided for this task,
in support of multiple operating systems. The Synthea OMOP
ETL code is evolving over time, and minor bugs were observed
in the wrapper scripts each time the GitHub software repository
was fetched. In any case, the repository consistently contained all
of the SQL commands necessary to build the tables, load the data
and build reasonable indices.</p>
        <p>It appeared that the ETL correctly migrated most of the Synthea
observations table into the OMOP measurement table, but not the
units or values columns (see
https://github.com/OHDSI/ETLSynthea/issues/19). Sufficient keys were shared between the two
tables for the “unit_source_value” and “value_source_value”
columns in the measurement table to be repopulated after the fact.
Population of the “unit_concept_id” column was largely
automated by looking up the freshly loaded unit source values in
a map table created as part of the ETL process. Finally, OMOP’s
“value_as_number” column was copied from
“value_source_value” for each row in which a unit concept had
successfully been mapped.</p>
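        <p>The three repair steps described above can be sketched with Python's built-in sqlite3 module. This is an assumption-laden illustration, not the statements that were actually run: the shared key is represented by a hypothetical src_row column, the map table is a hypothetical unit_map, and 9999 is a placeholder rather than a real OMOP unit concept.</p>
        <preformat><![CDATA[
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Toy stand-ins; src_row, unit_map and concept 9999 are assumptions.
create table observations (src_row integer, units text, value text);
create table measurement (src_row integer, unit_source_value text,
  value_source_value text, unit_concept_id integer, value_as_number real);
create table unit_map (source_code text, target_concept_id integer);
insert into observations values (1, 'mmHg', '120.0'), (2, 'unknown', '7');
insert into measurement (src_row) values (1), (2);
insert into unit_map values ('mmHg', 9999);

-- Step 1: repopulate the source-value columns via the shared key.
update measurement set
  unit_source_value = (select o.units from observations o
                       where o.src_row = measurement.src_row),
  value_source_value = (select o.value from observations o
                        where o.src_row = measurement.src_row);

-- Step 2: look up a unit concept for each unit source value.
update measurement set unit_concept_id =
  (select u.target_concept_id from unit_map u
   where u.source_code = measurement.unit_source_value);

-- Step 3: copy the numeric value only where a unit concept was mapped.
update measurement set value_as_number = cast(value_source_value as real)
  where unit_concept_id is not null;
""")

rows = con.execute(
    "select src_row, unit_concept_id, value_as_number "
    "from measurement order by src_row"
).fetchall()
print(rows)
```
]]></preformat>
        <p>The second toy row keeps a null value_as_number because its unit has no entry in the map table, mirroring the condition stated above.</p>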
        <p>
          Migrations were generally performed as single-threaded
operations on a 64 GB Amazon Web Services server, running
PostgreSQL 11 and Ubuntu 18. The MIMIC-III and Synthea
ETLs each took over one hour. The RAM allocation was
decreased to 16 GB after the ETLs and indexing were completed.
        </p>
        <p>CQ1.O, for the Synthea data in an OMOP schema:</p>
        <preformat>select count(*) from (
  select
    p.person_id,
    avg(cast(m.value_source_value as decimal(4, 1)))
  from cdm_synthea10.person p
  join cdm_synthea10.measurement m on p.person_id = m.person_id
  where race_concept_id = 8527
    and gender_concept_id = 8507
    and birth_datetime between '1960-01-01' and '1980-01-01'
    and m.measurement_source_value = '8480-6'
  group by p.person_id
  having avg(cast(m.value_source_value as decimal(4, 1))) between 110 and 130
) as included</preformat>
        <p>After using Carnival to read from the 1000-patient Synthea
OMOP schema into a property graph, Drivetrain can perform
realism-based instantiation, RDFS+ reasoning, and import of
additional ontologies and linked data sets in roughly ten minutes.
Instantiating a topical subset of the data, like patient
demographics alone, can be completed with the Stardog Virtual
Graph + GraphDB federation approach in roughly the same
amount of time, but the subsequent steps have not been automated
outside of Drivetrain and are therefore more time consuming.</p>
        <p>
          The PennTURBO ontology (
          <xref ref-type="bibr" rid="ref26">30</xref>
          ) is loaded into its own named
graph, as are the Monarch Disease Ontology, the Drug Ontology
and the Chemical Entities of Biological Interest ontology. Several RDF
linked data sets are also imported, in order to link clinical codes
to labels and other relationships (while remaining wary of their
concept orientation): RxNorm, Vaccines Administered (CVX)
(
          <xref ref-type="bibr" rid="ref27">31</xref>
          ) and SNOMED. The RxNorm file is conveniently available
for download from the NCBO BioPortal. Generating the
SNOMED and CVX files requires a conversion from the UMLS
.nlm format to the .RRF format with MetaMorphoSys, loading
that into a MySQL database, and then writing that to RDF with
scripts from NCBO (
          <xref ref-type="bibr" rid="ref28 ref29">32,33</xref>
          ).
        </p>
        <p>
CQ1.R, for the Synthea data in a realism-based graph:
        </p>
        <preformat>PREFIX : &lt;http://transformunify.org/ontologies/&gt;
PREFIX efo: &lt;http://www.ebi.ac.uk/efo/&gt;
PREFIX obo: &lt;http://purl.obolibrary.org/obo/&gt;
PREFIX pmbb: &lt;http://www.itmat.upenn.edu/biobank/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
select (count(distinct ?patient) as ?count)
where
{
  {
    select ?patient (avg(xsd:float(?sbpv)) as ?avgsbpv)
    where {
      graph pmbb:expanded {
        ?mgidInst a obo:OMRSE_00000141 ;
          obo:IAO_0000136 ?patient .
        ?wridInst a obo:OMRSE_00000184 ;
          obo:IAO_0000136 ?patient .
        ?dob a efo:EFO_0004950 ;
          obo:IAO_0000136 ?sns ;
          obo:IAO_0000004 ?dobValue .
        ?patient a obo:NCBITaxon_9606 ;
          :TURBO_0000303 ?sns ;
          obo:RO_0000056 ?encounter ;
          obo:RO_0000087 ?patientRole .
        ?patientRole a obo:OBI_0000093 ;
          obo:BFO_0000054 ?encounter .
        ?encounter a obo:OGMS_0000097 .
        ?bpassay a obo:VSO_0000006 ;
          obo:BFO_0000050 ?encounter ;
          obo:OBI_0000299 ?sbpdatum1 .
        ?sbpdatum1 a obo:HTN_00000001 ;
          obo:OBI_0001938 ?svs ;
          obo:IAO_0000221 ?bpq .
        ?bpq a obo:VSO_0000004 ;
          obo:RO_0000052 ?patient .
        ?svs a :TURBO_0010149 ;
          obo:IAO_0000039 obo:UO_0000272 ;
          obo:OBI_0002135 ?sbpv .
        filter(?dobValue &gt; "1960-01-01"^^xsd:date &amp;&amp;
               ?dobValue &lt; "1980-01-01"^^xsd:date)
      }
    }
    group by ?patient
  }
  filter(?avgsbpv &gt; 110 &amp;&amp; ?avgsbpv &lt; 130)
}</preformat>
        <sec id="sec-3-3-2">
          <p>At least a partial understanding of the resulting RDF triples can
be inferred from the previous SPARQL query. Additionally,
Figure 1 provides a visualization of some of the data items and
aboutness relationships, and Figure 2 illustrates denotation and
mentioning patterns. All of the RDF triples generated for this
paper are available as a compressed n-quads RDF file, which is
further described in the discussion section.</p>
          <p>A handful of design patterns are proposed in this instantiation and
have already been the subject of some collaborative evaluation.
We invite further feedback from those who have read this paper
and/or loaded a dump of our work into their own triplestore.
Highlighted Patterns:</p>
          <p>What should we aspire towards in terms of succinct,
consistent, and semantically clear aboutness patterns?
We are currently asserting that racial and gender identity
datums are about the patient, but date of birth is about
the ‘start of neonate stage’ that the patient participates
in. Some clinical measurements, like blood pressure, are
asserted to be about a quality inhering in the patient, and
supported with the instantiation of a specific assay class
and a value specification with units. What then would a
‘body mass index’ datum be about?</p>
          <p>Perhaps we are being overconfident in translating values
of “M” from the person.gender_source_value column as
instances of class ‘male gender identity datum’,
OMRSE_00000141. The Ontology of Medically Relevant
Social Entities defines gender identity datums as being
the output of gender identification processes. If the "M"
value is based on genotype data or a health care
professional’s examination of external genitalia, is the
resulting datum really about gender identity? To support
this inquiry, male and female biological sex datum
classes have been added to the TURBO ontology, along
with defined classes for the union of male gender
identity datums and male biological sex datums (along
with the analogous case for females).</p>
          <p>PennTURBO’s Drivetrain application has the capability
of making data-driven inferences, even in the case of
missing or contradictory data. The conclusion drawing
process and its evidence are instantiated explicitly in the
graph, and collaborators can specify what rules they
want to apply. For example, the presence of three male
gender identity (or biological sex?) datums and one
female datum might lead to the inference that a female
biological sex quality inheres in the patient. Can
inferences about the population in which a patient is a
member be drawn from racial identity datums? Is this an
ontological question or a societal question?</p>
          <p>How can dates of birth be expressed succinctly and in
adherence to the ontological realism methodology?
Synthea/OMOP “birth_datetime” values have been
modeled as instances of ‘date of birth’ (EFO_0004950),
which is placed into the TURBO ontology as a subclass
of ‘time measurement datum’ (IAO_0000416). The
‘date of birth’ instances take xsd:date literals and are
about ‘start of neonate stage’ (UBERON_0035946)
process boundary instances. A ‘born on’ property
(TURBO_0000303) is used to link the patient to the
process boundary, and has the following characteristics:
domain 'Homo sapiens'; range 'start of neonate stage';
definition “'participates in' o inverse (starts)”.
Supplemental class definitions and property chains are
being added to the TURBO ontology and the PCORowl
ontology.</p>
          <p>
How should we use information content entities for
denoting, given that identifying values from a database
might be rigorously maintained by some authority or
could just be auto-generated primary keys? We have
illustrated these cases with a ‘centrally registered
identifier’ (CRID) for denoting the patient, and a
‘database primary key’ from the TURBO ontology for
the encounter. (This mimics the actual situation we face
with our EHR.). Also, the Information Artifact Ontology
(IAO) defines a CRID as having some part that denotes
some 'centrally registered identifier registry’, but IAO
does not include any class that is defined as denoting
registries. Based on the fact that the domain of ‘denotes’
is ‘information content entity’, we have added ‘identifier
source’, ‘identifier source denoter’ and ‘registry denoter’
classes, as subclasses of the slightly more specific ‘data
item’ class, into the TURBO ontology. Additionally,
what datatype predicate should bind the lexical
representation of an identifier to the symbol part of a
CRID or primary key? Because IAO does not appear to
have a suitable predicate, a ‘has representation’
predicate has been added to the TURBO ontology and
will shortly be added to OBI.</p>
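          <p>The date-of-birth pattern and the ‘born on’ property chain can be illustrated with a minimal sketch. It is not the PennTURBO implementation: triples are plain Python tuples, readable labels stand in for IRIs, all ex: instance names are hypothetical, and the intermediate ‘neonate stage’ instance is an assumption introduced only so that the chain ‘participates in’ o inverse (starts) has something to traverse.</p>
          <preformat><![CDATA[
```python
from datetime import date

# Readable labels stand in for IRIs; only the term IDs in comments come from the text.
TYPE = "rdf:type"
IS_ABOUT = "is about"
PARTICIPATES_IN = "participates in"
STARTS = "starts"
BORN_ON = "born on"  # TURBO_0000303

# Hypothetical instance data for one synthetic patient.
triples = {
    ("ex:patient1", TYPE, "Homo sapiens"),
    ("ex:neonateStage1", TYPE, "neonate stage"),  # assumed intermediate
    ("ex:neonateStart1", TYPE, "start of neonate stage"),  # UBERON_0035946
    ("ex:dob1", TYPE, "date of birth"),  # EFO_0004950
    ("ex:patient1", PARTICIPATES_IN, "ex:neonateStage1"),
    ("ex:neonateStart1", STARTS, "ex:neonateStage1"),
    ("ex:dob1", IS_ABOUT, "ex:neonateStart1"),
}

# The xsd:date literal carried by the 'date of birth' datum.
literals = {"ex:dob1": date(1970, 1, 1)}


def derive_born_on(graph):
    """Materialize 'born on' as 'participates in' o inverse('starts')."""
    derived = set()
    for patient, p1, stage in graph:
        if p1 != PARTICIPATES_IN:
            continue
        for boundary, p2, started in graph:
            if p2 == STARTS and started == stage:
                derived.add((patient, BORN_ON, boundary))
    return derived


def birth_date(patient, graph, values):
    """Follow 'born on' to the boundary, then find the datum that is about it."""
    full = graph | derive_born_on(graph)
    for s, p, boundary in full:
        if s == patient and p == BORN_ON:
            for datum, p2, target in full:
                if p2 == IS_ABOUT and target == boundary and datum in values:
                    return values[datum]
    return None


print(birth_date("ex:patient1", triples, literals))
```
]]></preformat>
          <p>With the shortcut materialized, a query for a patient’s date of birth needs only two hops (‘born on’, then inverse ‘is about’), while the datum itself remains about the process boundary rather than about the patient.</p>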
          <p>How can we provide context for clinical codes without violating ontological realism principles or suggesting that the graph contains knowledge that was recklessly interpreted from the codes? Medication codes and “condition” codes are present in the Synthea/OMOP data set. We say that these codes, manifested as URIs for classes, are mentioned by diagnoses and prescriptions, which are in turn the outputs of ‘health care encounters’. Since the objects of these (instance-level) ‘mentions’ statements are defined in their source ontologies as classes, this is a case of OWL 2 punning. When the graph is combined with additional public ontologies and clinical linked data sets (see Instantiation, above), it becomes possible to indirectly answer real collaborator requests like “count the patients with diabetes who were taking statin drugs.” However, the graph does not truly know which disease dispositions inhere in the patients, or which drugs were actually ingested; it knows only the codes that were assigned or recorded.</p>
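          <p>A minimal sketch of the ‘mentions’ pattern follows. It is an illustration under stated assumptions, not the actual graph: the code: class names, the ex: instances, and the tiny subclass hierarchy are all hypothetical stand-ins for terms that would come from the clinical knowledgebases, and triples are plain Python tuples rather than RDF.</p>
          <preformat><![CDATA[
```python
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
MENTIONS = "mentions"
HAS_OUTPUT = "has output"
PARTICIPATES_IN = "participates in"

triples = {
    # Class-level statements, as they might come from public ontologies (hypothetical).
    ("code:simvastatin", SUBCLASS_OF, "code:statin"),
    ("code:type2_diabetes", SUBCLASS_OF, "code:diabetes"),
    # Patient 1: a diabetes diagnosis and a statin prescription.
    ("ex:pt1", PARTICIPATES_IN, "ex:enc1"),
    ("ex:enc1", HAS_OUTPUT, "ex:dx1"),
    ("ex:dx1", TYPE, "diagnosis"),
    ("ex:dx1", MENTIONS, "code:type2_diabetes"),  # punning: a class used as an object
    ("ex:enc1", HAS_OUTPUT, "ex:rx1"),
    ("ex:rx1", TYPE, "prescription"),
    ("ex:rx1", MENTIONS, "code:simvastatin"),
    # Patient 2: a diabetes diagnosis only.
    ("ex:pt2", PARTICIPATES_IN, "ex:enc2"),
    ("ex:enc2", HAS_OUTPUT, "ex:dx2"),
    ("ex:dx2", TYPE, "diagnosis"),
    ("ex:dx2", MENTIONS, "code:diabetes"),
    # Patient 3: a statin prescription only.
    ("ex:pt3", PARTICIPATES_IN, "ex:enc3"),
    ("ex:enc3", HAS_OUTPUT, "ex:rx3"),
    ("ex:rx3", TYPE, "prescription"),
    ("ex:rx3", MENTIONS, "code:simvastatin"),
}


def with_subclasses(root, graph):
    """Return the root class plus all of its transitive subclasses."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for s, p, o in graph:
            if p == SUBCLASS_OF and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def patients_with_mention(kind, root_code, graph):
    """Patients whose encounters output a record of `kind` mentioning the code."""
    codes = with_subclasses(root_code, graph)
    records = {s for s, p, o in graph
               if p == MENTIONS and o in codes and (s, TYPE, kind) in graph}
    encounters = {s for s, p, o in graph if p == HAS_OUTPUT and o in records}
    return {s for s, p, o in graph if p == PARTICIPATES_IN and o in encounters}


cohort = (patients_with_mention("diagnosis", "code:diabetes", triples)
          & patients_with_mention("prescription", "code:statin", triples))
print(len(cohort), sorted(cohort))
```
]]></preformat>
          <p>The result counts patients for whom both kinds of code were recorded. Nothing in the sketch asserts that a disease inheres in a patient or that a drug was ingested, which is the restraint that the ‘mentions’ pattern is meant to preserve.</p>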
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Other than the current work, we are not aware of any realism-based instantiation of a data set that is representative of an EHR and is also publicly available. The previously mentioned instantiation of an EDR comes closest, but it is not available for public review because it contains PHI.</p>
      <p>The data set that was ultimately instantiated in this paper represents 1,000 synthetic patients. While the PennTURBO team has experience instantiating tens of thousands of patients, more work is required to determine how this method will scale to hundreds of thousands or millions of patients. It is possible that enterprise versions of the triplestore applications will be required, along with hardware and operating system optimization.</p>
      <p>There are some differences among the data that are available in Synthea, the data that can fit in an OMOP schema, and the data that we already routinely instantiate in PennTURBO (independent of the work described in this paper). Synthetic genomic data are not available and have no table in the OMOP schema, but PennTURBO does make statements about predicted loss-of-function calls when they are available for Penn Medicine patients. As part of that, we instantiate the specimens that were collected from patients as part of a health care encounter and that went on to serve as the input into a chain of sequencing and bioinformatics processes. There is an OMOP specimen table, but synthesizing specimen data with Synthea might require writing a plugin. Synthea and OMOP support data about health care procedures, and we plan to instantiate those data, as well as the procedure data in our EHR.</p>
      <p>The PennTURBO team routinely runs its instantiations through RDFS+ reasoning, but neither PennTURBO nor this Synthea/OMOP work has been run through higher levels of reasoning, like OWL-Horst.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We believe that graph models of health care data will enable faster question answering and cohort building than is possible with existing relational EHRs or clinical data warehouses. Because we wish to develop this approach collaboratively, in a way that fosters interoperability and peer review, we have constructed an RDF graph model of synthetic health care data, synthea_graph_exportable.nq, and have shared it at http://doi.org/10.5281/zenodo.2641233.</p>
      <p>One of our requirements for this project was identifying a source of sharable healthcare data whose contents are as similar as possible to the clinical data warehouse that provides the majority of our information. DE-SynPUF and MIMIC-III were considered, but Synthea was chosen as having both the most relevant data and a suitable redistribution policy. Specifically, Synthea allows the generation of any number of observations and includes numerical and qualitative clinical findings. MIMIC-III would be a good choice in a setting where the value of free-text clinical notes outweighed the inconvenience of the more restrictive license.</p>
      <p>The three data sources were staged in the OMOP common data model in order to minimize the effort required to become familiar with each source’s structure, and also because we anticipate using the OMOP model both for ingesting data sources complementary to our clinical warehouse and as a format for sharing portions of the clinical warehouse with other medical research institutions. Scripts for transforming each of the three “sharable” data sources into an OMOP model are available at OHDSI’s GitHub software repository. While these scripts dramatically decrease the effort required to perform the transformations, users should be prepared to do a small amount of debugging.</p>
      <p>We have briefly demonstrated the ability to migrate the Synthea data from an OMOP-formatted PostgreSQL relational database into RDF with two methods: our internal “Carnival/Drivetrain” software suite, and the Virtual Graph feature of the Stardog triplestore, in federation with the GraphDB triplestore (which also serves as the final destination).</p>
      <p>A competency question, representative of a cohort-building query, was applied to the Synthea data in its native format, to the same data in an OMOP schema, and to the corresponding RDF triples. The same answer was obtained in all three cases.</p>
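      <p>The kind of cross-format check described above can be sketched as follows. Everything in the sketch is an assumption made for illustration: the question (how many distinct patients have a given condition code), the two-column CSV layout, the code values, the in-memory SQLite database standing in for the OMOP-formatted PostgreSQL database, and the simplified triple pattern. It shows only the shape of the comparison, not the competency question or the queries used in this work.</p>
      <preformat><![CDATA[
```python
import csv
import io
import sqlite3

# Hypothetical, heavily simplified condition records.
SYNTHETIC_CSV = """PATIENT,CODE
p1,C1
p1,C1
p2,C2
p3,C1
"""


def read_rows(text):
    """Parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def count_native(text, code):
    """Answer the question directly against the CSV rows."""
    return len({r["PATIENT"] for r in read_rows(text) if r["CODE"] == code})


def count_relational(text, code):
    """Answer the question with SQL over an OMOP-like table (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE condition_occurrence "
        "(person_id TEXT, condition_source_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?)",
        [(r["PATIENT"], r["CODE"]) for r in read_rows(text)],
    )
    (n,) = conn.execute(
        "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
        "WHERE condition_source_value = ?",
        (code,),
    ).fetchone()
    conn.close()
    return n


def count_triples(text, code):
    """Answer the question over a simplified patient/diagnosis/code triple pattern."""
    triples = set()
    for i, r in enumerate(read_rows(text)):
        dx = f"ex:dx{i}"
        triples.add((r["PATIENT"], "has diagnosis", dx))
        triples.add((dx, "mentions", r["CODE"]))
    matching = {s for s, p, o in triples if p == "mentions" and o == code}
    return len({s for s, p, o in triples if p == "has diagnosis" and o in matching})


answers = {
    "native": count_native(SYNTHETIC_CSV, "C1"),
    "relational": count_relational(SYNTHETIC_CSV, "C1"),
    "triples": count_triples(SYNTHETIC_CSV, "C1"),
}
print(answers)
assert len(set(answers.values())) == 1, "formats disagree"
```
]]></preformat>
      <p>Agreement across the three counts suggests that no patients or codes were lost or duplicated during the transformations, at least for the slice of data that the question touches.</p>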
      <p>We encourage readers to download our Synthea triples from the address above and load them into any RDF triplestore. The TURBO ontology is included, but the supporting RxNorm, CVX and SNOMED clinical knowledgebases are not. An RDF representation of RxNorm can be obtained from the NCBO BioPortal, but obtaining RDF models of CVX and SNOMED requires performing a multi-step conversion from the Unified Medical Language System (<xref ref-type="bibr" rid="ref28 ref29">32,33</xref>).</p>
      <p>We are especially interested in hearing feedback about our ‘is about’ relations, the way we ‘mention’ diagnosis and medication codes via OWL 2 punning, and the way we denote entities with either centrally registered identifiers or database primary keys, depending on our confidence that the identifier is truly centrally registered. Several of these topics are already the subjects of active GitHub issues such as https://github.com/obiontology/obi/issues/985.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We benefitted from valuable collaborations with Amanda Hicks,
our PennTURBO colleagues (David Birtwell, Hayden Freedman,
Heather Williams), ontologists from the Eukaryotic Pathogen
database (Jie Zheng and John Judkins), and numerous members
of the OBI development community.</p>
      <p>This work was done as part of the PennTURBO project, which is
supported by the Institute for Biomedical Informatics and by the
Institute for Translational Medicine and Therapeutics at the
University of Pennsylvania.</p>
    </sec>
    <sec id="sec-7">
      <title>Address for correspondence</title>
      <p>Mark A. Miller, markampa@pennmedicine.upenn.edu</p>
      <sec id="sec-7-1">
        <title>TURBO|PennTURBO Documentation [Internet].</title>
        <p>TURBO|PennTURBO Documentation. Available from: https://pennturbo.github.io/Turbo-Documentation/</p>
        <p>Schleyer TK, Ruttenberg A, Duncan W, Haendel M, Torniai C, Acharya A, et al. An ontology-based method for secondary use of electronic dental record data. AMIA Summits Transl Sci Proc. 2013 18;2013:234–8.</p>
        <p>Bona JP, Nolan TS, Brochhausen M. Ontology-Enhanced
Representations of Non-image Data in The Cancer Imaging
Archive. 2018;6.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Smith</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceusters</surname>
            <given-names>W.</given-names>
          </string-name>
          <article-title>Ontological realism: A methodology for coordinated evolution of scientific ontologies</article-title>
          .
          <source>Appl Ontol</source>
          .
          <year>2010</year> Nov 15;
          <volume>5</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>139</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Arp</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spear</surname>
            <given-names>AD</given-names>
          </string-name>
          .
          <article-title>Building Ontologies with Basic Formal Ontology</article-title>
          . The MIT Press;
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Stevens</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lord</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malone</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matentzoglu</surname>
            <given-names>N.</given-names>
          </string-name>
          <article-title>Measuring expert performance at manually classifying domain entities under upper ontology classes</article-title>
          .
          <source>J Web Semant</source>
          . <year>2018</year> Sep 6;
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Alocci</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariethoz</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horlacher</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolleman</surname>
            <given-names>JT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campbell</surname>
            <given-names>MP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lisacek</surname>
            <given-names>F.</given-names>
          </string-name>
          <article-title>Property Graph vs RDF Triple Store: A Comparison on Glycan Substructure Search</article-title>
          .
          <source>PLoS ONE</source>
          [Internet].
          <year>2015</year> Dec 14 [cited 2019 Jul 12];
          <volume>10</volume>
          (
          <issue>12</issue>
          ). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4684231/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Stoeckert</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birtwell</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freedman</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>ICBO_2018_12: Transforming and Unifying Research with Biomedical Ontologies: The Penn TURBO project</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schlegel</surname>
            <given-names>DR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crowner</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehoullier</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elkin</surname>
            <given-names>PL</given-names>
          </string-name>
          .
          <article-title>HTPNLP: A New NLP System for High Throughput Phenotyping</article-title>
          .
          <source>Stud Health Technol Inform</source>
          .
          <year>2017</year>
          ;
          <volume>235</volume>
          :
          <fpage>276</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schlegel</surname>
            <given-names>DR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bona</surname>
            <given-names>JP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elkin</surname>
            <given-names>PL</given-names>
          </string-name>
          .
          <article-title>Comparing Small Graph Retrieval Performance for Ontology Concepts in Medical Texts</article-title>
          . In:
          <string-name>
            <surname>Wang</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>C</given-names>
          </string-name>
          , editors.
          <source>Biomedical Data Management and Graph Online Querying</source>
          . Springer International Publishing;
          <year>2016</year>
          . p.
          <fpage>32</fpage>
          -
          <lpage>44</lpage>
          . (Lecture Notes in Computer Science).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ceusters</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>CY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Clinical Data Wrangling using Ontological Realism and Referent Tracking</article-title>
          . :
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          13.
          <string-name>
            <surname>Manzoor</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceusters</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudnicki</surname>
            <given-names>R</given-names>
          </string-name>
          .
          <article-title>Implementation of a Referent Tracking System</article-title>
          .
          <source>Int J Healthc Inf Syst Inform</source>
          . <year>2007</year> Oct;
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          14.
          <article-title>Semantic Web Integration Tool (SWIT)</article-title>
          [Internet]. [cited 2019 Mar 30]. Available from: http://sele.inf.um.es/swit/publications.html
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          15.
          <article-title>OWL API by owlcs</article-title>
          [Internet].
          [cited 2019 Mar 30]. Available from: http://owlcs.github.io/owlapi/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          16.
          <string-name>
            <surname>Fernández-Breis</surname>
            <given-names>JT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maldonado</surname>
            <given-names>JA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcos</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legaz-García</surname>
            <given-names>M del C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moner</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres-Sospedra</surname>
            <given-names>J</given-names>
          </string-name>
          , et al.
          <article-title>Leveraging electronic healthcare record standards and semantic web technologies for the identification of patient cohorts</article-title>
          .
          <source>J Am Med Inform Assoc JAMIA</source>
          .
          <year>2013</year>
          Dec;
          <volume>20</volume>
          (
          <issue>e2</issue>
          ):
          <fpage>e288</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          17.
          <article-title>CMS 2008-2010 Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF)</article-title>
          [Internet]. <year>2014</year> [cited 2019 Mar 26]. Available from: https://www.cms.gov/research-statistics-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          18. MIMIC [Internet]. [cited 2019 Mar 26]. Available from: https://mimic.physionet.org/about/mimic/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          19.
          <string-name>
            <surname>Johnson</surname>
            <given-names>AEW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollard</surname>
            <given-names>TJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehman</surname>
            <given-names>LH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghassemi</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          .
          <source>Sci Data</source>
          .
          <year>2016</year> May 24;
          <volume>3</volume>
          :
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          20.
          <article-title>Synthea by the Standard Health Record Collaborative</article-title>
          [Internet]. [cited 2019 Mar 26]. Available from: https://synthetichealth.github.io/synthea/#about-landing
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          21.
          <article-title>Synthetic Patient Population Simulator. Contribute to synthetichealth/synthea development by creating an account on GitHub</article-title>
          [Internet]. synthetichealth; <year>2019</year> [cited 2019 Mar 26]. Available from: https://github.com/synthetichealth/synthea
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          22.
          <article-title>A suite of utilities for converting to and working with CSV, the king of tabular file formats.: wireservice/csvkit</article-title>
          [Internet]. wireservice; <year>2019</year> [cited 2019 Mar 26]. Available from: https://github.com/wireservice/csvkit
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          23.
          <article-title>MIMIC Code Repository: Code shared by the research community for the MIMIC-III database: MIT-LCP/mimic-code</article-title>
          [Internet]. MIT Laboratory for Computational Physiology; <year>2019</year> [cited 2019 Mar 26]. Available from: https://github.com/MIT-LCP/mimic-code
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          24.
          <string-name>
            <surname>Garza</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Fiol</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walden</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zozus</surname>
            <given-names>MN</given-names>
          </string-name>
          .
          <article-title>Evaluating common data models for use with a longitudinal community registry</article-title>
          .
          <source>J Biomed Inform</source>
          .
          <year>2016</year>
          ;
          <volume>64</volume>
          :
          <fpage>333</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          25.
          <string-name>
            <surname>Blaisure</surname>
            <given-names>JC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceusters</surname>
            <given-names>WM</given-names>
          </string-name>
          .
          <article-title>Improving the 'Fitness for Purpose' of Common Data Models through Realism Based Ontology</article-title>
          .
          <source>AMIA Annu Symp Proc</source>
          . <year>2018</year> Apr 16;
          <volume>2017</volume>
          :
          <fpage>440</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ceusters</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blaisure</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>A Realism-Based View on Counts in OMOP's Common Data Model</article-title>
          .
          <source>Stud Health Technol Inform</source>
          .
          <year>2017</year>
          ;
          <volume>237</volume>
          :
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          27.
          <article-title>LTS Computing Downloads</article-title>
          [Internet].
          <source>LTS Computing Downloads</source>
          . Available from: http://www.ltscomputingllc.com/downloads/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          28.
          <article-title>Workproducts to ETL CMS datasets into OMOP Common Data Model: OHDSI/ETL-CMS</article-title>
          [Internet]. Observational Health Data Sciences and Informatics; <year>2019</year> [cited 2019 Mar 26]. Available from: https://github.com/OHDSI/ETL-CMS
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          29.
          <article-title>Mapping the MIMIC-III database to the OMOP schema. Contribute to MIT-LCP/mimic-omop development by creating an account on GitHub</article-title>
          [Internet]. MIT Laboratory for Computational Physiology; <year>2019</year> [cited 2019 Mar 27]. Available from: https://github.com/MIT-LCP/mimic-omop
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          30.
          <article-title>The TURBO ontology</article-title>
          [Internet].
          <source>The TURBO ontology</source>
          . Available from: https://raw.githubusercontent.com/PennTURBO/Turbo-Ontology/master/ontologies/turbo_merged.owl
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          31.
          <article-title>IIS | Code Sets | CVX | Vaccines | CDC</article-title>
          [Internet]. <year>2018</year> [cited 2019 Jul 15]. Available from: https://wcms-wp-testbr.cdc.gov/php-app-template/index.php
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          32.
          <article-title>These python scripts connect to the Unified Medical Language System (UMLS) database and translate the ontologies into RDF/OWL files. This is part of the BioPortal project.: ncbo/umls2rdf</article-title>
          [Internet]. National Center for Biomedical Ontology; <year>2019</year> [cited 2019 Jul 15]. Available from: https://github.com/ncbo/umls2rdf
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          33.
          <article-title>UMLS - Rich Release Format MySQL Load Script</article-title>
          [Internet]. [cited 2019 Jul 15]. Available from: https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_RRF_MySQL_Output_Stream.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>