<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Distributed Analytics Platform to Execute FHIR-based Phenotyping Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Md. Rezaul Karim</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Binh-Phi Nguyen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Zimmermann</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toralf Kirsten</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
          <xref ref-type="aff" rid="aff8">8</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Löbe</string-name>
          <xref ref-type="aff" rid="aff8">8</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Meineke</string-name>
          <xref ref-type="aff" rid="aff8">8</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger Stenzhorn</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Kohlbacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Decker</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oya Beyan</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biomolecular Interactions, Max Planck Institute for Developmental Biology</institution>
          ,
          <addr-line>Tübingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Bioinformatics Tübingen, University of Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Computer Science, University of Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Fraunhofer FIT</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Informatik 5, RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Institute for Translational Bioinformatics, University Hospital Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Quantitative Biology Center, University of Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>University of Applied Sciences Mittweida</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff8">
          <label>8</label>
          <institution>University of Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Despite the benefits of reusing health data collected in routine care, sharing datasets outside organizational boundaries is not always possible because of legal and ethical restrictions. The Personal Health Train (PHT) is a novel privacy-preserving approach to executing analytics tasks at distributed data repositories without sharing the data itself. In this work, we report a proof-of-concept implementation of the PHT that uses FHIR data standards and the Clinical Quality Language (CQL). Semantic Web and containerization technologies are used to move computations to the data. We developed tools to design phenotyping algorithms on the data consumer side and implemented an infrastructure to transfer Docker containers to the data centers, execute them there, and return the results to the consumers. We evaluated the PHT infrastructure and tools by designing a phenotyping algorithm for a case-control study of diabetes mellitus and prostate cancer risk and executing it at three distributed FHIR repositories.</p>
      </abstract>
      <kwd-group>
        <kwd>Distributed analytics</kwd>
        <kwd>Data reuse</kwd>
        <kwd>Personal Health Train</kwd>
        <kwd>Phenotyping</kwd>
        <kwd>HL7 CQL</kwd>
        <kwd>FHIR</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The conventional way of re-purposing EHR data is on-demand sharing of research
data sets. However, because the curation process is labor-intensive and costly,
it ultimately prevents researchers from exploiting the full benefit of EHRs. Moreover,
the sensitive nature of health data and the enforcement of the EU General Data
Protection Regulation (GDPR) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] obstruct naive data sharing approaches.
      </p>
      <p>
        Novel distributed learning approaches, such as the Personal Health Train
(PHT)<sup>1</sup>, provide an alternative to conventional data sharing. The PHT introduces the
concept of sharing analytical algorithms rather than data: tasks are executed at
the data source in a tightly regulated manner, without revealing the primary data
directly to the requesting data consumer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this train metaphor, stations are
the data-containing repositories, which are discoverable and reusable; trains
are the analytical tasks, which can be queries, statistical models, or algorithms;
and tracks are the rules of interaction and the underlying exchange infrastructure.
      </p>
      <p>
        Recently, the German Federal Ministry of Education and Research
initiated the Medical Informatics Initiative (MII) funding concept<sup>2</sup> to support
digitalization in medicine, and funded four large consortia (HiGHmed,
DIFUTURE, MIRACUM, and SMITH). Each consortium develops an infrastructure
to support cross-institution data exchange and implements a set of use cases to
demonstrate data reuse for various purposes, including research and data-driven
medicine. The core element of the concept is the establishment of Data
Integration Centers (DICs) at university hospitals, together with innovative ways to link data,
information, and knowledge from health care, clinical, and biomedical research
across the boundaries of sites<sup>3</sup>. The authors are members of two of the funded
MII consortia: SMITH [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DIFUTURE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We developed an architecture
and a reference implementation of the PHT approach to execute distributed
analytics over the distributed data that will be located in the DICs of the German
MII. In this paper, we report a specific use case of the reference
implementation, which focuses on exchanging phenotyping algorithms between the DICs
and performing statistical analysis for the selected cohorts. This implementation is
based on HL7 standards<sup>4</sup>, using Fast Healthcare Interoperability Resources (FHIR)
and the Clinical Quality Language (CQL)<sup>5</sup>.
      </p>
      <p>In our research, we developed a FHIR standard-based distributed analytics
platform by utilizing Semantic Web and containerization technologies. The
platform has a user-friendly interface to generate CQL for use
in phenotyping algorithms, and it provides the infrastructure to send these algorithms to multiple
DICs, interact with the FHIR resources to compute the requested analytics,
and return the outcomes to the data consumer.</p>
      <p>The rest of the paper is structured as follows: Section 2 briefly discusses related
work. Section 3 describes the phenotyping algorithms. Section 4 presents the
main PHT concepts, and Section 5 describes the proposed approach and the
implementation. Section 6 outlines some initial evaluation results based on a
sample use case. Finally, limitations of the study and some future work
are discussed before concluding the paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In order to facilitate knowledge discovery for both humans and machines, FAIR
data principles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been proposed, which suggest a set of guiding principles
to make research data Findable, Accessible, Interoperable, and Reusable. These
guiding principles are promising in the discovery, access, integration, and analysis
of task-appropriate scientific data and associated algorithms and workflows.
        <fn><label>1</label><p>https://www.dtls.nl/fair-data/personal-health-train/</p></fn>
        <fn><label>2</label><p>http://www.medizininformatik-initiative.de/en/start</p></fn>
        <fn><label>3</label><p>https://www.bmbf.de/de/medizininformatik-3342.html</p></fn>
        <fn><label>4</label><p>http://www.hl7.org/implement/standards/</p></fn>
        <fn><label>5</label><p>http://www.hl7.org/implement/standards/product_brief.cfm?product_id</p></fn>
      </p>
      <p>
        The GoFAIR PHT implementation network initiative, which is an adoption
of FAIR, aims to increase the reuse of existing biomedical data for research in
personalized healthcare, preventive medicine, and value-based healthcare.
Recently, Jochems et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Deist et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed the PHT approach based
on Semantic Web technologies. The underlying information system architecture
enables learning from privacy-sensitive data without the data ever crossing
organizational boundaries, thereby maintaining control over the data, preserving data privacy,
and overcoming legal and ethical issues common to other forms of data
exchange. The key concept in the PHT is to bring algorithms to the data rather than
bringing data to a central repository, which gives controlled access to
heterogeneous data sources while ensuring maximum privacy protection and engagement
of individual patients.
      </p>
      <p>
        Core to realizing both PHT and FAIR are Semantic Web technologies [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
which provide a framework for data sharing and reuse by making the
semantics of data machine-interpretable. CQL, in turn, is an HL7 specification
that aims to provide a rule-based way to define clinical quality measures and
decision support rules. One of the key features of CQL is that it makes
logic expressions independent of any specific data model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Use Case: Phenotyping Algorithms</title>
      <p>
        Phenotyping algorithms are used to identify cohorts, i.e., groups of patients with
certain characteristics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], for purposes such as testing novel therapies or
recruiting for clinical trials. They can be described as rule sets, mostly
represented as decision trees. These rule sets typically describe detailed patient
characteristics and health conditions, and may draw on various data types, such
as structured data and molecular data, as well as machine learning algorithms, such as
text mining [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>A number of studies have been published describing automated phenotyping
techniques employed by medical organizations. However, owing to the sensitive
nature of the data, most institutions have developed their own systems. The
PHT approach can help to share and execute phenotyping algorithms at multiple
data centers without sharing the privacy-sensitive data required for identifying
cohorts.</p>
      <p>In this study, we leveraged HL7 CQL to model and exchange phenotyping
algorithms across DICs. CQL provides a standard, machine-interpretable, and
exchangeable representation of the rules of phenotyping algorithms, which can be
executed against FHIR resources. We provide a user-friendly web-based application
for writing phenotyping algorithms as CQL queries. With this tool, users can design
their phenotype definitions on the fly and specify any previously implemented algorithms or
statistical models to be executed on the selected cohort.</p>
    </sec>
    <sec id="sec-4">
      <title>Concepts</title>
      <p>In this section, we present the main concepts of the PHT and its
components in the proposed architecture. The PHT approach is based on the principle
that data do not leave their origin; instead, the analytics are transferred to
the data sources by a gateway. Once a data analytics task is dispatched, the
owner of the task is detached from the process, and the gateway controls routing to
the DICs and collects the outcomes of the tasks. The DICs, which are coupled with
computation power, run the tasks and return the trained models or results to
the requester via the gateway. We define three core entities, namely Curator
(Station), Consumer (Train), and Gateway (Handling Station), as the main components
of the proposed architecture.</p>
      <p>Curators provide data sets, metadata, and the computational power required to
execute analytics tasks. They are conceptualized as the Train Stations. They
integrate privacy-sensitive data from multiple sources in secure storage that
can be accessed by authenticated and authorized parties. They act as FAIR data
points by publishing schemas, metadata, and access protocols. Our assumption
is that the data are privately hosted across stations, but the schema and metadata
are public.</p>
      <p>Train Stations play three main roles: (i) publishing the metadata and schemas,
which are required for the discoverability of the data and for the definition of
input parameters of the analytics task; (ii) providing an access mechanism and
an interface to evaluate and execute the data queries, either to negotiate the
availability of data sets or to extract the data to feed the analytics tasks; (iii)
providing a secure enclave to execute Dockerized computations on the extracted
data during the computation phase. In our implementation, DICs play the role
of the Train Stations.</p>
      <p>Consumers are real-world persons or services who aim to access potentially
privacy-sensitive data residing at multiple repositories in order to execute analytics tasks,
e.g., statistical analysis, machine learning, cohort identification, or record linkage.
Consumers are responsible for defining their data requirements by formulating
queries, implementing and containerizing their analytics tasks, and specifying
how the different results generated by each curator will be aggregated.
Consumers build Trains and send them to the Gateway.</p>
      <p>A Train has four main components: (i) a Query, which is executed at the
Train Station to retrieve the input data for the analytics task; (ii) an Analytics
algorithm, which encapsulates the main task in a container; (iii) an Aggregator,
which defines the methods to aggregate and post-process the results; and (iv)
Metadata, which describe and keep track of a range of information, e.g., the owner
and purpose of the task, access rights, and execution provenance.</p>
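      <p>The four components of a Train can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, image name, and the example aggregator are illustrative assumptions and are not taken from the reference implementation.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TrainMetadata:
    """Descriptive information carried by a train (field names are illustrative)."""
    owner: str
    purpose: str
    visited_stations: List[str] = field(default_factory=list)  # execution provenance


@dataclass
class Train:
    """Container for the four train components named in the text."""
    query: str            # e.g. a CQL expression, kept as plain text
    analytics_image: str  # hypothetical name of the container holding the algorithm
    aggregator: Callable[[List[Dict[str, int]]], Dict[str, int]]  # combines station results
    metadata: TrainMetadata


def sum_counts(results: List[Dict[str, int]]) -> Dict[str, int]:
    """Example aggregator: add up the patient counts reported by each station."""
    return {"count": sum(result["count"] for result in results)}


train = Train(
    query='define "InDemographic": true',
    analytics_image="example/bmi-analysis",
    aggregator=sum_counts,
    metadata=TrainMetadata(owner="researcher-a", purpose="cohort count"),
)
train.metadata.visited_stations.append("station.1")
print(train.aggregator([{"count": 3}, {"count": 4}]))
```
]]></preformat>
      <p>Keeping the aggregator as a plain callable reflects the idea that the consumer, not the station, decides how per-station results are combined.</p>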
      <p>The Gateway is a point of trust. It provides common interfaces to describe,
transfer, and execute trains, as well as to supervise their transmission. In our
implementation, the Handling Station acts as the Gateway between consumers
and curators. The Handling Station provides protocols to communicate with
consumers and stations. It receives and forwards trains, keeps a
privacy-preserving index of stations, and provides a uniform and trusted execution interface to
the stations. During the training process, it keeps the provenance of trains, e.g.,
the stations to which a train has been forwarded and its execution status. The Handling Station
also has the role of a broker: it publishes aggregated data schemas and
vocabularies to describe the data in the stations, and directs trains to the relevant
stations.</p>
    </sec>
    <sec id="sec-5">
      <title>Proposed Approach and Implementation</title>
      <p>Stations pull the trains, execute them, and push them back to the Train
Registry. Stations interact with trains via the standardized Train API, for example
to check the requirements of a train and to control the life cycle of its
execution (e.g., timeouts and observation of machine resource consumption). Once all
scheduled training is completed, the consumer is notified and receives the
outcome of the tasks. In the current implementation, results are
aggregated at the consumer site.</p>
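      <p>Aggregation at the consumer site can be as simple as summing the counts that each station returns. The sketch below assumes, purely for illustration, that every station reports a dictionary mapping a group label to a patient count; the result format of the actual implementation is not specified in the text.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, List


def aggregate_counts(station_results: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-group patient counts over all stations (assumed result format)."""
    total: Counter = Counter()
    for result in station_results:
        total.update(result)  # Counter.update adds counts instead of replacing them
    return dict(total)


# Hypothetical results returned by three stations.
results = [
    {"normal": 2, "obese": 1},
    {"obese": 3},
    {"normal": 1},
]
print(aggregate_counts(results))
```
]]></preformat>
      <p>Only the aggregated counts are combined here; no patient-level records are needed at the consumer site for this kind of summary.</p>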
      <p>The current implementation focuses on methods for building trains and on
establishing protocols for exchanging them. It does not cover how data will be
curated in each station or how various data formats will be transformed into
agreed, consumable data standards. We assume that each data station has a FHIR server and a
CQL evaluation engine service, and that data are available
as FHIR resources in JSON format.</p>
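      <p>To illustrate what working with FHIR resources in JSON format involves, the following sketch collects the condition codes from a FHIR Bundle that has already been parsed into a Python dictionary. It relies only on the standard Bundle layout (an entry list whose items carry a resource) and on the code.coding element of Condition; the sample bundle and the helper name are made up for the example.</p>
      <preformat><![CDATA[
```python
from typing import Any, Dict, Set


def condition_codes(bundle: Dict[str, Any]) -> Set[str]:
    """Return all coding codes found in the Condition resources of a FHIR Bundle."""
    codes: Set[str] = set()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Condition":
            continue  # skip Patient, Observation, and other resource types
        for coding in resource.get("code", {}).get("coding", []):
            if "code" in coding:
                codes.add(coding["code"])
    return codes


# Hand-written sample bundle; 44054006 is the SNOMED CT code for type 2 diabetes mellitus.
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {
            "resource": {
                "resourceType": "Condition",
                "code": {"coding": [{"system": "http://snomed.info/sct", "code": "44054006"}]},
            }
        },
    ],
}
print(condition_codes(sample_bundle))
```
]]></preformat>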
      <p>We assume that DICs are trusted parties and that they use FHIR as the data
representation standard. The current proof-of-concept implementation does not audit
the containers, nor does it inspect them to check whether they leave with any
privacy-sensitive information. The following subsections present the implementation
details.</p>
      <sec id="sec-5-1">
        <title>Data and metadata preparation</title>
        <p>We generated synthetic data for about 2,500 patients using the Synthea tool. This
enables us to use the data without concern for legal or privacy restrictions. Patient
records are based on a set of de-identified data recommended by clinicians
and real-world statistics collected by the CDC, NIH, and other sources.</p>
        <p>Each patient is simulated independently from birth to the present day, with
diseases, conditions, and medical care modeled as a progression of states and the
transitions between them. Thus, for each synthetic patient, the data contain a
complete medical history, including medications, medical encounters, and social
determinants of health. We also introduced minor biases into patient observations,
e.g., weight, height, blood pressure, and heart rate. Conditions and observations
are encoded with SNOMED<sup>2</sup> and LOINC<sup>1</sup>, respectively. We converted the patient data
into a set of FHIR resource bundles based on observation and condition value
sets and distributed them across three DICs to simulate different hospitals.</p>
        <p>
          The metadata schema is based on the condition<sup>3</sup> and observation<sup>4</sup> resource
examples provided by FHIR; the basic metadata information was collected by looking
up corresponding examples in the Meaningful Use Value Sets from the United
States Health Information Knowledge Base<sup>5</sup>. The schema was first extracted
following an approach inspired by the literature [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The resulting schema was then encoded in the
FHIR standard and publicly exposed using a dedicated FHIR server.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Train building</title>
        <p>Consumers build trains by using a phenotype design client. This client provides a
web user interface to specify phenotyping algorithms, to generate CQL queries,
and to create Docker images containing the metadata, the query, and an algorithm.
The query builder accesses the metadata deployed at a publicly accessible schema
endpoint. Users then interact with this metadata and write their queries in CQL.</p>
        <p>For the phenotyping artifacts, FHIR model version 3.0.0 is used as the
primary data model to support the FHIR STU3 standard within the library:
all artifact logic written in CQL is wrapped in a library in the FHIR client.
The data model supports a subset of resources, including MedicationRequest,
Observation, and Condition. Once the CQL is generated, it is sent to the CQL
evaluation engine for syntactic validation before encapsulation and shipping.</p>
        <p>Consumers then generate a package containing the metadata of the
phenotype algorithm, the query, and the script that specifies the phenotype
computation mechanism, and send it to the Handling Station. During the training process, real-time
progress can be tracked using the train monitor module. The result service
API is activated; it listens for changes, refreshes the status, and notifies the
users when a result is returned from a station.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Train dispatching and routing</title>
        <p>Once a package containing the metadata and CQL queries is shipped, it is
wrapped into a Docker image, which is then pushed into a known Docker registry
instance. This instance is known to all stations and the train issuer. The Docker
tag designates the station that should pull the image next. We currently use the
syntax station.&lt;id&gt;, where &lt;id&gt; is an integer that unambiguously identifies a
station.</p>
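        <p>The tag convention described above is easy to generate and to parse. The following minimal sketch shows one way to do so; the function names are hypothetical, and the only detail taken from the text is the station.&lt;id&gt; syntax with an integer identifier.</p>
        <preformat><![CDATA[
```python
import re
from typing import Optional

# Matches exactly "station." followed by one or more digits.
_TAG_PATTERN = re.compile(r"station\.(\d+)")


def station_tag(station_id: int) -> str:
    """Build the Docker tag that designates the station to pull the image next."""
    if station_id < 0:
        raise ValueError("station id must be a non-negative integer")
    return f"station.{station_id}"


def parse_station_tag(tag: str) -> Optional[int]:
    """Return the station id encoded in a tag, or None if the tag has another form."""
    match = _TAG_PATTERN.fullmatch(tag)
    return int(match.group(1)) if match else None


print(station_tag(3))
print(parse_station_tag("station.12"))
print(parse_station_tag("latest"))
```
]]></preformat>
        <p>A station could compare the parsed identifier with its own to decide whether a newly tagged image is addressed to it.</p>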
        <p>Each station continuously monitors the Docker registry to see whether it
is supposed to pull an image and run it locally (that is, to
execute the phenotyping algorithm using the local FHIR resources). Stations
can be configured to scan a particular namespace of the Docker registry for new
images. Currently, we use the Portus<sup>6</sup> authorization service and Docker registry
frontend to manage the authentication for the stations. The user interface (UI)
of Portus also allows monitoring of the train images that are produced.
        <fn><label>1</label><p>https://loinc.org/</p></fn>
        <fn><label>2</label><p>https://www.snomed.org/</p></fn>
        <fn><label>3</label><p>https://www.hl7.org/fhir/condition-examples.html</p></fn>
        <fn><label>4</label><p>https://www.hl7.org/fhir/observation-examples.html</p></fn>
        <fn><label>5</label><p>https://ushik.ahrq.gov/ValueSets?system=mu</p></fn>
        <fn><label>6</label><p>http://port.us.org/</p></fn></p>
        <p>Routing is currently delegated to a simple Spring Boot service, which
knows all present stations and hence can tag the Docker images accordingly.
Once a station has processed an image, it pushes the newly created
image containing the updated model to the Docker registry, which then sends a
notification to the routing service. This service then determines the next station
the train is supposed to visit and tags the pushed Docker image accordingly. The
next station is then able to find the image that it should execute next.</p>
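        <p>The routing step can be sketched as a lookup in a predefined route. The example below is written in Python for illustration only (the routing service itself is a Spring Boot application), and it assumes a fixed, ordered list of station identifiers; the text does not specify how the next station is actually chosen.</p>
        <preformat><![CDATA[
```python
from typing import List, Optional


def next_station_tag(route: List[int], finished_station: int) -> Optional[str]:
    """Return the tag for the station that follows finished_station on the route.

    Returns None when the train has visited the last station.
    Raises ValueError if finished_station is not part of the route.
    """
    position = route.index(finished_station)
    if position + 1 >= len(route):
        return None  # route completed, nothing left to tag
    return f"station.{route[position + 1]}"


route = [1, 2, 3]  # hypothetical visiting order of three stations
print(next_station_tag(route, 1))
print(next_station_tag(route, 3))
```
]]></preformat>
        <p>In this sketch the function would be called whenever the registry notifies the routing service that a station has pushed an updated image.</p>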
      </sec>
      <sec id="sec-5-4">
        <title>Distributed phenotyping</title>
        <p>Based on the scheduling suggested by the routing module, the train travels to
different DICs, where the CQL query is executed inside a Docker container.
First, the queries are evaluated and validated. Then the request is scheduled for
processing within a secure enclave, where the query and algorithm are executed.
The query returns a FHIR resource bundle from the FHIR server of the DIC. Once
the data are available, the phenotyping algorithms are executed on the data, as specified
by the algorithm script, inside the same Docker container.</p>
        <p>The station is also responsible for updating train states depending on events
and repeatedly sends status updates to the issuer by invoking the result service
endpoint via HTTP POST requests, without ever directly granting access to the data.
After the computation is finalized, the phenotype results and workflows are
pushed back to the Docker registry, the version of the Docker image is updated in the
registry, and the station invokes the result service API and posts the results
to the issuer.</p>
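        <p>A status update of this kind could be prepared as an HTTP POST request with a small JSON body. The sketch below only builds the request object and does not send it; the endpoint URL and the payload fields are assumptions made for illustration, since the text does not describe the format used by the result service.</p>
        <preformat><![CDATA[
```python
import json
from urllib.request import Request


def build_status_request(endpoint: str, train_id: str, station_id: int, state: str) -> Request:
    """Prepare (but do not send) a POST request carrying a train status update."""
    payload = {"train": train_id, "station": station_id, "state": state}
    return Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and identifiers.
request = build_status_request("http://localhost:8080/results", "train-1", 2, "finished")
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```
]]></preformat>
        <p>Passing the prepared request to urllib.request.urlopen would submit it; note that the payload contains only status information and no patient data.</p>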
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>
        We evaluated our approach by partially simulating a population-based
case-control study from the literature [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which examines
the risk of prostate cancer (PCa) among men with Type 2 Diabetes Mellitus
(T2DM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In that study, prostate cancer risk categories among men with
T2DM were carefully characterized with regard to glucose-lowering therapy, duration of
disease, body mass index (BMI), and circulating levels of glycated hemoglobin
(HbA1c).
      </p>
      <p>That study showed a reduced risk of being diagnosed with PCa among men
with T2DM, especially for low-risk tumors. Obese diabetic men (BMI &gt; 30 kg/m<sup>2</sup>)
showed a reduced risk compared to men without diabetes. We used this study
as an inspiring use case: we designed phenotyping algorithms to specify PCa and
T2DM cohorts and calculated the BMI for the queried subpopulation at each
station. We used the synthetic data created as FHIR resources and distributed
them across three stations (DICs) hosted at the University of Tübingen, RWTH Aachen
University, and an AWS EC2 cloud, respectively. The data contain approximately
400 PCa and 120 T2DM cases. A HAPI FHIR server was used to host the resources.</p>
      <p>We asked users to define PCa and T2DM phenotypes by using diagnostic
codes, observations such as HbA1c, and medication data. Via the phenotype design
client, four phenotypes were created and named as the four arms of the study:
(i) PCa positive and T2DM negative; (ii) PCa positive and T2DM
positive; (iii) PCa negative and T2DM negative; (iv) PCa negative and T2DM
positive. Each named phenotype is generated as a CQL query and validated with
the CQL engine service. Then, to identify the cohort for each arm, CQL
is used to query the characteristics of each patient from the FHIR server, and
the CQL engine service collects the information of all patients who satisfy the
condition.</p>
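      <p>The four arms are the combinations of two yes/no phenotypes. As a minimal sketch, and assuming that the two phenotype results are available as boolean flags per patient (in the platform itself the cohorts are selected by CQL queries), the assignment and per-arm counting could look as follows.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def study_arm(pca_positive: bool, t2dm_positive: bool) -> str:
    """Name the study arm for one patient from the two phenotype flags."""
    pca = "PCa positive" if pca_positive else "PCa negative"
    t2dm = "T2DM positive" if t2dm_positive else "T2DM negative"
    return f"{pca} and {t2dm}"


def count_arms(flags: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count how many patients fall into each arm."""
    return dict(Counter(study_arm(pca, t2dm) for pca, t2dm in flags))


# Hypothetical (PCa, T2DM) flags for four patients.
patients = [(True, False), (True, True), (False, False), (True, True)]
print(count_arms(patients))
```
]]></preformat>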
      <p>BMI is calculated for the arms that include T2DM-positive cases. The
BMI calculation algorithm is implemented in Python and selected by the user via
the web interface. Once the BMI algorithm is selected, the FHIR
resources required to query height and weight are included in the CQL query. Additionally,
for each arm, a count algorithm is included to return the number of patients
selected for the specified phenotype.</p>
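      <p>A BMI and count computation of this kind can be sketched in a few lines. This is not the authors' script: the input format (weight in kilograms, height in centimeters, one dictionary per patient) is an assumption, and the obesity threshold of 30 kg/m2 follows the definition quoted above from the reference study.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by squared height in meters."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def summarize(patients: List[Dict[str, float]]) -> Dict[str, int]:
    """Return the cohort size and the number of patients with a BMI above 30."""
    values = [bmi(p["weight_kg"], p["height_cm"]) for p in patients]
    return {"count": len(values), "obese": sum(1 for value in values if value > 30)}


# Hypothetical cohort of two patients.
cohort = [
    {"weight_kg": 81.0, "height_cm": 180.0},
    {"weight_kg": 100.0, "height_cm": 170.0},
]
print(summarize(cohort))
```
]]></preformat>
      <p>Returning only the count and the group size keeps individual measurements inside the station.</p>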
      <p>Listing 1.1. Part I: CQL query showing the library, data model, context, and terminology
definitions (code system, value sets, codes).</p>
      <p>An example of the generated CQL is shown in Listings 1.1 and 1.2. As shown in the
CQL, the InDemographic statement is a default, mandatory statement for measuring the
population; it is defined as a condition that characterizes the disease
in terms of patient characteristics observable from clinical data. Once the user has
defined and validated the phenotyping algorithms, a Docker image is built
and published at the Docker registry. Each of the three stations monitoring the
Docker registry pulls the image, performs the computation over the retrieved FHIR
bundles, updates the image, and pushes it back.</p>
      <p>Results and status updates are then received through the result service
endpoint and aggregated there. Table 1 shows the distribution of the patients
into different groups based on BMI.</p>
      <preformat>define "InDemographic":
  "ProstateCancerPositive" and "Type2DMPositive"

define "ProstateCancerPositive": exists (
  [Condition] C where ToCode(C.code.coding) ~ "CarcinomaOfProstate")

define "Type2DMPositive": exists (
  [Condition] C where ToCode(C.code.coding) ~ "DiabetesMellitusType2") and not exists (
  [Condition] C where ToCode(C.code.coding) ~ "DiabetesMellitus1") or exists (
  [MedicationRequest] MR where ToCode(MR.medication.coding[0]) ~ "Insulin")
  and ToQuantity(Last([Observation] O where ToCode(O.code.coding) ~ "FastingGlucose"
    sort by effective.value).value as Quantity
  ).value &gt; 200 and ToQuantity(Last([Observation] O
    where ToCode(O.code.coding) ~ "HemoglobinA1C"
    sort by effective.value
  ).value as Quantity).value &gt;= 6.5</preformat>
      <p>Listing 1.2. Part II: CQL query showing the InDemographic and statement definitions
for population and condition.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Outlook</title>
      <p>In this work, we presented a FHIR standard-based approach for distributed
phenotyping that reuses EHRs for research purposes by means of Semantic Web and Docker
technologies. We developed a proof-of-concept implementation of the PHT
approach to exchange computations instead of sharing the data. We briefly
presented the system architecture, concepts, implementation choices, and the overall
workflow. From the user's perspective, our approach enables query formulation
against privacy-sensitive data sources and subsequent evaluation of that request
in a secure enclave at the data provider's end.</p>
      <p>Initial experiments on distributed phenotyping for the T2DM and PCa risk
case-control use case show that our approach using CQL, FHIR, and Docker can
overcome the reliance of previous approaches on agreeing upon a shared schema
and encoding a priori, in favor of more flexible schema extraction based on FHIR
standards. Further, we specified the main concepts of the PHT, such as train,
station, and gateway. Our current PHT implementation is limited to specifying
basic protocols to define trains and exchange them between stations. The first
results are promising; however, extensions are required to achieve the full power of the PHT
approach. Additional functionalities such as authentication and authorization,
intelligent routing, and trust services are planned for the future.</p>
      <p>Another limitation of the current work is the absence of
privacy-preserving technologies such as differential privacy, encryption, and secure computation.
The proposed architecture could be extended with a selected approach of this kind. In our
work, we did not focus on the curation of data at the train stations. We assumed
that the data are described with FHIR standards. There are numerous challenges
in creating FHIR resources from the operational systems in hospitals.
Additionally, various standards are applied by other communities, such as OMOP,
OpenEHR, and FHIR. Communication between these standards remains
another challenge to explore.</p>
      <p>This work was supported by the German Federal Ministry of Education and
Research (BMBF) as part of the SMITH consortium (MRK, OB, and SD, grant no.
01ZZ1803K; TK, ML, and FM, grant no. 01ZZ1609A) and the DIFUTURE
consortium (OK, HS, and LZ, grant no. 01ZZ1804D).</p>
      <p>This work was conducted jointly by RWTH Aachen University, Tübingen
University, and Fraunhofer FIT as part of the PHT and GoFAIR implementation
network, which aims to develop a proof-of-concept information system to address
current data reusability challenges.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cowie</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomster</surname>
            ,
            <given-names>J.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curtis</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          , et al.:
          <article-title>Electronic health records to facilitate clinical research</article-title>
          .
          <source>Clinical Research in Cardiology 106(1)</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Shields</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alset</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boal</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          , et al.:
          <article-title>Conjunctival tumors in 5002 cases. Comparative analysis of benign versus malignant counterparts</article-title>
          .
          <source>American journal of ophthalmology 173</source>
          (
          <year>2017</year>
          )
          <fpage>106</fpage>
          -
          <lpage>133</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chassang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The impact of the EU General Data Protection Regulation on scientific research</article-title>
          .
          <source>ecancermedicalscience</source>
          <volume>11</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karim</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlbacher</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenzhorn</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Schema extraction for privacy preserving processing of sensitive data</article-title>
          .
          <source>life sciences 1(39) 48</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stäubert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiche</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischoff</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daumke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funkat</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gewehr</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          , et al.:
          <article-title>Smart medical information technology for healthcare (SMITH)</article-title>
          .
          <source>Methods of information in medicine 57(S 01)</source>
          (
          <year>2018</year>
          )
          <fpage>e92</fpage>
          -
          <lpage>e105</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Prasser</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlbacher</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mansmann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>Data integration for future medicine (DIFUTURE)</article-title>
          .
          <source>Methods of information in medicine 57(S 01)</source>
          (
          <year>2018</year>
          )
          <fpage>e57</fpage>
          -
          <lpage>e65</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Appleton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomberg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boiten</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>da Silva Santos</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourne</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          , et al.:
          <article-title>The FAIR guiding principles for scientific data management and stewardship</article-title>
          .
          <source>Scientific data 3</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jochems</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
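          <string-name>
            <surname>Deist</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Soest</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.: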
          <article-title>Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital - A real life proof of concept</article-title>
          .
          <source>Radiotherapy and Oncology</source>
          <volume>121</volume>
          (
          <issue>3</issue>
          ) (
          <year>2016</year>
          )
          <fpage>459</fpage>
          -
          <lpage>467</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Deist</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jochems</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Soest</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nalbantov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oberije</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walsh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eble</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bulens</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coucke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dries</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dekker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT</article-title>
          .
          <source>Clinical and Translational Radiation Oncology</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>24</fpage>
          -
          <lpage>31</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The semantic web</article-title>
          .
          <source>Scientific American 284(5)</source>
          (
          <year>2001</year>
          )
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prud'Hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          :
          <article-title>Developing a semantic web-based framework for executing the clinical quality language using FHIR</article-title>
          .
          <source>CEUR-WS.org</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Shivade</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fosler-Lussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>A review of approaches to identifying patient phenotype cohorts using electronic health records</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ) (
          <year>2013</year>
          )
          <fpage>221</fpage>
          -
          <lpage>230</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Löbe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stäubert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Towards phenotyping of clinical trial eligibility criteria</article-title>
          .
          <source>Studies in health technology and informatics 248</source>
          (
          <year>2018</year>
          )
          <fpage>293</fpage>
          -
          <lpage>299</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pierce</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plymate</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrander</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanford</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Diabetes mellitus and prostate cancer risk</article-title>
          .
          <source>The Prostate</source>
          <volume>68</volume>
          (
          <issue>10</issue>
          ) (
          <year>2008</year>
          )
          <fpage>1126</fpage>
          -
          <lpage>1132</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Fall</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garmo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gudbjornsdottir</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stattin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zethelius</surname>
            ,
            <given-names>B.O.</given-names>
          </string-name>
          :
          <article-title>Diabetes mellitus and prostate cancer risk; a nationwide case-control study within PCBaSe Sweden</article-title>
          .
          <source>Cancer Epidemiology and Prevention Biomarkers</source>
          (
          <year>2013</year>
          )
          <fpage>cebp</fpage>
          -
          <lpage>1046</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>