<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pipeline for Population and Analysis of Personal Health Knowledge Graphs (PHKGs)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dagmar Celuchova Bosanska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Huptych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lenka Lhotská</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Czech Institute of Informatics, Robotics, and Cybernetics</institution>
          ,
          <institution>Czech Technical University in Prague</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Biomedical Engineering, Czech Technical University in Prague</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Personal Health Knowledge Graphs (PHKGs) are not yet ubiquitous, even though they have great potential to enrich the general knowledge captured in various Knowledge Graphs by adding personal context. This poster paper presents work in progress on a pipeline for generating PHKGs from tree-structured Electronic Health Record (EHR) data by applying a hierarchical ontological approach. This pipeline could also be applied to other domains of Personal Knowledge Graphs. Moreover, the pipeline targets the intersection between the symbolic representation of knowledge used for computational semantics and the numeric graph data representation used for graph analysis and machine learning. We present the first results from applying this pipeline to synthetic EHRs of patients diagnosed with colorectal cancer (generated with Synthea). The resulting numeric representation of PHKGs or their subgraphs can be used in many practical graph algorithms. Finally, our pipeline study points to future research on how this numeric representation of PHKGs should be embedded into a continuous, low-dimensional vector space to utilize graph machine learning and deep learning methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Personal Health Knowledge Graphs</kwd>
        <kwd>Ontology</kwd>
        <kwd>Graph algorithms</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Personal health knowledge graphs (PHKGs) represent structured information about entities
related to a patient’s health and well-being, attributes, and relations between them. Unlike
Knowledge Graphs (KGs), PHKGs are not yet ubiquitous. PHKGs should be generated for
individual patients from numerous information sources such as electronic health records (EHRs),
wearables and mobile health apps, sensors, and patient-annotated texts and notes related to the
patient’s condition. In principle, PHKGs can be populated by all techniques described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] if
the information source contains personal data or data related to a patient and her health and
well-being. However, no agreed-upon way to represent and populate PHKGs exists. For instance,
a KG for asthma can describe causes, symptoms, and treatments for asthma, and a PHKG can
be the subgraph containing just those causes, symptoms, and treatments that apply to a given
patient [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Another view is that PHKGs can be used to add personal context to KGs
and to help develop personalized diagnoses, recommendations, and treatments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>This poster paper presents work in progress on a pipeline for populating a PHKG
from several data sources. First, the data must be harmonized according to a flexible
data model that is suited to data analysis. Then, we apply this approach to populating a PHKG from data
stored in EHRs. In addition, ongoing work will provide evidence that this approach generalizes
to Personal Knowledge Graphs (PKGs) and other data sources, and enables linking PHKGs with KGs and generating new KGs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. How to represent a PHKG and set up a pipeline</title>
      <p>
        The proposed representation of a PHKG is composed of two elements: a domain graph and
a mapping from the nodes and edge labels of the data graph to those of the domain graph in
which they are called entities and relation-types, respectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The domain graph defines
the schema of the PHKG, that is, its high-level structure, which can evolve more flexibly than a schema
for a relational model. Using a harmonizing data model, the Simple Event Model Ontology
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], enables us to view EHR data in the HL7 FHIR RDF format as chains of events
and sub-events evolving in time. This ontology provides a schema that simplifies further analysis and
manipulation of PHKGs. Its practicality has been demonstrated for EventKG, an event-centric temporal
knowledge graph [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In our case of PHKGs, events are central elements in representing
a patient’s experience with a concrete disease. This experience includes visits (encounters),
reported complaints and symptoms (for example, in the form of observations), performed
procedures, prescribed medications, finalized diagnostic reports, and even applied care plans.
For self-management, such events can be, for example, exercising, self-measurements, or sleep
monitoring. This ontology allows us to write easy-to-understand SPARQL queries without a
deep understanding of domain ontologies such as HL7 FHIR RDF. The PROV-O
ontology can capture the administrative part of health records, such as who created the entry,
when, how, and in which institution.
      </p>
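      <p>As a rough illustration of this event-centric view, the sketch below stores a few triples as plain Python tuples and follows sem:hasSubEvent links, which is the kind of traversal a SPARQL query over the Simple Event Model Ontology would express. It is a stand-in written without an RDF library, not the queries used in this work. The identifiers c01, p01 to p03, and pt01, the way they are linked, and the helpers objects and sub_event_chain are assumptions made for illustration.</p>
      <preformat>
```python
# Toy triples (subject, predicate, object). All identifiers and links
# are hypothetical and only illustrate the event/sub-event pattern.
TRIPLES = [
    ("c01", "rdf:type", "fhir:Condition"),
    ("c01", "sem:hasActor", "pt01"),
    ("pt01", "rdf:type", "fhir:Patient"),
    ("c01", "sem:hasSubEvent", "p01"),
    ("p01", "rdf:type", "fhir:Procedure"),
    ("p01", "sem:hasSubEvent", "p02"),
    ("p02", "rdf:type", "fhir:Procedure"),
    ("p02", "sem:hasSubEvent", "p03"),
    ("p03", "rdf:type", "fhir:Procedure"),
]


def objects(triples, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def sub_event_chain(triples, root):
    """Collect every event reachable from root via sem:hasSubEvent."""
    seen, stack, order = set(), [root], []
    while stack:
        event = stack.pop()
        if event in seen:
            continue
        seen.add(event)
        order.append(event)
        # Push in reverse so sub-events are visited in assertion order.
        stack.extend(reversed(objects(triples, event, "sem:hasSubEvent")))
    return order


chain = sub_event_chain(TRIPLES, "c01")
print(chain)
```
      </preformat>
      <p>Only the harmonizing predicates (sem:hasSubEvent, sem:hasActor) are needed to walk the patient's history here; the FHIR-specific structure stays out of the traversal.</p>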
      <p>
        Ontologies such as SNOMED CT and LOINC (also a part of UMLS) give identity to our nodes
in PHKGs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This identity denotes which nodes in PHKGs or external KGs refer to the same
real-world entity. Thanks to these ontologies, we can use subsumption relationships to align
node labels to the same, more general term and thus improve the ontological graph union
operation (see Section 3). In the clinical and self-care setting, the labels of corresponding
nodes would hardly ever be the same without a method of standardization and subsumption
based on the mentioned ontologies.
      </p>
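      <p>The alignment by subsumption can be sketched as follows. The is-a hierarchy and the codes prefixed with proc: are invented placeholders, not real SNOMED CT or LOINC content, and ancestors and generalize are hypothetical helpers. The sketch assumes an acyclic hierarchy with a single parent per code; in practice, the parent relations would come from the terminology itself.</p>
      <preformat>
```python
# Hypothetical is-a hierarchy: child code -> parent code.
# Placeholder codes only, not real SNOMED CT identifiers.
IS_A = {
    "proc:partial-colectomy": "proc:colectomy",
    "proc:total-colectomy": "proc:colectomy",
    "proc:colectomy": "proc:bowel-surgery",
    "proc:colonoscopy": "proc:endoscopy",
}


def ancestors(code, is_a=IS_A):
    """Return the code followed by its ancestors, most specific first."""
    chain = [code]
    while chain[-1] in is_a:
        chain.append(is_a[chain[-1]])
    return chain


def generalize(code, targets, is_a=IS_A):
    """Map a code to its first ancestor found in targets, else keep it."""
    for candidate in ancestors(code, is_a):
        if candidate in targets:
            return candidate
    return code


# Align node labels from different patients to shared, more general terms.
targets = {"proc:colectomy", "proc:endoscopy"}
labels = ["proc:partial-colectomy", "proc:total-colectomy", "proc:colonoscopy"]
aligned = [generalize(code, targets) for code in labels]
print(aligned)
```
      </preformat>
      <p>After this step, the two colectomy variants carry the same label, so a later union can treat them as one node.</p>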
      <p>
        A PHKG is thus represented as a directed edge-labeled graph that enables querying and
reasoning. However, most graph data analysis techniques do not apply to this representation.
Therefore, we need to transform it into an undirected or directed graph without edge labels (i.e.,
predicate names). Such a graph is obtained by optionally selecting a subgraph of
the data graph and dropping all of its edge labels. The proposed pipeline to populate a
PHKG from various data sources and use it for graph analysis is as follows:
1. Find or create an ontology for data harmonization: Our research suggests that the
mentioned Simple Event Model Ontology is a universal and useful upper-ontology that
can be extended by more specific models for various event types.
2. Choose a standard or format for transforming the harmonized data into a graph
dataset: As our chosen example of HL7 FHIR data was available in a tree-structured
format (JSON) and a tool to convert the data into the HL7 FHIR RDF format existed (see
Section 3), the choice was straightforward. For different categories of data, however, other
formats such as KGTK [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or property graphs may be easier to apply.
3. Assert relations between nodes/edges in the graph dataset and nodes (classes) / edges
(properties) from the ontology chosen for data harmonization (or even for provenance):
This means that more facts, stated as triples, are added to the graph dataset representing
the PHKG. Additional information about events remains available in the data from the various
sources in their original data model, such as HL7 FHIR RDF.
4. Create a subgraph from the source PHKG with the help of SPARQL based on the data
harmonization ontology.
5. Convert the symbolic representation of the subgraph into its numeric
representation for a directed graph (for example, into an adjacency matrix) for further
analysis and transformation: Make use of the subsumption provided by linked ontologies
(SNOMED CT and LOINC in our case) to unite node and/or edge labels. If possible, store
edge and node labels and other metadata within the (sub)graph structure. The numeric
representation is thus as follows:
• V is a vertex set, i.e., a set of nodes {a, b, c, d, ...}.
• A is a |V| × |V| adjacency matrix (assumed binary if there are no edge weights).
• X ∈ ℝ<sup>m×|V|</sup> is a matrix of node features, such as node identity (URI and the
SNOMED CT code) and the begin timestamp.
6. Apply graph algorithms, such as operators, distance measures, and shortest paths, to
one or more PHKGs in their numeric form. Even if the edge labels and other edge and
node metadata are dropped for further graph analysis, it is good practice to keep a
way to retrieve this information for the outputs of the graph analysis.
7. If needed, convert the result of the graph analysis back into the symbolic form to expand
the original PHKG with new knowledge.</p>
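      <p>Step 5 can be sketched in plain Python as below, assuming the edge labels have already been dropped. The function to_numeric, the node identifiers, the placeholder codes, and the dates are all hypothetical. The feature matrix here still holds strings; a real feature matrix over the reals would need a numeric encoding of such values.</p>
      <preformat>
```python
def to_numeric(edges, features):
    """Build (V, A, names, X) from directed edges and node feature dicts.

    edges    -- iterable of (source, target) pairs, edge labels dropped
    features -- dict: node -> dict of feature name -> value
    """
    # V: a sorted node list gives every node a stable row/column index.
    nodes = sorted({n for edge in edges for n in edge} | set(features))
    index = {n: i for i, n in enumerate(nodes)}

    # A: binary |V| x |V| adjacency matrix, A[i][j] = 1 for an edge i -> j.
    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    for source, target in edges:
        adjacency[index[source]][index[target]] = 1

    # X: m x |V| feature matrix, one row per feature, one column per node.
    names = sorted({k for f in features.values() for k in f})
    matrix = [[features.get(n, {}).get(k) for n in nodes] for k in names]
    return nodes, adjacency, names, matrix


# Hypothetical chain of three procedures with invented codes and dates.
EDGES = [("p01", "p02"), ("p02", "p03")]
FEATURES = {
    "p01": {"code": "proc:A", "begin": "2011-05-21"},
    "p02": {"code": "proc:B", "begin": "2011-06-02"},
    "p03": {"code": "proc:C", "begin": "2011-07-15"},
}
V, A, NAMES, X = to_numeric(EDGES, FEATURES)
for row in A:
    print(row)
```
      </preformat>
      <p>Keeping the node order V next to A and X is what makes it possible to map results of the numeric analysis back to the labeled nodes, as steps 6 and 7 require.</p>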
      <p>Choosing a strategy to convert a source PHKG into its numeric representation for
analysis is not trivial. It requires empirical validation and iterations of steps 5 and 6 to
develop new insights from the graph analysis. In addition, more study is needed to understand
the effects of such strategies more generally on the results of different analytical techniques.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The usage example</title>
      <p>
        We applied the methods and pipeline proposed in Section 2 to synthetic data generated
for the colorectal cancer diagnosis according to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The output data from this generator are in
the JSON format based on the HL7 FHIR standard. Datasets for individual patients
were converted to the HL7 FHIR RDF format using the FHIR JSON to RDF conversion utility
(https://github.com/BD2KOnFHIR/fhirtordf; other open-source implementations can be found on the HL7 FHIR webpage). We populated,
visualized, and analyzed the PHKGs for individual patients using the Python libraries RDFLib
(https://github.com/RDFLib/rdflib), Owlready2 with its PyMedTermino2 for easy access to domain ontologies, and NetworkX
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We needed to implement tools to switch between the different graph data structures of these
libraries to implement the whole pipeline for graph data analysis.
      </p>
      <p>[Figure 1: A PHKG (only a part is displayed). Recoverable labels: nodes pt01 (fhir:Patient), c01 (fhir:Condition), p01, p02, and p03 (fhir:Procedure), and o01 (fhir:Observation); properties rdf:type, rdfs:label, sem:hasActor, sem:eventType, sem:hasSubEvent, and sem:hasBeginTimeStamp; literals "malignant tumor of colon"@en and "partial resection of colon"@en; a link to DBpedia: Colorectal cancer.]</p>
      <p>Once we had the numeric representation of the PHKG subgraphs for our 325 synthetic patients,
obtained as shown in Figure 1, we developed an algorithm for an ontological graph union
to create a KG containing the different patient pathways from the point of diagnosis to the outcome.
This algorithm can contract nodes of the same type (in our case, nodes with the same SNOMED CT code of
the underlying procedure) across patient PHKGs while preserving all node features. In theory,
if the set of PHKGs were representative enough, this KG would cover all possible courses of
treatment and their outcomes. We analyzed the most connected nodes (procedures) using
degree centrality. They represent the key decision points in the patients’ treatment
or life events.</p>
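      <p>The idea of a union that contracts nodes sharing a code, followed by a degree centrality ranking, can be sketched as below. This is an illustrative reading of the description above, not the algorithm developed in this work. The functions ontological_union and degree_centrality, the placeholder codes, and the two toy patients are assumptions; the centrality uses the common normalization of dividing the degree (in plus out) by n - 1.</p>
      <preformat>
```python
from collections import defaultdict


def ontological_union(patient_graphs):
    """Merge patient graphs, contracting nodes that share the same code.

    patient_graphs -- list of (nodes, edges); nodes maps a node id to a
    feature dict containing "code", edges is a list of (source, target).
    Returns (features_by_code, edges_between_codes).
    """
    merged_features = defaultdict(list)
    merged_edges = set()
    for nodes, edges in patient_graphs:
        for node_id, feats in nodes.items():
            # Keep every original feature dict so nothing is lost.
            merged_features[feats["code"]].append(dict(feats, node=node_id))
        for source, target in edges:
            merged_edges.add((nodes[source]["code"], nodes[target]["code"]))
    return dict(merged_features), merged_edges


def degree_centrality(node_ids, edges):
    """Degree (in plus out) of each node divided by n - 1."""
    degree = {n: 0 for n in node_ids}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    scale = max(len(degree) - 1, 1)
    return {n: d / scale for n, d in degree.items()}


# Two hypothetical patients with invented placeholder codes.
patient_a = (
    {"a1": {"code": "code:diagnosis"},
     "a2": {"code": "code:surgery"},
     "a3": {"code": "code:chemo"}},
    [("a1", "a2"), ("a2", "a3")],
)
patient_b = (
    {"b1": {"code": "code:diagnosis"},
     "b2": {"code": "code:surgery"},
     "b3": {"code": "code:followup"}},
    [("b1", "b2"), ("b2", "b3")],
)
features, edges = ontological_union([patient_a, patient_b])
centrality = degree_centrality(features, edges)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```
      </preformat>
      <p>In this toy input, the surgery node becomes the most connected one because both pathways pass through it before diverging, which is the sense in which such nodes mark decision points.</p>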
      <p>It is also possible to analyze the shortest simple paths from the principal diagnosis to the
outcome, taking the edge weights into account. For our group of patients, the shortest simple path from the
diagnosis to the unfortunate event of death has length four because the diagnosis was made at a
very late stage of the disease.</p>
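      <p>A weighted shortest-path search of this kind can be sketched with Dijkstra's algorithm, shown here in plain Python rather than through a graph library. The graph, its node names, and its weights are invented for illustration and do not reproduce the patient data; shortest_path is a hypothetical helper that assumes non-negative weights, under which the path it returns is always simple.</p>
      <preformat>
```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra over weighted directed edges; returns (cost, path) or None.

    edges -- dict: node -> list of (neighbor, weight), weights non-negative
    """
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in edges.get(node, []):
            if neighbor not in settled:
                heapq.heappush(
                    queue, (cost + weight, neighbor, path + [neighbor])
                )
    return None


# Hypothetical pathway graph with invented weights.
GRAPH = {
    "diagnosis": [("surgery", 1), ("chemo", 4)],
    "surgery": [("chemo", 1), ("outcome", 5)],
    "chemo": [("outcome", 1)],
}
result = shortest_path(GRAPH, "diagnosis", "outcome")
print(result)
```
      </preformat>
      <p>What the weights stand for (elapsed time, number of patients on an edge, or something else) is a modeling choice that this sketch leaves open.</p>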
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>In our future work, we will further explore the best combination of ontologies, knowledge, and
data harmonization to create a universal pipeline for PKG population. In addition, as we can
see in Figure 1, the adjacency matrix is sparse (the sparsity equals 2/3 in this case), and this
property of the numeric representation holds in general. Therefore, a more efficient representation
in a continuous, low-dimensional vector space with the help of node and graph embedding
should be researched.</p>
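      <p>Sparsity, understood here as the fraction of zero entries in the adjacency matrix, can be computed as in the sketch below. The 3 × 3 matrix is a hypothetical example chosen so that the value is 2/3; it is not claimed to be the matrix in Figure 1. The edge list shown alongside is only the simplest compact alternative to a dense matrix and is not the embedding proposed as future work.</p>
      <preformat>
```python
def sparsity(adjacency):
    """Fraction of zero entries in a square adjacency matrix."""
    size = len(adjacency)
    zeros = sum(1 for row in adjacency for value in row if value == 0)
    return zeros / (size * size)


def to_edge_list(adjacency):
    """Keep only the non-zero entries as (row, column) pairs."""
    return [
        (i, j)
        for i, row in enumerate(adjacency)
        for j, value in enumerate(row)
        if value != 0
    ]


# Hypothetical 3 x 3 binary adjacency matrix with three edges.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(sparsity(A))
print(to_edge_list(A))
```
      </preformat>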
      <p>Finally, in our pipeline, we used a simpler representation of the partial PHKG
knowledge in which we could drop the edge labels: the nodes representing the events
(procedures) were connected only by the sub-event relationship, which captures the sequence of events in
time. However, the more properties the graph embedder encodes, the better the results that can be
obtained in downstream tasks. Therefore, we can also generate a more complex subgraph, with edge labels,
from our source PHKG.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2021</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/3447772</pub-id>
          . arXiv:
          <pub-id pub-id-type="arxiv">2003.02320</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekarpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thirunarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Personalized health knowledge graph</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>2317</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-h.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <article-title>Personal Health Knowledge Graph for Clinically Relevant Diet Recommendations</article-title>
          ,
          <source>Workshop on Personal Knowledge Graphs Co-located with the 3rd Automatic Knowledge Base Construction Conference (AKBC'21)</source>
          (
          <year>2021</year>
          ). URL: http://arxiv.org/abs/2110.10131. arXiv:
          <pub-id pub-id-type="arxiv">2110.10131</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Van Hage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Malaisé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Segers</surname>
          </string-name>
          , et al.,
          <article-title>Design and use of the Simple Event Model (SEM)</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>9</volume>
          (
          <year>2011</year>
          )
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.websem.2011.03.003</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gottschalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Demidova</surname>
          </string-name>
          ,
          <article-title>EventKG: A Multilingual Event-Centric Temporal Knowledge Graph</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>10843 LNCS</volume>
          (
          <year>2018</year>
          )
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-93417-4_18</pub-id>
          . arXiv:
          <pub-id pub-id-type="arxiv">1804.04526</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ivanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Budimac</surname>
          </string-name>
          ,
          <article-title>An overview of ontologies and data resources in medical domains</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>41</volume>
          (
          <year>2014</year>
          )
          <fpage>5158</fpage>
          -
          <lpage>5166</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.eswa.2014.02.045</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ilievski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chalupsky</surname>
          </string-name>
          , et al.,
          <article-title>KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>12507 LNCS</volume>
          (
          <year>2020</year>
          )
          <fpage>278</fpage>
          -
          <lpage>293</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-62466-8_18</pub-id>
          . arXiv:
          <pub-id pub-id-type="arxiv">2006.00088</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Walonoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nichols</surname>
          </string-name>
          , et al.,
          <article-title>Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>25</volume>
          (
          <year>2018</year>
          )
          <fpage>230</fpage>
          -
          <lpage>238</lpage>
          . URL: https://academic.oup.com/jamia/article/25/3/230/4098271. doi:
          <pub-id pub-id-type="doi">10.1093/jamia/ocx079</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Hagberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Schult</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Swart</surname>
          </string-name>
          ,
          <article-title>Exploring network structure, dynamics, and function using NetworkX</article-title>
          ,
          <source>in: 7th Python in Science Conference (SciPy 2008)</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>