=Paper= {{Paper |id=Vol-3235/paper8 |storemode=property |title= |pdfUrl=https://ceur-ws.org/Vol-3235/paper8.pdf |volume=Vol-3235 |authors=Dagmar Celuchova Bosanska,Michal Huptych,Lenka Lhotska |dblpUrl=https://dblp.org/rec/conf/i-semantics/BosanskaHL22 }} ==== https://ceur-ws.org/Vol-3235/paper8.pdf
A Pipeline for Population and Analysis of Personal
Health Knowledge Graphs (PHKGs)
Dagmar Celuchova Bosanska1,∗ , Michal Huptych2 and Lenka Lhotská1,2
1
    Faculty of Biomedical Engineering, Czech Technical University in Prague, Czech Republic
2
    Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague, Czech Republic


                                         Abstract
                                         Personal Health Knowledge Graphs (PHKGs) are not yet ubiquitous, even though they have a great
                                         potential to enrich general knowledge captured in various Knowledge Graphs by adding personal contexts.
                                         This poster paper presents work in progress about a pipeline for generating PHKGs from tree-structured
                                         Electronic Health Record (EHR) data by applying a hierarchical ontological approach. This pipeline
                                         could also be applied to other domains of Personal Knowledge Graphs. Moreover, this pipeline targets
                                         the intersection between the symbolic representation of knowledge used for computational semantics
                                         and numeric graph data representation used for graph analysis and machine learning. We present the
                                         first results from applying this pipeline to synthetic patient EHRs with the diagnosis of colorectal cancer
                                         (based on Synthea). The resulting numeric representation of PHKGs or their subgraphs can be used
                                         in many practical graph algorithms. Finally, our pipeline study uncovers future research on how this
                                         numeric representation of PHKGs should be embedded into continuous and low-dimensional vector
                                         space to utilize graph machine learning and deep learning methods.

                                         Keywords
                                         Personal Health Knowledge Graphs, Ontology, Graph algorithms, Machine Learning




1. Introduction
Personal health knowledge graphs (PHKGs) represent structured information about entities
related to a patient’s health and well-being, attributes, and relations between them. Unlike
Knowledge Graphs (KGs), PHKGs are not yet ubiquitous. PHKGs should be generated for
individual patients from numerous information sources such as electronic health records (EHRs),
wearables and mobile health apps, sensors, and patient annotated texts and notes related to the
patient’s condition. In principle, PHKGs can be populated by all techniques mentioned in [1] if
the information source contains personal data or data related to a patient and her health and
well-being. However, no agreed representation and population of PHKGs exists. For instance,
a KG for asthma can describe causes, symptoms, and treatments for asthma, and PHKG can
be the subgraph containing just those causes, symptoms, and treatments that apply to a given
patient [2]. Another point of view is that PHKGs can be used to add personal context to KGs
and to help develop a personalized diagnosis, recommendations, and treatments [3].


SEMANTICS 2022 EU: 18th International Conference on Semantic Systems, September 13-15, 2022, Vienna, Austria
∗
    Corresponding author.
Envelope-Open celucdag@fbmi.cvut.cz (D. C. Bosanska)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
   This poster paper presents a work in progress about a pipeline on how to populate a PHKG
from several data sources. First, the data must be harmonized according to a flexible and helpful
data model for data analysis. Then, we apply this approach to the PHKG population from data
stored in EHRs. In addition, ongoing work will provide evidence that this approach generalizes
to PKGs and other data sources, and enables linking PHKGs with KGs and generating new KGs.


2. How to represent a PHKG and set up a pipeline
The proposed representation of a PHKG is composed of two elements: a domain graph and
a mapping from the nodes and edge labels of the data graph to those of the domain graph in
which they are called entities and relation-types, respectively [1]. The domain graph defines
the schema of the PHKG – its high-level structure that can evolve more flexibly than a schema
for a relational model. Using a harmonizing data model – the Simple Event Model Ontology
[4] enables us to view EHR data in the HL7 FHIR RDF format as evolving chains of events
and sub-events in time. This ontology represents a schema that simplifies further analysis and
manipulation of PHKG graphs. Its practicality has been proven for Event-Centric Temporal
Knowledge Graph [5]. In our case of PHKGs, events are central elements in representing
a patient’s experience with a concrete disease. This experience includes visits (encounters),
reported complaints and symptoms (for example, in the form of observations), performed
procedures, prescribed medications, finalized diagnostic reports, and even applied care plans.
In terms of self-management, it can be, for example, exercising, self-measurements, or sleep
monitoring. This ontology allows us to write easy-to-understand SPARQL queries without a
more profound understanding of the domain ontologies such as HL7 FHIR RDF. The PROV-O
ontology can capture the administrative part of health records, such as who created the entry,
when, how, and in which institution.
   Ontologies such as SNOMED CT and LOINC (also a part of UMLS) give identity to our nodes
in PHKGs [6]. This identity denotes which nodes in PHKGs, or external KGs refer to the same
real-world entity. Thanks to these ontologies, we can use the subsumption relationships to align
the node labels to the same, more general term and thus improve the ontological graph union
operation (see Section 3). In the clinical and self-care setting, the labels of the corresponding
nodes would hardly be the same if there was no method of standardization and subsumption
with the help of mentioned ontologies in place.
   A PHKG is thus represented as a directed edge-labeled graph that enables querying and
reasoning. However, most graph data analysis techniques do not apply to this representation.
Therefore, we need to transform it into an undirected or directed graph without edge labels (i.e.,
predicate names). A directed graph is thus projected by optionally selecting a sub-graph from
the data graph from which all edge labels can be dropped. The proposed pipeline to populate a
PHKG from various data sources and use it for graph analysis is as follows:
   1. Find or create an ontology for data harmonization: Our research suggests that the
      mentioned Simple Event Model Ontology is a universal and useful upper-ontology that
      can be extended by more specific models for various event types.
   2. Choose a standard or format for harmonized data transformation into graph
      dataset: As our chosen example of HL7 FHIR data was available in a tree-structured
         format (JSON) and a tool to convert the data into HL7 FHIR RDF format existed (see
         Section 3), the choice was straightforward. But for different categories of data other
         formats such as KGTK [7] or property graphs may be easier to apply.
      3. Assert relations between nodes/edges in the graph dataset and nodes (classes) / edges
         (properties) from the ontology chosen for data harmonization (or even for provenance):
         It means that more facts stated as triples will be added to the graph dataset representing
         the PHKG. Additional information about events will be available in the data from various
         sources in their original data model, such as HL7 FHIR RDF.
      4. Create a subgraph from the source PHKG with the help of SPARQL based on the data
         harmonization ontology.
      5. Convert the symbolic representation of the subgraph into its numeric represen-
         tation for a directional graph (for example into an adjacency matrix) for further anal-
         ysis and transformation: Make use of the subsumption provided by linked ontologies
         (SNOMED CT and LOINC in our case) to unite node and/or edge labels. If possible, store
         edge and node labels and other meta data within the (sub)graph structure. The symbolic
         representation is thus as follows:
             • V is a vertex set - a set of nodes {a, b, c, d, ...}.
             • A is an |V| × |V| adjacency matrix (assume binary if there are no edge weights).
             • X ∈ Rm×|V| is a matrix of node features, such as node identity (URI and the SNOMED-
               CT code) and the begin timestamp.
      6. Apply graph algorithms, such as operators, distance measures, and shortest paths to
         one or more PHKGs in their numeric form. Even if the edge labels and other edge and
         node meta data are dropped for further graph analysis, it is a good practice to create a
         possibility to retrieve this information even for the outputs of the graph analysis.
      7. If needed, convert the result of the graph analysis back into the symbolic form to expand
         the original PHKG by new knowledge.
  This choice of a strategy to convert any source PHKG into its symbolic representation for
analysis is not trivial. It requires empirical validation and iterations of the steps 5 and 6 to
develop new insights from the graph analysis. In addition, more study is needed to understand
the effects of such strategies more generally on the results of different analytical techniques.


3. The usage example
We applied the methods and pipeline proposed in the Section 2 on synthetic data generated
for the colorectal cancer diagnosis according to [8]. The output data from this generator are in
the JSON format based on the HL7 FHIR standard. In addition, datasets for individual patients
were converted to the HL7 FHIR RDF format using the FHIR JSON to RDF conversion utility1
(on the HL7 FHIR webpage, other open source implementations can be found). We populated,
visualized, and analyzed the PHKGs for individual patients using Python libraries RDFlib2 ,

1
    https://github.com/BD2KOnFHIR/fhirtordf
2
    https://github.com/RDFLib/rdflib
         A PHKG (only a part is displayed)

                                                                      DBpedia:                   fhir:Observation
         fhir:Patient                    fhir:Condition           Colorectal cancer                     o01

                                                                                                           sem:hasSubEvent
                   rdf:type                        rdf:type      sem:eventType
                                                                                                              sem:                       sem:
         fhir:Patient                    fhir:Condition         sem:hasSubEvent          fhir:Procedure        has    fhir:Procedure      has   fhir:Procedure
             pt01                              c01                                             p01             Sub          p02           Sub          p03
                                sem:
                              hasActor                    sem:hasBeginTimeStamp                               Event                      Event
                                          rdfs:label                                    rdfs:label                 rdfs:label                 rdfs:label
                 „malignant tumor of                                                     „partial resection
                                              2011-05-21T05:14-53+02:00                                              „hospice care“@en          „hospice care“@en
                     colon“@en                                                             of colon“@en



         A PHKG subgraph
                 After dropping edge labels and contracting similar nodes in this PHKG subgraph:
                                                                                                              c01    p01        p02
                                                                                    weight:1
                                                                                                     c01              1                         0   1    0
          fhir:Condition                 fhir:Procedure            fhir:Procedure
                c01                            p01                       p02                         p01                        1               0   0    1

                                                                                                     p02                        1               0   0    1
       LEGEND:

                   Class                   Instance               Literal                                                                       adjacency
                                                                                                                                                 matrix A
       fhir: http://hl7.org/fhir/   sem: http://semanticweb.cs.vu.nl/2009/11/sem/


Figure 1: The method of data harmonization transformation into numeric form for graph analysis


Owlready2 with its PyMedTermino2 for easy access to domain ontologies3 and NetworkX
[9]. We needed to implement tools to switch between different graph data structures of these
libraries to implement the whole pipeline for graph data analysis.
   Once we had the numeric representation of PHKG subgraphs for our 325 synthetic patients,
performed according to the Figure 1, we developed an algorithm for an ontological graph union
to create a KG containing different patient pathways from the point of diagnosis to the outcome.
This algorithm can contract nodes of the same type (in our case, the same SNOMED CT code of
the underlying procedure) across patient PHKGs while preserving all node features. In theory,
if the set of PHKGs were representative enough, this KG would cover all possible ways of
treatment and their outcomes. We analyzed the most connected nodes - procedures with the
help of the degree centrality. They represent the key decision points in the patients’ treatment
or life events.
   It is also possible to analyze the shortest simple paths from the principal diagnosis to the
outcome, considering the weights. For our group of patients, the shortest simple path from the
diagnosis to the unfortunate event of death is length four because the diagnosis happened at a
very late stage of the disease.




3
    https://owlready2.readthedocs.io/en/v0.37/
4. Conclusions and future work
In our future work, we will further explore the best combination of ontologies, knowledge, and
data harmonization to create a universal pipeline for the PKG population. In addition, as we can
see in Figure 1, the adjacency matrix is sparse (the sparsity equals 2/3 in this case), and this fact
about the numeric representation holds in general. Therefore, a more efficient representation
in continuous and low-dimensional vector space with the help of node and graph embedding
should be researched.
   Finally, in our pipeline, we found a more straightforward representation of the partial PHKG
knowledge in which we could drop the edge labels (the nodes representing the events (proce-
dures) were connected only by a sub-event relationship to capture the sequence of events in
time). However, the more properties the graph embedder encodes, the better results can be
retrieved in later tasks. Therefore, we can generate a more complex subgraph from our source
PHKG with edge labels.


References
[1] A. Hogan, E. Blomqvist, M. Cochez, et al., Knowledge graphs, ACM Computing Surveys 54
    (2021). doi:1 0 . 1 1 4 5 / 3 4 4 7 7 7 2 . a r X i v : 2 0 0 3 . 0 2 3 2 0 .
[2] A. Gyrard, M. Gaur, S. Shekarpour, K. Thirunarayan, A. Sheth, Personalized health knowl-
    edge graph, in: CEUR Workshop Proceedings, volume 2317, 2018, pp. 1–6.
[3] O. Seneviratne, J. Harris, C.-h. Chen, D. L. McGuinness, Personal Health Knowledge Graph
    for Clinically Relevant Diet Recommendations, Workshop on Personal Knowledge Graphs
    Co-located with the 3rd Automatic Knowledge Base Construction Conference (AKBC’21)
    (2021). URL: http://arxiv.org/abs/2110.10131. a r X i v : 2 1 1 0 . 1 0 1 3 1 .
[4] W. R. Van Hage, V. Malaisé, R. Segers, et al., Design and use of the Simple Event Model
    (SEM), Journal of Web Semantics 9 (2011) 128–136. doi:1 0 . 1 0 1 6 / j . w e b s e m . 2 0 1 1 . 0 3 . 0 0 3 .
[5] S. Gottschalk, E. Demidova, EventKG: A Multilingual Event-Centric Temporal Knowledge
    Graph, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
    Intelligence and Lecture Notes in Bioinformatics) 10843 LNCS (2018) 272–287. doi:1 0 . 1 0 0 7 /
    978- 3- 319- 93417- 4_18. arXiv:1804.04526.
[6] M. Ivanović, Z. Budimac, An overview of ontologies and data resources in medical domains,
    Expert Systems with Applications 41 (2014) 5158–5166. doi:1 0 . 1 0 1 6 / j . e s w a . 2 0 1 4 . 0 2 . 0 4 5 .
[7] F. Ilievski, D. Garijo, H. Chalupsky, et al., KGTK: A Toolkit for Large Knowledge Graph
    Manipulation and Analysis, Lecture Notes in Computer Science (including subseries Lecture
    Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12507 LNCS (2020)
    278–293. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 0 3 0 - 6 2 4 6 6 - 8 _ 1 8 . a r X i v : 2 0 0 6 . 0 0 0 8 8 .
[8] J. Walonoski, M. Kramer, J. Nichols, et al., Synthea: An approach, method, and software
    mechanism for generating synthetic patients and the synthetic electronic health care record,
    Journal of the American Medical Informatics Association 25 (2018) 230–238. URL: https:
    //academic.oup.com/jamia/article/25/3/230/4098271. doi:1 0 . 1 0 9 3 / j a m i a / o c x 0 7 9 .
[9] A. A. Hagberg, D. A. Schult, P. J. Swart, Exploring network structure, dynamics, and
    function using NetworkX, 7th Python in Science Conference (SciPy 2008) - (2008) 11–15.