<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-automatic Semantic Enrichment of Personal Data Streams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jean-Paul Calbimonte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabien Dubosson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilia Kebets</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre-Mikael Legris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Schumacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Systems, University of Applied Sciences and Arts Western Switzerland (HES-SO)</institution>
          ,
          <addr-line>Sierre</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pryv SA</institution>
          ,
          <addr-line>Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current information technologies allow people to acquire personal data related to their health, lifestyle, behavior, and activities, often using wearable and mobile devices. Personal data management technologies, ranging from personal clouds to self-storage solutions, have emerged recently in order to cope with the requirements of this type of data. Pryv.io is a comprehensive solution for managing this particularly sensitive type of data stream, focusing both on data privacy and decentralization. In this paper, we describe SemPryv, a system aiming at providing a semantization mechanism for enriching personal data streams with standardized specialized vocabularies. It relies on third-party providers of semantic concepts, and includes rule-based mechanisms for facilitating the semantization process. A full implementation of SemPryv has been produced, pluggable to the existing Pryv.io platform, showing the feasibility of the approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>The increasing amount of generated personal data allows for the
development of personalized applications in different domains, usually related to health,
lifestyle, or everyday activities. These often rely on different sources and
acquisition modalities, including wearable devices, sensors, domotic technologies, or
self-reporting methods. In this context, it is essential to provide data privacy
guarantees, in order to avoid unintended access or disclosure. The difficulty of
addressing this challenge is further increased by the streaming nature of many
of these datasets, which require infrastructure designed to manage high-volume
and high-velocity information flows.</p>
      <p>
        Pryv.io is a privacy-centric middleware, used as a robust data management
foundation to develop risk-controlled mHealth, eHealth, and InsurTech
applications with confidence and in compliance with IT and regulatory requirements. Pryv.io
is built on two key pillars: decentralization and privacy. Unlike traditional
centralized solutions, Pryv.io stores each data account separately and
independently, so that an account can even be deployed on its own server [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Furthermore,
data access can be delegated in a modular way, providing token-based
authorization, e.g. for tertiary use by a clinician. Data is organized as a hierarchy of
streams, each containing a series of events of different nature and type. Given
the large heterogeneity of data sources in these areas, and the velocity of the
data, it becomes essential to provide the means for automatically categorizing
them according to standard ontologies and vocabularies, especially in the health
domain. Given the diversity of potential personal data sources (e.g. from time
series of a smartwatch to health record annotations), the accurate semantization
of the data is a primary concern in order to provide an added value over the
collected information.
      </p>
      <p>
        This paper describes the SemPryv subsystem for stream data enrichment.
The goal of SemPryv is to provide semantization capabilities for the Pryv.io
middleware, such that it can automatically propose semantic concepts,
associated to the heterogeneous data streams managed by the platform. The data
semantization makes it possible to enhance the data model, which currently consists
of typed events. Associating high-level ontology concepts with the stream events
enables new types of search and discovery functionalities in the middleware,
which were not possible up to now. Also, it provides the means to link the Pryv
datasets with existing standards and models used widely for cataloguing data
in the health sector. In particular, the use of standards, such as HL7 FHIR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
makes it possible to export and share the Pryv.io data with other systems and
applications, as long as it is annotated with semantic vocabularies. The system
described in this paper focuses on both the service-oriented architecture of
SemPryv and its interaction with existing ontology providers such as BioPortal [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as well
as a dedicated UI that allows experts to confirm or choose from the semantic
suggestions offered by the module. The implementation of the system shows the
feasibility of our fully decentralized solution for semantization of personal data
streams, relying on the widely used HL7 FHIR standards for interoperability.
      </p>
    </sec>
    <sec id="sec-2">
      <title>SemPryv Architecture</title>
      <p>
        The Pryv.io middleware is used to manage large and diverse streams of data
coming from external platforms, wearable devices, and health record systems.
These streams are organized through identifiers and tags that are later used for
searching and querying. While it is technically possible to export and make the
Pryv datasets available to external applications in different formats, standards
such as HL7 FHIR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] impose the necessity of adding explicit semantic
annotations to the Pryv.io data streams, for instance using the SNOMED-CT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
vocabulary. While this semantization process could be carried out manually, doing
so is unrealistic and too time-consuming in practice. The SemPryv module
enables the addition of semantics to the datasets, in an automatic or
semi-automatic manner. Given the decentralized nature of Pryv.io, multiple instances can
be used to store and manage isolated data streams. SemPryv (available at
https://sempryv.ehealth.hevs.ch) is designed to act as a proxy for these instances,
forwarding requests to Pryv.io through its REST API (Figure 1). By passing an
authentication token and the domain within the request, SemPryv is able to access any
Pryv.io instance, and add the semantic annotation/suggestion features, as well as
the FHIR support.
      </p>
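      <p>As an illustration of the proxy behaviour described above, the following minimal Python sketch builds (without sending) the request that a proxy could forward to a Pryv.io instance. The function name, the per-account URL layout https://{username}.{domain}/ and the use of the Authorization header for the token are assumptions made for this example; they are not taken from the SemPryv implementation.</p>
      <preformat>
```python
from urllib.parse import urlencode
from urllib.request import Request


def build_forwarded_request(username, domain, token, path, params=None):
    """Build, without sending, the request a proxy would forward to Pryv.io.

    Assumptions (not from the paper): each account is reachable at
    https://{username}.{domain}/ and the access token is passed in the
    Authorization header.
    """
    query = "?" + urlencode(params) if params else ""
    url = f"https://{username}.{domain}/{path.lstrip('/')}{query}"
    return Request(url, headers={"Authorization": token}, method="GET")


# Hypothetical account, domain and token, used only for illustration.
req = build_forwarded_request("alice", "pryv.me", "example-token", "/events", {"limit": 20})
print(req.full_url)
print(req.get_header("Authorization"))
```
</preformat>
      <p>Because the token and the domain travel with each request, a proxy of this kind does not need to keep per-instance state, which fits the decentralized deployment model described above.</p>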
      <p>
        The architecture of SemPryv is depicted in Figure 2. SemPryv has two main
components: a back-end that exposes the core services as a REST API, and a UI
for end-users and experts. Besides the proxy capabilities mentioned before, the
SemPryv back-end can connect to a series of providers for semantic vocabularies
and ontologies. These may include existing APIs such as BioPortal [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or other
collections of relevant ontologies. SemPryv is able to query these providers in
order to suggest relevant ontology terms for a given Pryv stream, or hierarchy
of streams, which can then be validated, or confirmed by an expert through
the SemPryv Web UI. The Pryv.io metadata can be then updated according to
these suggestions and annotations. Additionally, the SemPryv back-end includes
endpoints dedicated for the import/export of HL7 FHIR-compliant data streams,
represented as bundle collections of observations.
      </p>
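      <p>To make the FHIR export step more concrete, the sketch below converts annotated numeric events into a FHIR bundle of type "collection" containing Observation resources. The event layout (streamId, type, content, time in epoch seconds), the function names, and the reuse of the unit part of the event type as a UCUM code are assumptions for this example; the actual SemPryv endpoints may map the data differently.</p>
      <preformat>
```python
import json
from datetime import datetime, timezone

SNOMED_SYSTEM = "http://snomed.info/sct"
UCUM_SYSTEM = "http://unitsofmeasure.org"


def event_to_observation(event, snomed_code, display):
    """Map one numeric event to a minimal FHIR Observation resource."""
    # Assumption: the event type looks like "mass/kg" and its second part
    # can be reused directly as a UCUM unit code.
    unit = event["type"].split("/")[1]
    when = datetime.fromtimestamp(event["time"], tz=timezone.utc)
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": SNOMED_SYSTEM, "code": snomed_code, "display": display}]},
        "effectiveDateTime": when.isoformat(),
        "valueQuantity": {"value": event["content"], "unit": unit, "system": UCUM_SYSTEM, "code": unit},
    }


def events_to_bundle(events, snomed_code, display):
    """Wrap the observations of one annotated stream in a collection bundle."""
    entries = [{"resource": event_to_observation(e, snomed_code, display)} for e in events]
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}


# Hypothetical body-weight event, annotated with the SNOMED-CT code from the text.
events = [{"streamId": "body-weight", "type": "mass/kg", "content": 72.5, "time": 1546300800}]
print(json.dumps(events_to_bundle(events, "27113001", "Body weight"), indent=2))
```
</preformat>
      <p>The semantic annotation of the stream supplies the coding of each Observation, which is why export in this form presupposes that the stream has been annotated.</p>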
    </sec>
    <sec id="sec-3">
      <title>SemPryv Suggestions &amp; Annotations</title>
      <p>The architecture presented previously
describes how the different components of
the system interact with each other.
Concerning the semantization process itself,
the SemPryv module is flexible enough
to adapt to different types of situations.</p>
      <p>For users and data integrators, the SemPryv UI (Figure 4) exposes the proposed
semantics, queried from the third-party providers (e.g. BioPortal). These
suggestions can then optionally be confirmed by an administrator before being
consolidated into the corresponding Pryv instance. This semi-automatic semantization
makes it possible to have full control over the type of semantics to be assigned.
As an example, a body-weight stream in
the Pryv.io middleware can be modeled as the weight of an individual according
to SNOMED-CT, codified as: SNOMED-CT:27113001. Note that multiple
annotations can be attached to a given stream, and that these annotations can be
inherited recursively by sub-streams and events inside its hierarchy.
Furthermore, other custom third-party ontology/vocabulary providers can be configured
in order to feed the system. In addition, SemPryv includes the possibility of
using predefined rules expressed in its knowledge graph. These rules can be
modified by administrators, and essentially allow the definition of close terms
from different ontologies. For instance, in the following example, the knowledge
graph matches Pryv temperature streams to a SNOMED-CT code identified as:
snomed-ct:386725007. The same is done for mass. The system
also allows these rules to be matched to certain stream paths, defined using regular
expressions.</p>
      <p>Fig. 4. SemPryv UI: suggested annotations obtained from the BioPortal provider: SNOMED-CT terms.</p>
      <preformat>"@graph": [{
  "@id": "pryv:temperature", "@type": "skos:Concept",
  "skos:notation": "note/txt",
  "skos:closeMatch": "snomed-ct:386725007" },
{
{</preformat>
      <p>Listing 1.1. Predefined rules mapping pryv concepts to SNOMED-CT.</p>
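      <p>The two mechanisms described in this section, rule-based suggestions and recursive inheritance of annotations, can be sketched in Python as follows. Only the temperature close match and the body-weight code come from the text; the path rule, the stream hierarchy, and all function and variable names are hypothetical and serve only to illustrate the idea.</p>
      <preformat>
```python
import re

# Close match taken from the rule shown in the text (temperature only).
CLOSE_MATCH = {"pryv:temperature": "snomed-ct:386725007"}

# Hypothetical path rules: a regular expression on the stream path,
# paired with the Pryv concept it should trigger.
PATH_RULES = [(re.compile(r"^/health/.*temperature"), "pryv:temperature")]

# Hypothetical stream hierarchy (child to parent) and confirmed annotations.
PARENT = {"health": None, "body-weight": "health", "body-weight-morning": "body-weight"}
ANNOTATIONS = {"body-weight": ["SNOMED-CT:27113001"]}


def suggest(stream_path):
    """Return the codes of every rule whose pattern matches the stream path."""
    return [
        CLOSE_MATCH[concept]
        for pattern, concept in PATH_RULES
        if pattern.search(stream_path) and concept in CLOSE_MATCH
    ]


def effective_annotations(stream_id):
    """Collect a stream's own annotations plus those inherited from ancestors."""
    found = []
    while stream_id is not None:
        found.extend(ANNOTATIONS.get(stream_id, []))
        stream_id = PARENT.get(stream_id)
    return found


print(suggest("/health/vitals/temperature"))
print(effective_annotations("body-weight-morning"))
```
</preformat>
      <p>In this sketch the suggestions are only candidates: in line with the semi-automatic process described above, they would still be confirmed by an administrator before being stored as annotations.</p>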
    </sec>
    <sec id="sec-4">
      <title>Discussion &amp; Future Work</title>
      <p>In this paper we have described SemPryv, a system that allows the semantic
enrichment of personal data streams, set up in a fully distributed environment.
The proposed approach has been fully implemented, comprising not only the
semantization but also (i) its integration with external providers such as
BioPortal, (ii) the implementation of an interoperability bridge through HL7 FHIR,
and (iii) a rule-based automated suggestion feature. The system is currently
deployed, showcasing the use of semantics in real-life scenarios and integrated
with a commercial solution. The SemPryv approach relies on two main
principles. First, on the reuse of consolidated vocabularies, ontologies, and taxonomies
that are standardized and widely used in the domains of application. This is the
case for well-known standards (e.g. SNOMED-CT, LOINC, UCUM, RxNorm)
which have been curated to enable interoperability among applications. Second,
SemPryv uses different but complementary approaches for proposing semantics
for a given dataset, depending on the available data, metadata, and previous
inferences. For the next iteration of the SemPryv module, we will further enhance
the rule-inferencing approach for establishing suggestions of semantic metadata,
in cases where a bootstrapping process is required. Once a critical mass of data
is acquired, SemPryv will rely on incremental machine learning techniques to
correlate previously annotated datasets with new incoming streams. The
prototype is publicly available, as referenced earlier, and in the future an Open Source
approach will be considered, given the potential reuse opportunities.</p>
      <p>Acknowledgements. Supported by the InnoSuisse grant 26547.1 (SemPryv project).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bender</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartipi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>HL7 FHIR: An agile and RESTful approach to healthcare information exchange</article-title>
          .
          <source>In: Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems</source>
          . pp.
          <fpage>326</fpage>
          -
          <lpage>331</lpage>
          . IEEE (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Donnelly</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>SNOMED-CT: The advanced terminology and coding system for eHealth</article-title>
          .
          <source>Studies in health technology and informatics 121</source>
          ,
          <issue>279</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goumaz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>White paper: Data in Pryv</article-title>
          (
          <year>2018</year>
          ), https://pryv.com/data_in_pryv/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Salvadores</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexander</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          :
          <article-title>BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF</article-title>
          .
          <source>Semantic web 4(3)</source>
          ,
          <fpage>277</fpage>
          -
          <lpage>284</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>