<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LDESTS: Enabling Efficient Storage and Querying of Large Volumes of Time Series Data on Solid Pods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tom Windels</string-name>
          <email>tom.windels@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wout Slabbinck</string-name>
          <email>wout.slabbinck@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Bonte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stijn Verstichel</string-name>
          <email>stijn.verstichel@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofie Van Hoecke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Femke Ongenae</string-name>
          <email>femke.ongenae@ugent.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, KU Leuven Campus Kulak</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDLab, Ghent University - imec</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>The Solid ecosystem provides a good foundation for the decentralised Web. However, the current document-based implementations of the Solid specification lack support for interacting efficiently with high volumes of time series data. Therefore, in this demo, we present a technique, called LDESTS, to store and query time series data across Solid pods more efficiently. LDESTS provides a data model for individual streams and enables constraints to be enforced on individual stream fragments, to achieve efficient data fragmentation and querying.</p>
      </abstract>
      <kwd-group>
        <kwd>Time series data</kwd>
        <kwd>Decentralized Web</kwd>
        <kwd>Solid</kwd>
        <kwd>Efficient data storage &amp; querying</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>which specific data.
fashion. Current implementations of the Solid specification mainly use the Linked Data Platform
(LDP) ontology<sup>2</sup> to mimic a UNIX-like data structure, where documents are organised in various
folders. It imposes no further restrictions on data structure, leaving room for flexibility [2].</p>
      <p>The current document-centric interpretation of Solid, wherein a pod is a hierarchy of Linked
Data documents, is not the most efficient solution for all types of data and services. With
the boom of the Internet of Things (IoT), time series data has become increasingly prevalent.
Examples include data gathered by wearable devices or by sensors found in a smart home. These
data streams can accumulate into large volumes of rapidly changing data over time, resulting in
the need to store this data efficiently. As this stream of time series data is stored in a remote
location, and most applications want to query over this data, the network traffic should be
kept to a minimum for these applications to work efficiently. Therefore, this demo presents a
technique, called LDESTS (Linked Data Event Streams for Time Series), to store and query
time series across Solid pods more efficiently.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Preliminaries</title>
      <p>Implementations of the Solid specification targeting data streams have recently been proposed,
namely Linked Data Event Streams (LDES) in LDP [3]. This technique is based on the LDES
specification<sup>3</sup>, which is an ontology for representing collections of immutable objects, like
sensor measurements. It fragments the stream across multiple documents, which in turn are
fragmented across multiple folders, based on a given metric, e.g. timestamps, and optionally
different object versions. By splitting up the stream into smaller documents, LDES allows clients
to limit the amount of data gathered. Research [4, 5] has been performed to decide the optimal
number of samples stored per document based on the velocity of the stream and application
requirements, i.e. the queries performed. To allow the resulting fragments to be queryable by time,
TREE hypermedia is used<sup>4</sup>. This ontology provides relations between the various fragments
and the stream, allowing query processors to selectively query the requested fragments.</p>
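      <p>The time-based selection described above can be sketched as follows. This is a minimal illustration and not the LDES or TREE API: the <monospace>Fragment</monospace> class, the <monospace>select_fragments</monospace> function and the pod URLs are hypothetical, and the interval bounds merely stand in for the information that TREE relations convey to a query processor.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for one fragment document and its time bounds."""
    url: str
    start: datetime  # inclusive lower bound of the samples in the fragment
    end: datetime    # exclusive upper bound of the samples in the fragment


def select_fragments(fragments, query_start, query_end):
    """Return the fragments whose interval overlaps [query_start, query_end)."""
    return [f for f in fragments if f.start < query_end and query_start < f.end]


# Three consecutive ten-minute fragments of one stream (made-up URLs).
fragments = [
    Fragment("https://pod.example/stream/0", datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 10)),
    Fragment("https://pod.example/stream/1", datetime(2023, 1, 1, 10, 10), datetime(2023, 1, 1, 10, 20)),
    Fragment("https://pod.example/stream/2", datetime(2023, 1, 1, 10, 20), datetime(2023, 1, 1, 10, 30)),
]

# A query for 10:12 to 10:18 only needs the middle fragment to be fetched.
selected = select_fragments(fragments, datetime(2023, 1, 1, 10, 12), datetime(2023, 1, 1, 10, 18))
print([f.url for f in selected])
```
]]></preformat>
      <p>In this sketch a client fetches one document instead of three, but it still has to download every sample in that document, even when it needs only a single one.</p>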
      <p>Even with these optimizations, LDES in LDP causes considerable overhead and network load
when storing and querying rapidly changing time series data, especially for data streams where the
difference between similar data points is small, e.g. a sensor in a user’s wearable that repeatedly
records the same type of measurement with similar properties, but at a different point in time. When
an application is only interested in a specific sample, it still needs to fetch a whole (possibly
large) document. Moreover, if the application requires a large range of samples, many fetches
across various documents are needed. Due to the potentially flexible nature of individual data
points, a query can only select its fragments based on the requested time,
through TREE relationships, and not on the properties of the data samples. The verbosity of
RDF adds further overhead and network load, as the samples are surrounded by
repeated metadata describing their semantics. LDESTS aims to solve these shortcomings.</p>
      <sec id="sec-3-1">
        <p><sup>2</sup> https://www.w3.org/TR/ldp/</p>
        <p><sup>3</sup> https://semiceu.github.io/LinkedDataEventStreams/</p>
        <p><sup>4</sup> https://treecg.github.io/specification/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. LDESTS overview</title>
      <p>Time series data typically observes a single metric, leaving all other attributes in the RDF
representation unchanged. For example, for two consecutive measurements of the same heart
rate sensor, only the heart rate value and timestamp change, while all the other metadata about
the observation, e.g. the associated sensor or location, remains the same. Other scenarios include
time series data streams that are generated by multiple sources, each measuring the same metric
and independently generating a similar repeating RDF structure for the sample’s properties, but
associating a different value with the sample’s source. An example would be the user’s smartphone
and smartwatch both obtaining accelerometer values. To represent the data more efficiently, LDESTS
avoids these superfluous repetitions of common properties by defining an optimized data model
per stream. The LDESTS data model uses the Shapes Constraint Language (SHACL)<sup>5</sup> ontology
to define the various properties of a data sample combined with their specific roles. There are three
possible roles a sample property can have:</p>
      <list list-type="bullet">
        <list-item>
          <p>Identifier property: Denotes the sample’s predicate containing the timestamp. For
example, in a sensor measurement using the Smart Applications Reference ontology
(SAREF)<sup>6</sup>, this would be saref:hasTimestamp. Exactly one such property is allowed
per data model.</p>
        </list-item>
        <list-item>
          <p>Constant property: Declares a property with which either a single constant value is
associated for every sample in the stream (the source property of the data, for example),
or a set of constant values that vary on a per-sample basis (the axis in a multiple-axes
measurement scenario, for example). The latter case can be used for further fragmentation
of the stream, improving the performance of property-specific queries.</p>
        </list-item>
        <list-item>
          <p>Variable property: Declares a value that differs per sample. This is often the metric
being observed in this time series data stream.</p>
        </list-item>
      </list>
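      <p>As a rough illustration of the three roles, the sketch below models a stream's data model in plain Python. It does not use SHACL, and it is not the LDESTS implementation: the <monospace>Role</monospace>, <monospace>Property</monospace> and <monospace>validate_model</monospace> names are hypothetical, and apart from saref:hasTimestamp the predicates and the <monospace>ex:</monospace> values are assumptions chosen for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    IDENTIFIER = "identifier"  # the timestamp predicate
    CONSTANT = "constant"      # one fixed value, or one value out of a fixed set
    VARIABLE = "variable"      # a value that differs per sample


@dataclass(frozen=True)
class Property:
    predicate: str
    role: Role
    values: tuple = ()  # allowed values, only meaningful for constant properties


def validate_model(properties):
    """Check the rules stated in the text: exactly one identifier property,
    and every constant property declares at least one value."""
    identifiers = [p for p in properties if p.role is Role.IDENTIFIER]
    if len(identifiers) != 1:
        raise ValueError("a data model needs exactly one identifier property")
    for p in properties:
        if p.role is Role.CONSTANT and not p.values:
            raise ValueError(f"constant property {p.predicate} declares no values")
    return True


# Illustrative model for a stream fed by two devices (assumed predicates and values).
model = [
    Property("saref:hasTimestamp", Role.IDENTIFIER),
    Property("saref:measurementMadeBy", Role.CONSTANT, ("ex:smartphone", "ex:smartwatch")),
    Property("saref:hasValue", Role.VARIABLE),
]
print(validate_model(model))
```
]]></preformat>
      <p>Declaring the set of allowed values for a constant property up front is what allows a stream to be fragmented further by that property.</p>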
      <p>Solid imposes few restrictions on how the contents of a user’s pod should be hierarchically
structured. Therefore, we propose a custom hierarchy that allows LDESTS to optimally exploit
the properties of the original data for efficiency gains. The components making up this hierarchy
consist of the individual fragments, stored as LDP resources adhering to stream-specific constraints,
and the LDP container housing these fragments. The LDP container’s metadata declares the
stream’s properties, e.g. the data model described above, and the various
fragments making up the stream. Fragments have their own set of restrictions, further refining
the stream’s data model to the exact type found in that fragment, and provide TREE relationship
information as well. A diagram depicting this flow can be seen in Figure 1. As the exact data
model per fragment is known prior to querying the actual data, a specialised query processor
can choose to ignore specific fragments based on the query directly. The fragments house the
set of data samples themselves by tightly formatting the samples’ properties in a string literal.
Constants differing per sample are represented by an index into the set of
values defined in the data model for that property, while variables are inserted directly.</p>
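      <p>The compact representation of samples can be sketched as below. The exact literal format used by LDESTS is not given in the text, so the separators (<monospace>;</monospace> and <monospace>|</monospace>), the <monospace>encode</monospace> and <monospace>decode</monospace> helpers and the <monospace>ex:</monospace> source values are assumptions made purely for illustration. The sketch only shows the principle: a per-sample constant is written as an index into the value set of the data model, while the timestamp and the variable are written directly.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical value set of a constant property, as declared in the data model.
SOURCES = ("ex:smartphone", "ex:smartwatch")


def encode(samples, sources=SOURCES):
    """Pack (timestamp, source, value) samples into one compact string literal."""
    return "|".join(f"{ts};{sources.index(src)};{val}" for ts, src, val in samples)


def decode(literal, sources=SOURCES):
    """Rebuild the samples, resolving each index against the value set."""
    if not literal:
        return []
    samples = []
    for record in literal.split("|"):
        ts, idx, val = record.split(";")
        samples.append((ts, sources[int(idx)], float(val)))
    return samples


samples = [
    ("2023-01-01T10:00:00Z", "ex:smartwatch", 71.0),
    ("2023-01-01T10:00:01Z", "ex:smartphone", 72.5),
]
literal = encode(samples)
print(literal)
print(decode(literal) == samples)
```
]]></preformat>
      <p>Compared with one set of RDF triples per sample, each record here carries only the values that actually change, which is where the storage savings in this sketch come from.</p>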
      <sec id="sec-4-1">
        <p><sup>5</sup> https://www.w3.org/TR/shacl/</p>
        <p><sup>6</sup> https://saref.etsi.org/core/v3.1.1/</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Demonstrator scenario and results</title>
      <p>LDESTS is demonstrated using a homecare use case, in which a patient is monitored
during their daily routines using various sensors, e.g. a wearable measuring acceleration and
heart rate, that generate continuous and vast streams of data. LDESTS is used to efficiently
store the data in the patient’s Solid pod. This pod can then be queried by parties authorized by
the patient, e.g. a care organization, to gain insight into the condition of the patient. For the
demonstration, the Data Analytics for Health and Connected Care (DAHCC) Open Dataset is
used [6], which provides sensor data collected from 30 persons in a real-world home setting,
already annotated as RDF using the SAREF ontology<sup>7</sup> to indicate, amongst other things, the source,
type of metric and timestamp of each measurement. The demo replays the data of
one selected patient in the DAHCC dataset, at the same rate as it was collected, and
stores it in a Solid pod using LDESTS. The data model of the created stream is provided with
multiple possible sources and the various axes of the sensor values as constant properties, the
timestamp as the identifier property and the sensor value itself as the only variable property.
The stream is fragmented according to the data source. With an input of 14.8 MB worth of triples,
representing measurements taken over the span of half an hour, the resulting stream only uses
378 kB distributed across the stream definition and 6 fragments, resulting in a 97% reduction.
The demo code (with video) is available at https://github.com/SolidLabResearch/LDESTS-demo.</p>
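      <p>The reported reduction can be checked with a short calculation. The sketch assumes decimal units (1 MB = 1,000,000 bytes and 1 kB = 1,000 bytes), which the text does not specify; the <monospace>reduction_percent</monospace> helper is introduced here only for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def reduction_percent(original_bytes, stored_bytes):
    """Percentage of storage saved relative to the original size."""
    return (1 - stored_bytes / original_bytes) * 100


original = 14.8 * 1000 * 1000  # 14.8 MB of input triples (decimal units assumed)
stored = 378 * 1000            # 378 kB for the stream definition and 6 fragments
reduction = reduction_percent(original, stored)
print(f"{reduction:.1f}% reduction")
```
]]></preformat>
      <p>Under these assumptions the stored stream is roughly 2.6% of the input size, which is consistent with the 97% reduction stated above.</p>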
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was partly funded by the Flemish Government under the “Onderzoeksprogramma
Artificiële Intelligentie (AI) Vlaanderen” programme, the SolidLab Vlaanderen project (Flemish
Government, EWI and RRF project VV023/10) and the FWO Project FRACTION (Nr. G086822N).</p>
      <p>[2] R. Dedecker, et al., What’s in a Pod? – a knowledge graph interpretation for the Solid
ecosystem, in: Proc. of the 6th QuWeDa Workshop at ISWC, 2022.</p>
      <p>[3] W. Slabbinck, et al., Linked data event streams in Solid LDP containers, in: Proc. of the 8th
MEPDaW Workshop at ISWC, 2022.</p>
      <p>[4] B. Van de Vyvere, et al., Publishing cultural heritage collections of Ghent with linked data
event streams, in: Proc. of MTSR, 2021.</p>
      <p>[5] H. Delva, et al., Geospatially partitioning public transit networks for open data publishing,
Journal of Web Engineering (2021).</p>
      <p>[6] B. Steenwinckel, et al., Data Analytics For Health and Connected Care: Ontology,
Knowledge Graph and Applications, in: Proc. of PervasiveHealth, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <article-title>Re-decentralizing the Web, for good this time</article-title>
          , in:
          <source>Linking the World's Information: A Collection of Essays on the Work of Sir Tim Berners-Lee</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>