<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modelling and Analysing Dynamic Linked Data using RDF and SPARQL</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute AIFB, Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Analyses of dynamic Linked Data are inherently dependent on changes in RDF graphs (logical level) and what happens on the HTTP and networking level (physical level). However, these dependencies have been reflected in previous works only to a limited extent, which may lead to inaccurate conclusions about the dynamics of the data. To overcome this limitation, we tackle the problem of modelling dynamic Linked Data to capture changes both at the logical and physical level of Linked Data. We present our work in progress in this paper. We propose an RDF model of descriptions of both the HTTP requests/responses/networking errors made when downloading, and the RDF data thus obtained. The model allows for carrying out more comprehensive analyses of dynamic Linked Data in a declarative fashion. We present a processing pipeline to distil such modelled RDF data from datasets created using LDspider, such as the Dynamic Linked Data Observatory. We show the usefulness of our model by repeating three analyses of a previous paper on the Dynamic Linked Data Observatory using SPARQL queries, which we executed on data that follows our proposed model on three SPARQL engines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Linked Data is dynamic: data in Linked Data documents is updated,
and documents or entire servers go online or offline. These dynamics have been
acknowledged and analysed on different levels of abstraction, ranging from data
access and data syntax [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to data schema [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Yet, the interdependence of the
dynamics that can be observed on the different levels of abstraction is still
uncharted territory. For instance, previous analyses on higher levels of abstraction
did not reflect the dynamics caused by data access [
        <xref ref-type="bibr" rid="ref11 ref17 ref7 ref8">8, 11, 7, 17</xref>
        ]. We argue that
analyses of this interdependence have been neglected for two reasons: there is no
uniform way to access all the required information, and the necessity of such
analyses has been overlooked because previous analyses used code in imperative
languages, where the assumptions are hidden in the control flow of the code.
Motivating Example. In a previous paper, the number of overall documents
appearing in the Dynamic Linked Data Observatory dataset, which monitors a set
of 95’737 seed URIs, was reported to be 86’696 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. To be precise, this was the
number of information resources ever encountered in any of the 29 snapshots
available at that time and determined by counting the contexts of quads. When
considering, e. g., 171 snapshots, this number rises to about 106’105 documents,
which is more than the number of seed URIs. This mismatch suggests that we have to
rephrase our question and take the dereferencing process into account: If we instead
consider the number of seed URIs that dereferenced to information resources,
the number for 29 snapshots is 92’712. To put this number into context,
we note that for the first snapshot, 67’107 seed URIs were redirected, i. e. had
non-trivial dereferencing, and 77’958 could be dereferenced successfully.
      </p>
      <p>We want to lower the barrier of entry to analyses that take the
interdependence between data and data access into account by tackling the problem of
modelling dynamic Linked Data in a way that takes into account both the data
access and the data itself. Moreover, we want to make the analyses easier
to validate by using a declarative approach for the analyses.</p>
      <p>Data following the Linked Data principles has been published at a vast scale.
Such data is meant to reflect the state of things in the world, which is without a
doubt dynamic. Moreover, with the advent of Read-Write Linked Data –where
Linked Data is meant to be changed using HTTP requests– and the Web of
Things –where Linked Data provides access to ever-changing sensor values– we
expect a further increase of dynamic Linked Data.</p>
      <p>
        Knowledge about the dynamics of Linked Data is relevant e. g. in choosing
architectures for Linked Data applications (ranging from data warehousing to
live querying) and optimising Linked Data processing systems (e. g. indexing
strategies) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. To investigate the dynamics of Linked Data, Käfer et al. set
up the Dynamic Linked Data Observatory [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This observatory creates weekly
snapshots of data and meta data from RDF documents retrieved from the web.
Several works have investigated the dynamics of Linked Data with analyses
written in imperative languages that post-process the snapshots collected by
the observatory [
        <xref ref-type="bibr" rid="ref11 ref7 ref8">8, 7, 11</xref>
        ]. Analyses given imperatively may introduce assumptions that are not
transparent, which hinders the reproducibility and the validity of the
results. In contrast, declaratively modelled analyses make the analysis process
easier to understand and facilitate the validation of the analyses and the results.
      </p>
      <p>In this paper, we present ongoing work to model dynamic Linked Data using
RDF such that these interdependencies can be investigated more easily. The
proposed model enables declarative analyses of dynamic Linked Data using the
SPARQL query language. In summary, the main contributions of our work are
as follows:
– A data model that relies on RDF and captures the physical and logical
aspects of dynamic Linked Data
– A processing pipeline that exploits the information recorded in crawler logs
to generate relevant meta data about dynamic Linked Data
– A proof of concept of our proposed approach that includes five snapshots of
Linked Data represented with the proposed data model, and three SPARQL
queries to analyse the dynamics of Linked Data</p>
      <p>The paper is structured as follows: In Section 2, we introduce the standards
and practices around Linked Data, as relevant for our analyses. Moreover, we
hint at the technical details of the crawler employed, which shape our proof of
concept. In Section 3, we present the data model and a processing pipeline to
derive data according to the model. Subsequently, in Section 4, we show the
applicability of our approach by giving queries and results as proof of concept.
Based on our proof of concept, we discuss future refinements of the approach
in Section 5. Next, in Section 6, we present related work. Last, in Section 7, we
summarise and discuss our approach, and outline future directions for our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        In this section, we describe Linked Data with special focus on the features we
need for generating the RDF description of the time series that incorporates
both request meta data and the RDF payload from Linked Data. Moreover, we
describe the Dynamic Linked Data Observatory, a project that publishes a time
series of RDF data and log files. For the description of our extraction, we give
the necessary introduction to LDSpider, the Linked Data crawling software used
in the Dynamic Linked Data Observatory, particularly its output data.
Linked Data. Linked Data is a set of principles for publishing data on the
web [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The principles advocate the use of web standards for data on the web:
URIs as identifiers, HTTP-GET for data access, and RDF as data model.
URI. Uniform Resource Identifiers (URIs) are names for things on the web
in the form of character sequences [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. URIs start with a scheme, where the
scheme http denotes that HTTP-based communication with the resource may
be possible. In addition, URIs may refer to: (i) non-information resources that
denote abstract or physical things, e. g. a URI of the moon is
http://dbpedia.org/resource/Moon; (ii) information resources that denote the RDF documents
that describe the things, e. g. the URI of a document that describes the moon is
http://dbpedia.org/data/Moon.n3.
      </p>
      <p>
        HTTP. The Hypertext Transfer Protocol (HTTP) is a protocol for data
transfer on the web [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Data exchange is subdivided into requests and responses.
A request is sent to a URI, and the server responsible for that URI returns a response.
A request receives no response in the case of networking issues or server outages,
which are outside of the HTTP framework. Of the available request methods, Linked Data
uses the GET method for retrieving state. Sending a GET request to a URI
is also called dereferencing the URI. An HTTP response consists of (1) a status
line reporting the status of the request using a numeric status code and
a textual explanation, e. g. in the case of a successful request the status line
reads 200 OK, (2) optional headers with meta data about the response, and (3)
an optional body, which contains RDF data in the case of Linked Data. Status
codes are three-digit integers, where the first digit determines the status code
class. The HTTP specification distinguishes the following classes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
– 1xx (Informational): The request was received, processing continues
– 2xx (Successful): The request was received and can be successfully answered
– 3xx (Redirection): The request needs further client action for completion
– 4xx (Client Error): The request cannot be fulfilled due to a client error
– 5xx (Server Error): The request cannot be fulfilled due to a server error
The responses with a 3xx status code typically contain a Location header with a
URI to which the server redirects the client for the next request. On the Linked
Data web, redirects are used to distinguish two classes of URIs: those
referring to information resources and those referring to non-information
resources. Information resources are those resources that, when their URI is dereferenced,
return a successful response with a non-empty body (e. g. an RDF document about
the moon). Non-information resources are all other resources (e. g. the moon).
Previous analyses focussed on the mere data from the response bodies of
successful requests (after following redirects if applicable). With the modelling presented
in this paper, we want to enable analyses that also take into account unsuccessful
HTTP requests and the process of following redirects when dereferencing.
RDF, Triples, Quads. The Resource Description Framework (RDF) is a
graph-based data model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used in Linked Data. An RDF graph is a directed graph,
where labelled nodes are connected by labelled arcs. An RDF graph G is
a set of triples, where a triple t = (subject, predicate, object) satisfies
t ∈ (U ∪ B) × U × (U ∪ B ∪ L), with U denoting the set of all URIs, B
the set of all blank nodes, and L the set of all literals. A triple can be extended
with a fourth element called “context” to form a quad. In this paper, the term
quad refers to a triple plus the URI of the information resource from which the triple
has been obtained, as described in [20, § 3.5].
      </p>
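      <p>The status code classes listed above can be derived mechanically from the first digit of the status code. The following minimal sketch illustrates this; the function names and the choice to raise a ValueError for codes outside 1xx to 5xx are illustrative assumptions and not part of the HTTP specification or of our pipeline.</p>
      <preformat>
```python
# Sketch: map an HTTP status code to its class label ("2xx", "3xx", ...).
# Function names and the ValueError policy are illustrative assumptions.

STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the class label of a three-digit status code, e.g. 404 -> '4xx'."""
    first_digit = code // 100
    if first_digit not in STATUS_CLASSES:
        raise ValueError(f"not a standard HTTP status code: {code}")
    return f"{first_digit}xx"


def is_redirect(code: int) -> bool:
    """True if the response belongs to the Redirection class."""
    return status_class(code) == "3xx"


if __name__ == "__main__":
    for code in (200, 303, 404, 503):
        print(code, status_class(code), STATUS_CLASSES[code // 100])
```
      </preformat>
      <p>A client that dereferences a URI would follow the Location header as long as is_redirect holds for the response and stop at the first response of another class.</p>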
      <sec id="sec-2-1">
        <title>The Dynamic Linked Data Observatory</title>
        <p>
          The Dynamic Linked Data Observatory is a dataset containing a time series of Linked Data from the Web [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
The time series has been collected weekly since May 2012. For the composition
of the time series, a seed list of 95’737 URIs is dereferenced each week (following
redirects), and both the retrieved RDF data and meta data in the form of log
files are collected. While the Dynamic Linked Data Observatory also crawls from
the seed list, we neglect the crawling part in this paper. The data collection in
the Dynamic Linked Data Observatory is done using LDSpider. We use the data
from the Dynamic Linked Data Observatory to show the applicability of the
modelling/querying approach we propose in this work.
        </p>
        <p>
          LDSpider. LDSpider [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is a crawler for Linked Data. In the Dynamic Linked
Data Observatory, LDSpider is used to download the seed list following links
and to crawl from the seed list. LDSpider produces five types of output. The
first output is (1) Data, which contains the quads obtained from
dereferencing the seed URIs. The quads contain the triples from the download
and, in context position, the URI of the information resource from which the triple
was downloaded. Note that the context URI is not always the URI from the seed
list: If the request to the URI from the seed list was answered with a redirect
to another URI, LDSpider dereferences that URI as well. If the GET request to
the latter URI was answered successfully with RDF data, this is the data and
the context URI that go into the quad. Another relevant output of LDSpider is the
(2) HTTP Status log with information about the status codes in the responses
to the HTTP requests the crawler performed. Moreover, LDSpider outputs (3)
Redirects, which records pairs describing from which URI to which URI the
crawler has been redirected. In addition, LDSpider records the (4) HTTP Headers
from the HTTP responses. While this meta data is quite comprehensive when it
comes to information on the application layer (i. e. the responses received), it is not
sufficient to describe the results of all HTTP requests that have been performed.
Lastly, there is the (5) Standard Error output of the crawler, which contains e. g.
networking exceptions encountered during crawling. For the distillation of data
from the Dynamic Linked Data Observatory according to the model we propose
in this work, we have to consider the different outputs of LDSpider.
        </p>
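        <p>To illustrate why the context URI of a quad may differ from the seed URI, the following sketch reconstructs the chain of requests for one seed URI from redirect pairs and status codes. The dictionary-based inputs are an assumption made for illustration and do not reflect the actual file formats of LDSpider; the function names and the example status codes are likewise hypothetical.</p>
        <preformat>
```python
# Sketch: reconstruct the request chain of a seed URI from crawler logs.
# The dict-based log layout is an assumption for illustration; it is not
# LDSpider's actual output format.

def request_chain(seed, redirects, statuses, max_hops=10):
    """Follow redirect pairs from `seed`; return [(uri, status), ...] in order."""
    chain, seen, uri = [], set(), seed
    while uri is not None and uri not in seen and max_hops > len(chain):
        seen.add(uri)
        chain.append((uri, statuses.get(uri)))  # None: no response logged
        uri = redirects.get(uri)                # None: the chain ends here
    return chain


def context_uri(chain):
    """URI of the information resource if the last request succeeded, else None."""
    uri, status = chain[-1]
    return uri if status == 200 else None


if __name__ == "__main__":
    # Illustrative log content, using the example URIs from the text.
    redirects = {"http://dbpedia.org/resource/Moon": "http://dbpedia.org/data/Moon.n3"}
    statuses = {
        "http://dbpedia.org/resource/Moon": 303,
        "http://dbpedia.org/data/Moon.n3": 200,
    }
    chain = request_chain("http://dbpedia.org/resource/Moon", redirects, statuses)
    print(chain)
    print(context_uri(chain))
```
        </preformat>
        <p>The set of visited URIs and the hop limit guard against redirect loops; a seed URI without any logged status yields a chain whose only entry carries the status None, which is the case that the Standard Error output has to explain.</p>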
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Our Approach</title>
      <p>In our proposed solution, we first provide a model based on RDF to represent
information at both the physical and the logical level of dynamic Linked Data.
Then, we propose a processing pipeline to derive RDF data that follows our
model from current Linked Data monitoring approaches.</p>
      <sec id="sec-3-1">
        <title>Modelling Dynamic Linked Data in RDF</title>
        <p>Typically, to capture the dynamics of Linked Data, monitoring approaches
periodically dereference snapshots of data from a set of seed URIs. Therefore, we
propose a model to represent snapshots of Linked Data by capturing the entire
dereferencing process of data from a list of seed URIs. The proposed model then
allows for analysing changes among time series of Linked Data by comparing
the data recorded in each snapshot. The proposed model is depicted in
Figure 1. We use a UML class diagram with the following correspondence from
UML to RDFS: UML classes depict rdfs:Classes, and UML associations
depict the rdfs:domain and rdfs:range of an rdf:Property. We use list:member<sup>1</sup>
associations to state the rdfs:Class of the members of an rdf:List.</p>
        <p>In our model, every Linked Data snapshot is annotated with a timestamp.
The core concept in each snapshot is an observation associated with a seed URI.</p>
        <p>To represent the details about the physical level of dynamic Linked Data,
our model captures in an RDF list the HTTP requests made, following redirects, when
dereferencing a seed URI. In this way, the model preserves the order in which
the requests have been made. RDF lists are closed and ordered; both features
are desired for the requests, as the number of requests that have been made is
determined by what happened, and the order of the requests is determined by the
redirects.</p>
        <sec id="sec-3-1-1">
          <title>1 The prefix list meaning http://www.w3.org/2000/10/swap/list#</title>
          <p>[Figure 1 (diagram not reproduced): UML class diagram of the proposed model. Classes and datatypes shown: qb:Observation, :Snapshot, rdf:List, http:Request, http:Response, http:Method, rdf:Statement, rdfs:Resource, rdfs:Literal, xsd:date, xsd:anyURI, xsd:integer, xsd:hexBinary. Properties shown: :hasSnapshot, :hasRequestChain, :next, dc:created, :hasSeedURI, :hasLastResponse, list:member, http:requestURI, http:mthd, http:resp, http:statusCodeValue, http:body, crypto:md5, rdf:subject, rdf:predicate, rdf:object, :hasPLD.]</p>
          <p>
            Nonetheless, it is important to highlight that to traverse the elements
in RDF lists in queries it is necessary to use SPARQL property paths, which
may be rather expensive to evaluate depending on the assumed semantics [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. To
allow for querying the information resource of a seed URI, the model includes the
:hasLastResponse property. In this way, it is not necessary to traverse the entire
list of requests thus reducing the complexity of queries that only require data
about the information resource associated with a seed URI. For each request,
the model represents the obtained response which contains the status code and
the response body (if applicable).
          </p>
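          <p>The following sketch shows how an ordered request chain could be encoded as an RDF list and why reading its members requires one hop per element, which is what a property path has to do in a query. Triples are modelled as plain tuples; the blank node labels and the example request names are made up for illustration and are not produced by our pipeline.</p>
          <preformat>
```python
# Sketch: encode an ordered request chain as an RDF list (rdf:first/rdf:rest).
# Triples are plain tuples; blank-node labels and ex: names are illustrative.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_list(items, prefix="_:l"):
    """Return (head, triples) for a closed, ordered RDF list of `items`."""
    if not items:
        return RDF + "nil", []
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 != len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples


def list_members(head, triples):
    """Walk the list from `head`, following one rdf:rest link per element."""
    index = {(s, p): o for s, p, o in triples}
    members = []
    while head != RDF + "nil":
        members.append(index[(head, RDF + "first")])
        head = index[(head, RDF + "rest")]
    return members


if __name__ == "__main__":
    head, triples = rdf_list(["ex:request1", "ex:request2", "ex:request3"])
    print(list_members(head, triples))
```
          </preformat>
          <p>Reaching the last request this way takes as many hops as the list is long, whereas a direct link such as :hasLastResponse answers the same question with a single triple pattern.</p>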
          <p>The logical level of dynamic Linked Data corresponds to the RDF graphs
obtained in the body of a response. To associate these graphs with the
corresponding response, the proposed model represents the triples in the RDF graphs
as an RDF list of reified statements. In this context, reified statements allow
for referring to RDF graphs at different points in time. Another option would
have been to use a named graph for each information resource at each point in time.
Nonetheless, this option has several drawbacks. First, we would have to name
each named graph in FROM clauses of analysis queries. With over 210 snapshots
of about 95’737 URIs, we would have more than 20M FROM clauses in the queries,
which is rather lengthy. Second, we want to try different triple stores, whose treatment of
named graphs varies if they are supported at all (some treat the triples in the named
graphs as asserted in the default graph); we therefore want to rely on common features
independently of the triple store. In addition to the reified statements, the
model contains a hash of the full graph as binary using the crypto:md5
property<sup>2</sup>. This allows for efficiently detecting RDF graphs that change over time.</p>
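          <p>As a minimal sketch of the two ideas above, the following code reifies a triple into an rdf:Statement description and computes an order-independent MD5 hash over a graph. The serialisation used for hashing (sorted, space-separated lines) is an assumption for illustration, as is the restriction to graphs without blank nodes; the hash computation in our implementation may differ.</p>
          <preformat>
```python
# Sketch: reify triples and hash a graph so that changes can be detected.
# The line-based serialisation is an assumption; blank nodes are not handled.
import hashlib

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(triple, stmt_id):
    """Describe one triple as an rdf:Statement with subject/predicate/object."""
    s, p, o = triple
    return [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
    ]


def graph_md5(triples):
    """MD5 hex digest over the set of triples, independent of their order."""
    lines = sorted(" ".join(t) + " ." for t in set(triples))
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = ("ex:Moon", "ex:orbits", "ex:Earth")
    b = ("ex:Moon", "ex:label", '"Moon"')
    print(reify(a, "_:s0"))
    print(graph_md5([a, b]) == graph_md5([b, a]))
```
          </preformat>
          <p>Each reified triple is described by four triples, and placing the statement into an RDF list adds two more, which makes this representation considerably more verbose than the raw data. Comparing two hashes, in contrast, needs only one value per graph and snapshot.</p>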
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Processing Pipeline to Extract the Dynamics of Linked Data</title>
        <p>The processing pipeline handles the data produced by
Linked Data monitoring tools. In our approach, the pipeline consolidates the relevant
information about the dynamics of the crawled Linked Data, which is scattered across different
LDSpider outputs, into one RDF graph.</p>
        <p>An overview of the processing pipeline is presented in Figure 2. The proposed
pipeline includes two components: one to process the physical level of Linked
Data (meta data)<sup>3</sup>, and another to process the logical level of Linked Data
(RDF data)<sup>4</sup>. It is important to highlight that the pipeline generates URIs for the
requests to the information resources (which is the only information that appears
in both data and meta data) such that the output of the two components can be
correctly integrated using RDF merge. The merged data then follows the model
presented in Figure 1. The meta data processing code adds custom HTTP status
codes for URIs whose dereferencing yields non-HTTP errors such as networking
errors. Those custom HTTP status codes allow us to query both HTTP errors
and networking errors in a uniform fashion.</p>
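        <p>The idea of custom status codes can be sketched as follows. The concrete numbers and the error names in this example are hypothetical and chosen for illustration only; the codes actually assigned by the meta data processing code are not given here.</p>
        <preformat>
```python
# Sketch: give non-HTTP failures a status code so that one query covers both.
# The numbers (600 and up) and the error names are hypothetical placeholders,
# not the values used by the actual pipeline.
CUSTOM_CODES = {
    "UnknownHostException": 600,
    "ConnectTimeoutException": 601,
    "SocketTimeoutException": 602,
}
GENERIC_NETWORK_ERROR = 699


def effective_status(http_status, network_error):
    """Return the HTTP status if a response arrived, else a custom code."""
    if http_status is not None:
        return http_status
    return CUSTOM_CODES.get(network_error, GENERIC_NETWORK_ERROR)


if __name__ == "__main__":
    print(effective_status(200, None))
    print(effective_status(None, "UnknownHostException"))
    print(effective_status(None, "SomeOtherError"))
```
        </preformat>
        <p>With such a mapping, every request carries exactly one integer status, so a query that groups by status code needs no separate branch for requests that never received a response.</p>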
        <p>
          An important aspect when processing the logical level of Linked Data is the
correct handling of blank nodes. In the presence of blank nodes, the same RDF
graph may use different blank node identifiers to represent the same data. Hence,
in our approach we detect isomorphisms among RDF graphs from different
snapshots and replace the blank nodes by URIs using the hash-based skolemisation
approach described by Hogan in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This allows for checking whether a triple
with a blank node from the graph derived at one point in time is the same as a
triple with a blank node from another point in time using RDF URI equality.
        </p>
        <sec id="sec-3-2-1">
          <title>2 With the prefix crypto meaning http://www.w3.org/2000/10/swap/crypto#</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>3 http://github.com/kaefer3000/dyldo-http2rdf</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>4 http://github.com/kaefer3000/dyldo2qb</title>
          <p>[Figure 2 (diagram not reproduced): overview of the processing pipeline. Inputs shown: Data, HTTP Status Log, Redirects, HTTP Headers, Standard Error. Components shown: RDF data processing (“logical level”) and meta data processing (“physical level”). Outputs shown: data-reified.nt and request-information.ttl.]
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Applicability of Our Approach</title>
      <p>
        In this section, we repeat three analyses from the 2013 paper of Käfer et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
using SPARQL queries on the data distilled from the first five snapshots of
the Dynamic Linked Data Observatory. We first present the queries, numbered
according to the number of their corresponding figure in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Then, we report
on the run time of the query execution on different SPARQL engines.
      </p>
      <sec id="sec-4-1">
        <title>Queries</title>
        <p>
          Q1. The first analysis, depicted in Figure 1 of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and dubbed “appearances of
documents”, asked how many documents appeared in how many snapshots.
The aim of the analysis was to investigate the availability of Linked Data
documents. In the spirit of our motivating example, we changed the analysis to the
share of seed URIs that dereferenced successfully. In the query, the commented
line additionally checks for non-empty HTTP response bodies. We give a query
for this analysis in Figure 3. We used the query results to produce Figure 4.
        </p>
        <preformat>PREFIX : &lt;http://purl.org/dyldo/vocab#&gt;
PREFIX http: &lt;http://www.w3.org/2011/http#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT (COUNT(?seedURI) AS ?uris) ?NoSnapshots WHERE {
  {
    SELECT (COUNT(?snapshot1) AS ?NoSnapshots) ?seedURI WHERE {
      ?observation :hasSnapshot ?snapshot1 ;
                   :hasSeedURI ?seedURI ;
                   :hasLastResponse ?res .
      ?res http:statusCodeValue "200"^^xsd:integer .
      # ?res http:body ?body .
    } GROUP BY ?seedURI
  }
} GROUP BY ?NoSnapshots
</preformat>
        <p>
          Fig. 3. Query Q1, cf. Figure 1 from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          Q2. While the first analysis only looked at successful dereferencing, the second
analysis (Figure 2 of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]) looked more closely at the availability by investigating
all HTTP status codes returned in the dereferencing process. We give a query
for that analysis in Figure 5. The corresponding visualisation can be
found in Figure 6.
        </p>
        <preformat>PREFIX : &lt;http://purl.org/dyldo/vocab#&gt;
PREFIX dc: &lt;http://purl.org/dc/terms/&gt;
PREFIX http: &lt;http://www.w3.org/2011/http#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT ?snapshotDate
       (CONCAT(?scDigit1, "xx") AS ?statusClass)
       (COUNT(DISTINCT ?seedURI) AS ?seedUriCount)
WHERE {
  ?observation :hasSnapshot ?snapshot ;
               :hasSeedURI ?seedURI ;
               :hasLastResponse ?res .
  ?res http:statusCodeValue ?sc .
  # Assumed pattern: ?snapshotDate is not bound in the query as extracted;
  # dc:created is the snapshot timestamp property of the model.
  ?snapshot dc:created ?snapshotDate .
  BIND(SUBSTR(STR(?sc), 1, 1) AS ?scDigit1)
} GROUP BY ?scDigit1 ?snapshotDate
</preformat>
        <p>
          Fig. 5. Query Q2 to investigate the distribution of HTTP response classes, cf. Figure 2
from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Q5. To evaluate more challenging SPARQL queries, we last have a look at
Figure 5 of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This figure is the first in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to investigate change in the Linked
Data sources, i. e. it involves multiple snapshots in the query. The query
determines how many URIs had how many changes.
        </p>
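        <p>To make the intent of the change analysis in Q5 easier to follow, the following sketch expresses the same aggregation imperatively on toy data. It assumes that each snapshot is available as a mapping from seed URI to the hash of the graph retrieved with status code 200; this input layout and the function name are illustrative assumptions and not part of our implementation.</p>
        <preformat>
```python
# Sketch: count, per seed URI, how often the graph hash changed between
# consecutive snapshots, then count URIs per number of changes.
# The input layout (one dict per snapshot) is an illustrative assumption.
from collections import Counter


def change_histogram(hashes_by_snapshot):
    """Map number-of-changes to number-of-URIs.

    `hashes_by_snapshot` is a chronologically ordered list of dicts from seed
    URI to graph hash; URIs that did not return status 200 are absent.
    """
    changes = Counter()
    pairs = zip(hashes_by_snapshot, hashes_by_snapshot[1:])
    for earlier, later in pairs:
        for uri, old_hash in earlier.items():
            if uri in later and later[uri] != old_hash:
                changes[uri] += 1
    return Counter(changes.values())


if __name__ == "__main__":
    snapshots = [
        {"ex:a": "h1", "ex:b": "h1"},
        {"ex:a": "h2", "ex:b": "h1"},
        {"ex:a": "h3"},
    ]
    print(change_histogram(snapshots))
```
        </preformat>
        <p>As in a query that filters for differing hashes, seed URIs without any change do not appear in the result of this sketch; counting them would require an additional step that subtracts the changed URIs from all URIs observed.</p>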
      </sec>
      <sec id="sec-4-2">
        <title>Data Loading and Query Execution</title>
        <p>To showcase the applicability of our approach, we present loading and querying
times for the first five snapshots and the three queries.</p>
        <p>
          From the five snapshots, we obtained about 10M triples overall with
information about the physical level (including the hashes on the graphs) and 481M
triples overall with the reified data (derived from about 80M triples of raw data),
the hashes, and data about pay-level domains required for further analyses from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
As the snapshots are processed individually and because the model exploits
the fact that many sources do not change much over time, there is a high
number of duplicate triples between the data from different snapshots.
Therefore, the amount of data to be processed by the SPARQL query engines is considerably
lower than the reported triple numbers.
        </p>
        <p>We ran the experiments on a Debian 8 (jessie) 64 bit Linux system with
4 cores of an AMD Opteron 62xx with 2 GHz and 48 GB of RAM. We report the
times in Table 1. We report the loading times for the physical data (including
the hashes on the graphs), and the full data. Moreover, we briefly report on
the lessons learned while using different SPARQL query engines to evaluate our
approach:</p>
        <p>Virtuoso<sup>5</sup> (Version 7.20.3217) was not able to load the full data due to the
skew in the data introduced by the reified triples. We nevertheless include
Virtuoso in our measurements, as we may switch the representation of the logical
data in the future.</p>
        <p>Blazegraph<sup>6</sup> (Version 2.1.4) managed to load the entire data, but query
performance is low when using the automatic query optimiser for queries in which
different snapshots are compared. For instance, it took about a week to get
results for Q5. Optimising the query plan using Blazegraph’s hints allowed us to
significantly improve the performance, for Q5 down to 73 s.</p>
        <p>
          Linked Data-Fu<sup>7</sup> (Version 0.9.12) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] does not index the data but queries
RDF on the fly, so there are no loading times. Moreover, Linked Data-Fu
does not support aggregates, so we had to remove them from the queries and
implement them using AWK<sup>8</sup> scripts. For the query evaluation, we piped the
results from Linked Data-Fu through AWK 4.1.1 to get the envisioned results. We
report the overall run time in the table for processing the physical information
and hashes.
        </p>
        <sec id="sec-4-2-1">
          <title>5 http://virtuoso.openlinksw.com/</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>6 http://www.blazegraph.com/</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>7 http://linked-data-fu.github.io/</title>
        </sec>
        <sec id="sec-4-2-4">
          <title>8 http://www.gnu.org/software/gawk/</title>
          <preformat>PREFIX : &lt;http://purl.org/dyldo/vocab#&gt;
PREFIX crypto: &lt;http://www.w3.org/2000/10/swap/crypto#&gt;
PREFIX http: &lt;http://www.w3.org/2011/http#&gt;
PREFIX qb: &lt;http://purl.org/linked-data/cube#&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT ?changeCount (COUNT(?changeCount) AS ?uriCount) WHERE {
  SELECT (COUNT(?seedURI) AS ?changeCount) WHERE {
    # Assumed patterns: ?seedURI is not bound in the query as extracted, so
    # both observations are joined on the same seed URI here.
    ?observation1 a qb:Observation ;
                  :hasSnapshot ?snapshot ;
                  :hasSeedURI ?seedURI ;
                  :hasLastResponse ?resp1 .
    ?resp1 http:statusCodeValue "200"^^xsd:integer ;
           http:body [ crypto:md5 ?hash1 ] .
    ?observation2 a qb:Observation ;
                  :hasSnapshot ?snapshot2 ;
                  :hasSeedURI ?seedURI ;
                  :hasLastResponse ?resp2 .
    ?resp2 http:statusCodeValue "200"^^xsd:integer ;
           http:body [ crypto:md5 ?hash2 ] .
    ?snapshot :next ?snapshot2 .
    FILTER(STR(?hash1) != STR(?hash2))
  } GROUP BY ?seedURI
} GROUP BY ?changeCount
</preformat>
          <p>
            Fig. 7. Query Q5 to investigate the number of changes per number of URIs, cf. Figure 5
from [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
          <p>[Plot not reproduced: x-axis “No. of changes” (0 to 4), y-axis “Ratio of seed URIs” (0 to 0.8).]</p>
          <p>
Fig. 8. Number of changes per number of URIs. Cf. Figure 5 from [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], created using
the results from Q5 in Figure 7.
∗ Using optimisation hint. Otherwise, the execution time was about 1 week.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion of the Approach</title>
      <p>We observed that our particular workload poses challenges to indexes and query
optimisers. Optimising the queries by re-ordering or re-formulating parts of
the query can bring the processing time from days down to minutes. Another line
of optimisation is the data. First, we could reduce the overall data by
omitting triples that are depicted dotted in Figure 1 because they are the same
for all requests or triples. Second, we could use a less verbose modelling for the
triples that are depicted dashed in the figure: We use the RDF list to describe
the HTTP body because a list is closed, although the order of the statements does
not matter. As we want to use SPARQL for querying, where the closed-world
assumption is made, we can reduce the number of triples by introducing a blank
node or URI for the HTTP response body, and use triples with a predicate like
rdfs:member to connect the body to the statements in the RDF graph of the
response body.</p>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>In this section, we analyse current approaches for monitoring Linked Data.</p>
      <p>
        The Dynamic Linked Data Observatory [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is a framework that monitors
the dynamics of Linked Data. The Dynamic Linked Data Observatory periodically crawls
RDF documents available on the web and provides logs and raw
data about the crawling process. The data generated by the Dynamic Linked
Data Observatory is key for tracking changes in Linked Data sets. However, it
requires further processing in order to extract relevant information about the
dynamics of Linked Data sets. Therefore, in this work we propose an approach
that enriches the data generated by the Dynamic Linked Data Observatory. In
its raw form, the data from the Dynamic Linked Data Observatory has been
analysed by different scholars, mostly without taking networking aspects into
account [
        <xref ref-type="bibr" rid="ref11 ref15 ref17 ref7 ref8">8, 11, 7, 17, 15</xref>
        ].
      </p>
      <p>
        Other approaches have focused on monitoring other aspects of Linked Data
[
        <xref ref-type="bibr" rid="ref12 ref2 ref5">2, 12, 5</xref>
        ]. For example, LODStats [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an approach that collects statistics about
RDF datasets available on the web. LODStats provides declarative descriptions
of datasets using the LODStats Dataset Vocabulary (LDSO). LDSO extends the
Vocabulary of Interlinked Datasets<sup>9</sup> (VoID) and the Data Catalog Vocabulary<sup>10</sup>
(DCAT) to model metadata and statistical metrics about Linked Data sets.
Furthermore, the work by Hasnain et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] focuses on providing a catalogue
of SPARQL queries to compute statistics based on VoID descriptions of RDF
datasets available through SPARQL endpoints. SPARQLES [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in turn, focuses
on monitoring publicly available SPARQL endpoints. SPARQLES provides a set
of predefined queries to inspect the SPARQL feature support and
performance of endpoints. In contrast to our proposed approach, these related works focus
on reporting statistics about the current state of the datasets.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented an RDF model of dynamic Linked Data for
declaratively analysing dynamic Linked Data time series using SPARQL queries. We
provide an implementation to distil data according to the model from data
collected using LDSpider, such as the Dynamic Linked Data Observatory. We gave
three SPARQL queries for analysing dynamic Linked Data and evaluated them
over five snapshots from the Dynamic Linked Data Observatory using three
different SPARQL engines.</p>
      <p>For future work, we want to run more queries on more snapshots. As the data
emitted by our processing code does not have many selective predicates,
engineering the data and queries does not appear to be a trivial undertaking.</p>
      <sec id="sec-7-1">
        <title>Acknowledgements</title>
        <p>This work is partially supported by the German Federal Ministry of Education
and Research in AFAP, a Software Campus project (FKZ 01IS12051).</p>
        <sec id="sec-7-1-1">
          <title>Footnotes</title>
          <p>9 http://www.w3.org/TR/void/</p>
          <p>10 http://www.w3.org/TR/vocab-dcat/</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conca</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pérez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on World Wide Web (WWW)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LODStats - An Extensible Framework for High-Performance Dataset Analytics</article-title>
          .
          <source>In: Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data</article-title>
          .
          <source>Design Issues</source>
          (
          <year>2006</year>
          ). http://www.w3.org/DesignIssues/LinkedData.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fielding</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Masinter</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Uniform Resource Identifier (URI): Generic Syntax</article-title>
          .
          <source>Internet Standard. RFC 3986</source>
          .
          <publisher-name>IETF</publisher-name>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buil Aranda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vandenbussche</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SPARQL Web-Querying Infrastructure: Ready for Action?</article-title>
          <source>In: Proceedings of the 12th International Semantic Web Conference (ISWC)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lanthaler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , eds.:
          <source>RDF 1.1 Concepts and Abstract Syntax</source>
          . Recommendation, W3C (
          <year>2014</year>
          ). http://www.w3.org/TR/rdf11-concepts/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dividino</surname>
            ,
            <given-names>R. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date</article-title>
          .
          <source>In: Proceedings of the 14th International Semantic Web Conference (ISWC)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dividino</surname>
            ,
            <given-names>R. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gröner</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not?</article-title>
          <source>In: Proceedings of the Fourth International Workshop on Consuming Linked Data (COLD) at the 12th International Semantic Web Conference (ISWC)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fielding</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Reschke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , eds.:
          <article-title>Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (Proposed Standard)</article-title>
          .
          <source>IETF</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fielding</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Reschke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231 (Proposed Standard)</article-title>
          .
          <source>IETF</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Perplexity of Index Models over Evolving Linked Data</article-title>
          .
          <source>In: Proceedings of the 11th European Semantic Web Conference (ESWC)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sana e Zainab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPORTAL: Profiling the Content of Public SPARQL Endpoints</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Skolemising Blank Nodes while Preserving Isomorphism</article-title>
          .
          <source>In: Proceedings of the 24th International Conference on World Wide Web (WWW)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>LDSpider: An open-source crawling framework for the Web of Linked Data</article-title>
          .
          <source>In: Proceedings of Posters and Demos at the 9th International Semantic Web Conference (ISWC)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Käfer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdelrahman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Byrne</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Observing Linked Data Dynamics</article-title>
          .
          <source>In: Proceedings of the 10th European Semantic Web Conference (ESWC)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Käfer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Towards a Dynamic Linked Data Observatory</article-title>
          .
          <source>In: Proceedings of the 5th Workshop on Linked Data on the Web (LDOW) at the 21st International Conference on World Wide Web (WWW)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Nishioka</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Information-theoretic Analysis of Entity Dynamics on the Linked Open Data Cloud</article-title>
          .
          <source>In: Proceedings of the 3rd International Workshop on Dataset PROFIling and fEderated Search for Linked Data (PROFILES) at the 13th European Semantic Web Conference (ESWC)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stadtmüller</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speiser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Data-Fu: A Language and an Interpreter for Interaction with Read/Write Linked Data</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on World Wide Web (WWW)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Parreira</surname>
            ,
            <given-names>J. X.</given-names>
          </string-name>
          :
          <article-title>Hybrid SPARQL Queries: Fresh vs. Fast Results</article-title>
          .
          <source>In: Proceedings of the 11th International Semantic Web Conference (ISWC)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , ed.:
          <source>RDF 1.1: On Semantics of RDF Datasets</source>
          . Working Group Note, W3C (
          <year>2014</year>
          ). http://www.w3.org/TR/rdf11-datasets/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>