Modelling and Analysing Dynamic Linked Data using RDF and SPARQL

Tobias Käfer, Alexandra Wins, and Maribel Acosta
Institute AIFB, Karlsruhe Institute of Technology (KIT), Germany
{tobias.kaefer|maribel.acosta}@kit.edu, alexandra.wins@student.kit.edu

Abstract. Analyses of dynamic Linked Data inherently depend on changes in RDF graphs (the logical level) and on what happens at the HTTP and networking level (the physical level). However, these dependencies have been reflected in previous works only to a limited extent, which may lead to inaccurate conclusions about the dynamics of the data. To overcome this limitation, we tackle the problem of modelling dynamic Linked Data such that changes are captured both at the logical and the physical level of Linked Data. We present our work in progress in this paper. We propose an RDF model that describes both the HTTP requests, responses, and networking errors encountered when downloading, and the RDF data thus obtained. The model allows for carrying out more comprehensive analyses of dynamic Linked Data in a declarative fashion. We present a processing pipeline to distil such modelled RDF data from datasets created using LDSpider, such as the Dynamic Linked Data Observatory. We show the usefulness of our model by repeating three analyses of a previous paper on the Dynamic Linked Data Observatory using SPARQL queries, which we executed on data that follows our proposed model on three SPARQL engines.

1 Introduction

Linked Data is dynamic: we see data in Linked Data documents being updated, and documents, or entire servers, going on- or offline. These dynamics have been acknowledged and analysed at different levels of abstraction, ranging from data access and data syntax [15] to data schema [8]. Yet, the interdependence of the dynamics that can be observed at the different levels of abstraction is still uncharted territory. For instance, previous analyses at higher levels of abstraction did not reflect the dynamics caused by data access [8, 11, 7, 17]. We argue that analyses of this interdependence have been neglected because there is no uniform way to access all the required information, and because the necessity of such analyses has been overlooked: previous analyses used code in imperative languages, where the assumptions are hidden in the control flow of the code. For instance:

Motivating Example. In a previous paper, the number of overall documents appearing in the Dynamic Linked Data Observatory dataset, which monitors a set of 95'737 seed URIs, was reported to be 86'696 [15]. To be precise, this was the number of information resources ever encountered in any of the 29 snapshots available at that time, determined by counting the contexts of quads. When considering e.g. 171 snapshots, this number rises to about 106'105 documents, more than the number of seed URIs. This mismatch suggests that we may have to re-ask our question and take the dereferencing process into account: If we instead consider the number of seed URIs that dereferenced to information resources, the number for 29 snapshots is 92'712. To further put this number into context, we note that for the first snapshot, 67'107 seed URIs were redirected, i.e. had non-trivial dereferencing, and 77'958 could be dereferenced successfully.
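For illustration, the naive document count corresponds to counting the distinct contexts of the crawled quads. Loaded into a SPARQL engine as named graphs, such a count could be written as the following sketch (not the exact procedure used in [15]):

    SELECT (COUNT(DISTINCT ?doc) AS ?documents)
    WHERE {
      GRAPH ?doc {  # one named graph per information resource (quad context)
        ?s ?p ?o
      }
    }

The point of the example is that such a count silently conflates the availability of documents with the outcome of the dereferencing process, which the model proposed in this paper makes explicit.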
We want to lower the barrier of entry to analyses that take the interdependence between data and data access into account by tackling the problem of modelling dynamic Linked Data in a way that covers both the data access and the data itself. Moreover, we want to make the analyses easier to validate by using a declarative approach.

Data following the Linked Data principles has been published at a vast scale. Such data is meant to reflect the state of things in the world, which is without a doubt dynamic. Moreover, with the advent of Read-Write Linked Data (where Linked Data is meant to be changed using HTTP requests) and the Web of Things (where Linked Data provides access to ever-changing sensor values), we expect a further increase of dynamic Linked Data. Knowledge about the dynamics of Linked Data is relevant e.g. for choosing architectures for Linked Data applications (ranging from data warehousing to live querying) and for optimising Linked Data processing systems (e.g. indexing strategies) [19].

To investigate the dynamics of Linked Data, Käfer et al. set up the Dynamic Linked Data Observatory [16]. This observatory creates weekly snapshots of data and meta data from RDF documents retrieved from the web. Several works have investigated the dynamics of Linked Data with analyses written in imperative languages that post-process the snapshots collected by the observatory [8, 7, 11]. Such imperatively written analyses may introduce hidden assumptions, which hinder the reproducibility and the validity of the results. In contrast, declaratively modelled analyses improve the understandability of the analysis process and facilitate the validation of the analyses and the results.

In this paper, we present ongoing work to model dynamic Linked Data using RDF such that such interdependencies can be investigated more easily. The proposed model enables declarative analyses of dynamic Linked Data using the SPARQL query language. In summary, the main contributions of our work are as follows:

– A data model that relies on RDF and captures the physical and logical aspects of dynamic Linked Data
– A processing pipeline that exploits the information recorded in crawler logs to generate relevant meta data about dynamic Linked Data
– A proof of concept of our proposed approach that includes: five snapshots of Linked Data represented with the proposed data model, and three SPARQL queries to analyse the dynamics of Linked Data

The paper is structured as follows: In Section 2, we introduce the standards and practices around Linked Data, as relevant for our analyses. Moreover, we hint at the technical details of the crawler employed, which shape our proof of concept. In Section 3, we present the data model and a processing pipeline to derive data according to the model. Subsequently, in Section 4, we show the applicability of our approach by giving queries and results as proof of concept. Based on our proof of concept, we discuss future refinements of the approach in Section 5. Next, in Section 6, we present related work. Last, in Section 7, we summarise and discuss our approach, and outline future directions for our work.

2 Preliminaries

In this section, we describe Linked Data with special focus on the features we need for generating the RDF description of the time series that incorporates both request meta data and the RDF payload from Linked Data.
Moreover, we describe the Dynamic Linked Data Observatory, a project that publishes a time series of RDF data and log files. For the description of our extraction, we give the necessary introduction to LDSpider, the Linked Data crawling software used in the Dynamic Linked Data Observatory, particularly its output data.

Linked Data. Linked Data is a set of principles for publishing data on the web [3]. The principles advocate the use of web standards for data on the web: URIs as identifiers, HTTP GET for data access, and RDF as the data model.

URI. Uniform Resource Identifiers (URIs) are names for things on the web in the form of character sequences [4]. URIs start with a scheme, where the scheme http denotes that HTTP-based communication with the resource may be possible. URIs may refer to: (i) non-information resources, which denote abstract or physical things, e.g. a URI for the moon is http://dbpedia.org/resource/Moon; (ii) information resources, which denote the RDF documents that describe the things, e.g. the URI of a document that describes the moon is http://dbpedia.org/data/Moon.n3.

HTTP. The Hypertext Transfer Protocol (HTTP) is a protocol for data transfer on the web [9]. Data exchange is subdivided into requests and responses. Requests are sent to URIs, which return a response. Requests do not return in the case of networking issues and server outages, which are outside of the HTTP framework. There are a number of request methods, of which Linked Data uses the GET method for retrieving state. Sending a GET request to a URI is also called dereferencing the URI. An HTTP response consists of (1) a status line reporting the status of the request using a numeric status code and a textual explanation, e.g. in the case of a successful request the status line reads 200 OK, (2) optional headers with meta data about the response, and (3) an optional body, which contains RDF data in the case of Linked Data. Status codes are three-digit integers, where the first digit determines the status code class. The HTTP specification distinguishes the following classes [10]:

– 1xx (Informational): The request was received, processing continues
– 2xx (Successful): The request was received and can be successfully answered
– 3xx (Redirection): The request needs further client action for completion
– 4xx (Client Error): The request cannot be fulfilled due to a client error
– 5xx (Server Error): The request cannot be fulfilled due to a server error

Responses with a 3xx status code typically contain a Location header with a URI to which the server redirects the client for the next request. On the Linked Data web, redirects are used to distinguish the two classes of URIs: those referring to information resources and those referring to non-information resources. Information resources are those whose URI, when dereferenced, returns a successful response with a non-empty body (e.g. an RDF document about the moon). Non-information resources are all other resources (e.g. the moon). Previous analyses focussed on the mere data from the response bodies of successful requests (after following redirects if applicable). With the modelling presented in this paper, we want to enable analyses that also take into account unsuccessful HTTP requests and the process of following redirects when dereferencing.

RDF, Triples, Quads. The Resource Description Framework (RDF) is a graph-based data model [6] used in Linked Data.
An RDF graph is a directed graph, where labelled nodes are connected by labelled arcs. An RDF graph G is composed of a set of triples t = (subject, predicate, object) such that t ∈ (U ∪ B) × U × (U ∪ B ∪ L), where U denotes the set of all URIs, B the set of all blank nodes, and L the set of all literals. A triple can be extended with a fourth element called "context" to form a quad. In this paper, the term quad refers to a triple plus the URI of the information resource from which the triple has been obtained, as described in [20, § 3.5]. For example, a triple about the moon obtained from http://dbpedia.org/data/Moon.n3 carries this document URI in its context position.

The Dynamic Linked Data Observatory. The Dynamic Linked Data Observatory is a dataset containing a time series of Linked Data from the web [16]. The time series has been collected weekly since May 2012. For the composition of the time series, a seed list of 95'737 URIs is dereferenced each week (following redirects), and both the retrieved RDF data and meta data in the form of log files are collected. While the Dynamic Linked Data Observatory also crawls from the seed list, we neglect the crawling part in this paper. The data collection in the Dynamic Linked Data Observatory is done using LDSpider. We use the data from the Dynamic Linked Data Observatory to show the applicability of the modelling/querying approach we propose in this work.

LDSpider. LDSpider [14] is a crawler for Linked Data. In the Dynamic Linked Data Observatory, LDSpider is used to download the seed list and to crawl from the seed list following links. LDSpider produces five types of output. The first output is (1) Data, which contains the quads obtained from dereferencing the seed URIs. The quads contain the triples from the download and, in the context position, the URI of the information resource from which the triple was downloaded. Note that the context URI is not always the URI from the seed list: If the request to the URI from the seed list was answered with a redirect to another URI, LDSpider dereferences that URI as well. If the GET request to the latter URI was answered successfully with RDF data, these are the data and the context URI that go into the quad. Another relevant output of LDSpider is the (2) HTTP Status log with information about the status codes in the responses to the HTTP requests the crawler performed. Moreover, LDSpider outputs (3) Redirects, which records pairs describing from which URI to which URI the crawler has been redirected. In addition, LDSpider records the (4) HTTP Headers from the HTTP responses. While this meta data is quite comprehensive when it comes to information on the application layer (i.e. the responses received), it is not sufficient to describe the results of all HTTP requests that have been performed. Lastly, there is the (5) Standard Error output of the crawler, containing e.g. networking exceptions encountered during crawling. For the distillation of data from the Dynamic Linked Data Observatory according to the model we propose in this work, we have to consider these different outputs of LDSpider.

3 Our Approach

In our proposed solution, we first provide a model based on RDF to represent information both at the physical and the logical level of dynamic Linked Data. Then, we propose a processing pipeline to derive RDF data following our model from current Linked Data monitoring approaches.
3.1 Modelling Dynamic Linked Data in RDF

Typically, to capture the dynamics of Linked Data, monitoring approaches periodically dereference snapshots of data from a set of seed URIs. Therefore, we propose a model to represent snapshots of Linked Data by capturing the entire dereferencing process of data from a list of seed URIs. The proposed model then allows for analysing changes among time series of Linked Data by comparing the data recorded in each snapshot. The proposed model is depicted in Figure 1. We use a UML class diagram with the following correspondence from UML to RDFS: UML classes depict rdfs:Classes, and UML associations depict rdfs:domain and rdfs:range of an rdf:Property. We use list:member associations (with the prefix list: meaning http://www.w3.org/2000/10/swap/list#) to state the rdfs:Class of the members of an rdf:List.

In our model, every Linked Data snapshot is annotated with a timestamp. The core concept in each snapshot is an observation associated with a seed URI. To represent the details of the physical level of dynamic Linked Data, our model captures, in an RDF list, the HTTP requests made while following redirects when dereferencing a seed URI. In this way, the model preserves the order in which the requests have been made. RDF lists are closed and ordered; both features are desired for the requests, as the number of requests that have been made is determined by what happened, and the order of the requests is determined by the redirects.

[Figure 1: UML class diagram of the proposed model. A qb:Observation carries :hasSnapshot (to a :Snapshot with dc:created xsd:date and :next), :hasSeedURI, :hasRequestChain (an rdf:List whose list:members are http:Requests with http:requestURI, http:mthd, and http:resp), and :hasLastResponse. An http:Response carries http:statusCodeValue (xsd:integer) and http:body, pointing to an rdf:List of reified rdf:Statements (rdf:subject, rdf:predicate, rdf:object) with a crypto:md5 hash (xsd:hexBinary); resources additionally carry :hasPLD.]

Fig. 1. The proposed model to represent dynamic Linked Data. Each snapshot of Linked Data is annotated with a timestamp. Each observation records the process of dereferencing a seed URI. The RDF graph obtained from dereferencing each seed URI is represented as a list of reified statements. The dashed and dotted elements in the model are possible places for improvement.

Nonetheless, it is important to highlight that traversing the elements of RDF lists in queries requires SPARQL property paths, which may be rather expensive to evaluate depending on the assumed semantics [1]. To allow for querying the information resource of a seed URI directly, the model includes the :hasLastResponse property. In this way, it is not necessary to traverse the entire list of requests, which reduces the complexity of queries that only require data about the information resource associated with a seed URI. For each request, the model represents the obtained response, which contains the status code and the response body (if applicable).

The logical level of dynamic Linked Data corresponds to the RDF graphs obtained in the bodies of responses. To associate these graphs with the corresponding response, the proposed model represents the triples of an RDF graph as an RDF list of reified statements. In this context, reified statements allow for referring to RDF graphs at different points in time. Another option would have been named graphs for each information resource at each point in time. Nonetheless, this option has several drawbacks: First, we would have to name each named graph in the FROM clauses of analysis queries; with over 210 snapshots of about 95'737 URIs, we would have more than 20M FROM clauses in the queries, which is rather lengthy. Second, the treatment of named graphs varies between triple stores, if they are supported at all (some treat the triples in named graphs as asserted in the default graph); as we want to try different triple stores, we prefer to rely on features that are common across stores.

In addition to the reified statements, the model contains a hash of the full graph as binary using the crypto:md5 property (with the prefix crypto: meaning http://www.w3.org/2000/10/swap/crypto#). This allows for efficiently detecting RDF graphs that change over time.
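To make the model concrete, the following minimal sketch shows, as a SPARQL INSERT DATA request, what a single observation may look like: a seed URI that is redirected once and then answers with a small RDF document. All names under the ex: prefix are hypothetical, the namespace behind the model's own prefix : is assumed (it is not spelled out in this paper), and the hash value is merely illustrative.

    PREFIX :       <http://example.org/dyldo#>   # placeholder for the model's namespace
    PREFIX ex:     <http://example.org/>         # hypothetical instance URIs
    PREFIX dc:     <http://purl.org/dc/terms/>
    PREFIX http:   <http://www.w3.org/2011/http#>
    PREFIX qb:     <http://purl.org/linked-data/cube#>
    PREFIX crypto: <http://www.w3.org/2000/10/swap/crypto#>
    PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

    INSERT DATA {
      ex:snapshot1 dc:created "2012-05-06"^^xsd:date ;
                   :next ex:snapshot2 .

      ex:obs1 a qb:Observation ;
        :hasSnapshot ex:snapshot1 ;
        :hasSeedURI <http://dbpedia.org/resource/Moon> ;
        :hasRequestChain ( ex:req1 ex:req2 ) ;   # closed, ordered request chain
        :hasLastResponse ex:resp2 .

      ex:req1 a http:Request ;                   # request to the seed URI ...
        http:mthd <http://www.w3.org/2011/http-methods#GET> ;
        http:requestURI "http://dbpedia.org/resource/Moon"^^xsd:anyURI ;
        http:resp ex:resp1 .
      ex:resp1 a http:Response ;
        http:statusCodeValue 303 .               # ... answered with a redirect

      ex:req2 a http:Request ;                   # follow-up request ...
        http:mthd <http://www.w3.org/2011/http-methods#GET> ;
        http:requestURI "http://dbpedia.org/data/Moon.n3"^^xsd:anyURI ;
        http:resp ex:resp2 .
      ex:resp2 a http:Response ;
        http:statusCodeValue 200 ;               # ... answered with RDF data
        http:body ex:body2 .

      # the body: an RDF list of reified statements, hashed for change detection
      ex:body2 crypto:md5 "d41d8cd98f00b204e9800998ecf8427e"^^xsd:hexBinary ;  # illustrative value
        rdf:first [ a rdf:Statement ;
          rdf:subject   <http://dbpedia.org/resource/Moon> ;
          rdf:predicate rdfs:label ;
          rdf:object    "Moon"@en ] ;
        rdf:rest rdf:nil .
    }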
3.2 Processing Pipeline to Extract the Dynamics of Linked Data

The processing pipeline handles the data produced by the Linked Data monitoring tools. In our approach, the pipeline gathers the relevant information about the dynamics of the crawled Linked Data, which is scattered across the different LDSpider outputs, into one RDF graph. An overview of the processing pipeline is presented in Figure 2. The proposed pipeline includes two components: one to process the physical level of Linked Data (meta data, code at http://github.com/kaefer3000/dyldo-http2rdf), and another to process the logical level of Linked Data (RDF data, code at http://github.com/kaefer3000/dyldo2qb). It is important to highlight that the pipeline generates URIs for the requests to the information resources (the information resource being the only information that appears in both data and meta data) such that the outputs of the two components can be correctly integrated using RDF merge. The merged data then follows the model presented in Figure 1.

[Figure 2: data flow diagram. The Data output of LDSpider feeds the RDF data processing component, which emits data-reified.nt ("logical level"); the HTTP Status Log, Redirects, HTTP Headers, and Standard Error outputs feed the Meta data processing component, which emits request-information.ttl ("physical level").]

Fig. 2. The proposed processing pipeline. The input (on the left) is produced by LDSpider; the output (on the right) is RDF data using the proposed model in Figure 1.

The meta data processing code adds custom HTTP status codes for URIs whose dereferencing yields non-HTTP errors such as networking errors. Those custom status codes allow us to query both HTTP errors and networking errors in a uniform fashion (see the query sketch at the end of this subsection).

An important aspect when processing the logical level of Linked Data is the correct handling of blank nodes. In the presence of blank nodes, the same RDF graph may use different blank node identifiers to represent the same data. Hence, in our approach we detect isomorphisms among RDF graphs from different snapshots and replace the blank nodes by URIs using the hash-based skolemisation approach described by Hogan in [13]. This allows for checking whether a triple with a blank node from the graph derived at one point in time is the same as a triple with a blank node from another point in time using RDF URI equality.
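As an illustration of the uniform querying enabled by the custom status codes, the following sketch counts, per snapshot, the seed URIs whose dereferencing ended in an error of any kind. It assumes, for the sketch, that the custom codes for networking errors are chosen at or above 400, alongside the genuine 4xx and 5xx HTTP codes; the : prefix is again a placeholder.

    PREFIX :     <http://example.org/dyldo#>  # placeholder for the model's namespace
    PREFIX http: <http://www.w3.org/2011/http#>

    SELECT ?snapshot (COUNT(DISTINCT ?seedURI) AS ?failures)
    WHERE {
      ?observation :hasSnapshot ?snapshot ;
                   :hasSeedURI ?seedURI ;
                   :hasLastResponse ?res .
      ?res http:statusCodeValue ?sc .
      # 4xx and 5xx responses as well as the custom codes for
      # networking errors (assumed here to lie at or above 400)
      FILTER (?sc >= 400)
    }
    GROUP BY ?snapshot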
4 Applicability of Our Approach

In this section, we repeat three analyses from the 2013 paper of Käfer et al. [15] using SPARQL queries on the data distilled from the first five snapshots of the Dynamic Linked Data Observatory. We first present the queries, numbered according to their corresponding figure in [15]. Then, we report on the run time of the query execution on different SPARQL engines.

4.1 Queries

Q1. The first analysis, depicted in Figure 1 of [15] and dubbed "appearances of documents", asked how many documents appeared in which number of snapshots. The aim of the analysis was to investigate the availability of Linked Data documents. In the spirit of our motivating example, we changed the analysis to the share of seed URIs that dereferenced successfully. We give a query for this analysis in Figure 3; the commented line additionally checks for non-empty HTTP response bodies. We used the query results to produce Figure 4.

    PREFIX :     <http://example.org/dyldo#>  # placeholder; the model's namespace is not spelled out in this paper
    PREFIX http: <http://www.w3.org/2011/http#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

    SELECT (COUNT(?seedURI) AS ?uris) ?NoSnapshots
    WHERE {
      { SELECT (COUNT(?snapshot1) AS ?NoSnapshots) ?seedURI
        WHERE {
          ?observation :hasSnapshot ?snapshot1 ;
                       :hasSeedURI ?seedURI ;
                       :hasLastResponse ?res .
          ?res http:statusCodeValue "200"^^xsd:integer .
          # ?res http:body ?body .
        }
        GROUP BY ?seedURI }
    }
    GROUP BY ?NoSnapshots

Fig. 3. Query Q1 to investigate the appearance of documents, cf. Figure 1 of [15]. The commented line would additionally check for a non-empty HTTP body.

[Figure 4: bar chart; x-axis: No. of weekly snapshots (1–5); y-axis: Ratio of seed URIs (0–0.6).]

Fig. 4. The appearance of documents, cf. Figure 1 from [15], created using the results from Q1 in Figure 3.

Q2. While the first analysis only looked at successful dereferencing, the second analysis (Figure 2 of [15]) looked more closely at availability by investigating all HTTP status codes returned in the dereferencing process. We give a query for that analysis in Figure 5. The corresponding visualisation can be found in Figure 6.

    PREFIX :     <http://example.org/dyldo#>  # placeholder, as in Figure 3
    PREFIX dc:   <http://purl.org/dc/terms/>
    PREFIX http: <http://www.w3.org/2011/http#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

    SELECT ?snapshotDate
           (CONCAT(?scDigit1, "xx") AS ?statusClass)
           (COUNT(DISTINCT ?seedURI) AS ?seedUriCount)
    WHERE {
      ?observation :hasSnapshot ?snapshot ;
                   :hasSeedURI ?seedURI ;
                   :hasLastResponse ?res .
      ?snapshot dc:created ?snapshotDate .  # binds the snapshot's date
      ?res http:statusCodeValue ?sc .
      BIND(SUBSTR(STR(?sc), 1, 1) AS ?scDigit1)
    }
    GROUP BY ?scDigit1 ?snapshotDate

Fig. 5. Query Q2 to investigate the distribution of HTTP response classes, cf. Figure 2 from [15].

[Figure 6: stacked bar chart; x-axis: Week No. (1–5); y-axis: Ratio of responses (0–1); series: 2xx, 4xx, 5xx, other.]

Fig. 6. The distribution of HTTP response codes encountered when dereferencing the seed URIs, cf. Figure 2 from [15], created using the results from Q2 in Figure 5.

Q5. To evaluate more challenging SPARQL queries, we last have a look at Figure 5 of [15]. This figure is the first in [15] to investigate change in the Linked Data sources, i.e. it involves multiple snapshots in the query. The query, given in Figure 7, determines how many URIs had which number of changes; Figure 8 shows the corresponding visualisation.

    PREFIX :      <http://example.org/dyldo#>  # placeholder, as in Figure 3
    PREFIX crypto: <http://www.w3.org/2000/10/swap/crypto#>
    PREFIX http:  <http://www.w3.org/2011/http#>
    PREFIX qb:    <http://purl.org/linked-data/cube#>
    PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

    SELECT ?changeCount (COUNT(?changeCount) AS ?uriCount)
    WHERE {
      SELECT (COUNT(?seedURI) AS ?changeCount)
      WHERE {
        ?observation1 a qb:Observation ;
                      :hasSnapshot ?snapshot ;
                      :hasSeedURI ?seedURI ;     # binds the seed URI
                      :hasLastResponse ?resp1 .
        ?resp1 http:statusCodeValue "200"^^xsd:integer ;
               http:body [ crypto:md5 ?hash1 ] .
        ?observation2 a qb:Observation ;
                      :hasSnapshot ?snapshot2 ;
                      :hasSeedURI ?seedURI ;     # the same seed URI in the next snapshot
                      :hasLastResponse ?resp2 .
        ?resp2 http:statusCodeValue "200"^^xsd:integer ;
               http:body [ crypto:md5 ?hash2 ] .
        ?snapshot :next ?snapshot2 .
        FILTER(STR(?hash1) != STR(?hash2))
      }
      GROUP BY ?seedURI
    }
    GROUP BY ?changeCount

Fig. 7. Query Q5 to investigate the number of changes per number of URIs, cf. Figure 5 from [15].

[Figure 8: bar chart; x-axis: No. of changes (0–4); y-axis: Ratio of seed URIs (0–0.8).]

Fig. 8. Number of changes per number of URIs, cf. Figure 5 from [15], created using the results from Q5 in Figure 7.

4.2 Data Loading and Query Execution

To showcase the applicability of our approach, we present loading and querying times for the first five snapshots and the three queries. From the five snapshots, we obtained about 10M triples overall with information about the physical level (including the hashes on the graphs) and 481M triples overall with the reified data (derived from about 80M triples of raw data), the hashes, and data about pay-level domains required for further analyses from [15]. As the snapshots are processed individually, and because the model exploits the fact that many sources change little over time, there is a high number of duplicate triples between the data from the different snapshots. Therefore, the data to be processed by the SPARQL query engines is considerably smaller than the reported triple numbers.
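This deduplication could be quantified on the merged data with a sketch like the following: it compares the number of reified statement resources to the number of distinct subject/predicate/object combinations they reify. Blank nodes have been skolemised to URIs (Section 3.2), so STR() is defined for all matched terms.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT (COUNT(?st) AS ?statementResources)
           (COUNT(DISTINCT CONCAT(STR(?s), " ", STR(?p), " ", STR(?o))) AS ?distinctTriples)
    WHERE {
      ?st rdf:subject   ?s ;
          rdf:predicate ?p ;
          rdf:object    ?o .
    }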
We ran the experiments on a Debian 8 (jessie) 64-bit Linux system with 4 cores of an AMD Opteron 62xx with 2 GHz and 48 GB of RAM. We report the times in Table 1: the loading times for the physical data (including the hashes on the graphs) and for the full data, as well as the query times.

    Table 1. Times for loading and querying data.

                                           Loading            Querying
                                           physical   full    Q1      Q2      Q5
        Virtuoso 7.20.3217                 181 s      failed  4 s     18 s    14 s
        Blazegraph 2.1.4                   289 s      6 h     45 s    80 s    73 s*
        Linked Data-Fu 0.9.12 + AWK 4.1.1  n/a        n/a     97 s    112 s   178 s

        * Using optimisation hint. Otherwise, the execution time was about 1 week.

Moreover, we briefly report on the lessons learned while using different SPARQL query engines to evaluate our approach:

Virtuoso (http://virtuoso.openlinksw.com/, version 7.20.3217) was not able to load the full data due to the skews in the data introduced by the reified triples. We nevertheless include Virtuoso in our measurements, as we may switch the representation of the logical data in the future.

Blazegraph (http://www.blazegraph.com/, version 2.1.4) managed to load the entire data, but query performance is low when using the automatic query optimiser for queries in which different snapshots are compared; for instance, it took about a week to get results for Q5. Optimising the query plan using Blazegraph's hints allowed us to improve the performance significantly, for Q5 down to 73 s.

Linked Data-Fu (http://linked-data-fu.github.io/, version 0.9.12) [18] does not index the data but queries RDF on the fly, so there are no loading times. Moreover, Linked Data-Fu does not support aggregates, so we had to remove them from the queries and implement them using AWK (http://www.gnu.org/software/gawk/) scripts. For the query evaluation, we piped the results from Linked Data-Fu through AWK 4.1.1 to get the envisioned results. In the table, we report the overall run time for processing the physical information and hashes.

5 Discussion of the Approach

We observed that our particular workload poses challenges to indexes and query optimisers. Optimising the queries by re-ordering or re-formulating parts of a query can take the processing time from days down to minutes. Another line of optimisation concerns the data: First, we could reduce the overall data by omitting the triples that are depicted dotted in Figure 1, because they are the same for all requests or triples. Second, we could use a less verbose modelling for the triples that are depicted dashed in the figure: We use an RDF list to describe the HTTP body because it is terminated, yet the order of the statements does not matter. As we want to use SPARQL for querying, where effectively a closed-world assumption is made, we can reduce the number of triples by introducing a blank node or URI for the HTTP response body and using triples with a predicate like rdfs:member to connect the body to the statements in the RDF graph of the response body (see the sketch below).
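A sketch of this leaner representation, reusing the hypothetical instance URIs from the example in Section 3.1; the rdf:first/rdf:rest chain is replaced by one rdfs:member triple per statement:

    PREFIX ex:   <http://example.org/>  # hypothetical instance URIs
    PREFIX http: <http://www.w3.org/2011/http#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    INSERT DATA {
      ex:resp2 http:body ex:body2 .
      # one triple per reified statement, no list structure:
      ex:body2 rdfs:member ex:st1 , ex:st2 .
    }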
6 Related Work

In this section, we analyse current approaches for monitoring Linked Data. The Dynamic Linked Data Observatory [16] is a framework that monitors the dynamics of Linked Data. It periodically crawls RDF documents available on the web and provides logs and raw data about the crawling process. The data generated by the Dynamic Linked Data Observatory is key for tracking changes in Linked Data sets; however, it requires further processing in order to extract relevant information about the dynamics of Linked Data sets. Therefore, in this work we propose an approach that enriches the data generated by the Dynamic Linked Data Observatory. In its raw form, the data from the Dynamic Linked Data Observatory has been analysed by different scholars, mostly without taking networking aspects into account [8, 11, 7, 17, 15].

Other approaches have focused on monitoring other aspects of Linked Data [2, 12, 5]. For example, LODStats [2] is an approach that collects statistics about RDF datasets available on the web. LODStats provides declarative descriptions of datasets using the LODStats Dataset Vocabulary (LDSO). LDSO extends the Vocabulary of Interlinked Datasets (VoID, http://www.w3.org/TR/void/) and the Data Catalog Vocabulary (DCAT, http://www.w3.org/TR/vocab-dcat/) to model meta data and statistical metrics about Linked Data sets. Furthermore, the work by Hasnain et al. [12] focuses on providing a catalogue of SPARQL queries to compute statistics based on VoID descriptions of RDF datasets available through SPARQL endpoints. SPARQLES [5], in turn, focuses on monitoring publicly available SPARQL endpoints. SPARQLES provides a set of predefined queries to inspect the support of SPARQL features and the performance of endpoints. In contrast to our proposed approach, these related works focus on reporting statistics about the current state of the datasets.

7 Conclusion and Future Work

In this paper, we presented an RDF model of dynamic Linked Data for declaratively analysing dynamic Linked Data time series using SPARQL queries. We provide an implementation to distil data according to the model from data collected using LDSpider, such as the Dynamic Linked Data Observatory. We gave three SPARQL queries for analysing dynamic Linked Data and evaluated them over five snapshots from the Dynamic Linked Data Observatory using three different SPARQL engines. For future work, we want to run more queries on more snapshots. As the data that is emitted from our processing code does not have many selective predicates, the engineering of data and queries seems not to be a trivial undertaking.

Acknowledgements. This work is partially supported by the German Federal Ministry of Education and Research in AFAP, a Software Campus project (FKZ 01IS12051).

Bibliography
1. Arenas, M., Conca, S., and Pérez, J.: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In: Proceedings of the 21st International Conference on World Wide Web (WWW) (2012)
2. Auer, S., Demter, J., Martin, M., and Lehmann, J.: LODStats - An Extensible Framework for High-Performance Dataset Analytics. In: Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW) (2012)
3. Berners-Lee, T.: Linked Data. Design Issues (2006). http://www.w3.org/DesignIssues/LinkedData.html
4. Berners-Lee, T., Fielding, R., and Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. Internet Standard. RFC 3986. IETF (2005)
5. Buil Aranda, C., Hogan, A., Umbrich, J., and Vandenbussche, P.: SPARQL Web-Querying Infrastructure: Ready for Action? In: Proceedings of the 12th International Semantic Web Conference (ISWC) (2013)
6. Cyganiak, R., Wood, D., and Lanthaler, M., eds.: RDF 1.1 Concepts and Abstract Syntax. Recommendation, W3C (2014). http://www.w3.org/TR/rdf11-concepts/
7. Dividino, R. Q., Gottron, T., and Scherp, A.: Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. In: Proceedings of the 14th International Semantic Web Conference (ISWC) (2015)
8. Dividino, R. Q., Scherp, A., Gröner, G., and Gottron, T.: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? In: Proceedings of the Fourth International Workshop on Consuming Linked Data (COLD) at the 12th International Semantic Web Conference (ISWC) (2013)
9. Fielding, R. and Reschke, J., eds.: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (Proposed Standard). IETF (2014)
10. Fielding, R. and Reschke, J., eds.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231 (Proposed Standard). IETF (2014)
11. Gottron, T. and Gottron, C.: Perplexity of Index Models over Evolving Linked Data. In: Proceedings of the 11th European Semantic Web Conference (ESWC) (2014)
12. Hasnain, A., Mehmood, Q., Sana e Zainab, S., and Hogan, A.: SPORTAL: Profiling the Content of Public SPARQL Endpoints. International Journal on Semantic Web and Information Systems 12(3) (2016)
13. Hogan, A.: Skolemising Blank Nodes while Preserving Isomorphism. In: Proceedings of the 24th International Conference on World Wide Web (WWW) (2015)
14. Isele, R., Umbrich, J., Bizer, C., and Harth, A.: LDSpider: An open-source crawling framework for the Web of Linked Data. In: Proceedings of Posters and Demos at the 9th International Semantic Web Conference (ISWC) (2010)
15. Käfer, T., Abdelrahman, A., Umbrich, J., O'Byrne, P., and Hogan, A.: Observing Linked Data Dynamics. In: Proceedings of the 10th European Semantic Web Conference (ESWC) (2013)
16. Käfer, T., Umbrich, J., Hogan, A., and Polleres, A.: Towards a Dynamic Linked Data Observatory. In: Proceedings of the 5th Workshop on Linked Data on the Web (LDOW) at the 21st International Conference on World Wide Web (WWW) (2012)
17. Nishioka, C. and Scherp, A.: Information-theoretic Analysis of Entity Dynamics on the Linked Open Data Cloud. In: Proceedings of the 3rd International Workshop on Dataset PROFIling and fEderated Search for Linked Data (PROFILES) at the 13th European Semantic Web Conference (ESWC) (2016)
18. Stadtmüller, S., Speiser, S., Harth, A., and Studer, R.: Data-Fu: A Language and an Interpreter for Interaction with Read/Write Linked Data. In: Proceedings of the 22nd International Conference on World Wide Web (WWW) (2013)
19. Umbrich, J., Karnstedt, M., Hogan, A., and Parreira, J. X.: Hybrid SPARQL Queries: Fresh vs. Fast Results. In: Proceedings of the 11th International Semantic Web Conference (ISWC) (2012)
20. Zimmermann, A., ed.: RDF 1.1: On Semantics of RDF Datasets. Working Group Note, W3C (2014). http://www.w3.org/TR/rdf11-datasets/