Towards a Dynamic Linked Data Observatory

Tobias Käfer (Karlsruhe Institute of Technology, Germany)
Jürgen Umbrich (Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland)
Aidan Hogan (Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland)
Axel Polleres (Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria)

Acknowledgements. This work has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).
LDOW2012, April 16, 2012, Lyon, France. Copyright held by the author(s)/owner(s).

ABSTRACT

We describe work-in-progress on the design and methodology of the Dynamic Linked Data Observatory: a framework to monitor Linked Data over an extended period of time. The core goal of our work is to collect frequent, continuous snapshots of a subset of the Web of Data that is interesting for further study and experimentation, with an aim to capture raw data about the dynamics of Linked Data. The resulting corpora will be made openly and continuously available to the Linked Data research community. Herein, we (1) motivate the importance of such a corpus; (2) outline some of the use-cases and requirements for the resulting snapshots; (3) discuss different "views" of the Web of Data that affect how we define a sample to monitor; (4) detail how we select the scope of the monitoring experiment through sampling; and (5) discuss the final design of the monitoring framework that will gather regular snapshots of (subsets of) the Web of Data over the coming months and years.

1. INTRODUCTION

Linked Data enjoys continued momentum in terms of publishing, research and development; as a result, the Web of Data continues to expand in size, scope and diversity. However, we see a niche in terms of understanding how the Web of Data evolves and changes over time. Establishing a more fine-grained understanding of the nature of Linked Data dynamics is of core importance to publishing, research and development. With regards to publishing, a better understanding of Linked Data dynamics would, for example, inform the design of tools for keeping published data consistent with changes in external data (e.g., [26]). With regards to research, more granular data on Linked Data dynamics would open up new paths for exploration, e.g., the design of hybrid/live query approaches [30] that know when a query relates to dynamic data, and that retrieve fresh results directly from the source. With regards to development, for current systems, the results of such studies would inform crawling strategies, local index refresh rates and update strategies, cache invalidation techniques and tuning, and so forth.
Towards a better understanding of Linked Data dynamics, the community currently lacks a high-quality, broad, granular corpus of raw data that provides frequent snapshots of Linked Data documents over a sustained period of time. Current results in the area rely on domain-specific datasets (e.g., Popitsch and Haslhofer [26] focus on DBpedia changes [2]), or infrequently updated snapshots of open-domain data over short monitoring time-spans (e.g., our previous work looked at 20 weekly snapshots collected in 2008 [28]). Thus, we believe that a first, important step is to build from scratch a new monitoring framework to derive a bespoke, continuously updated collection of snapshots. This collection can then be freely used to study not only the high-level dynamics of entities, but also to distil the fundamental underlying patterns of changes in the Web of Data across different domains (e.g., studying if certain graph patterns are more dynamic than others [29]).

Clearly the design of such a framework, and gathering the requirements for the resulting collection, is non-trivial: many factors and actors have to be taken into consideration in the context of the open Web and the Linked Data community. Conceptually, we want the collection to be:

general-purpose: suitable to study for a wide range of interested parties;
broad: capturing a wide selection of Linked Data domains;
substantial: the number of documents monitored should allow for deriving confident statistical measures;
granular & frequent: offering detailed data on sources;
contiguous: allowing comparison of sources over time; and
adaptive: able to discover the arrival of new sources, and to monitor more dynamic sources more frequently.

However, some of these targets are antagonistic and demand a trade-off. Monitoring a substantial number of sources in a granular & frequent fashion requires a practical compromise to be made. Similarly, contiguous data and adaptive monitoring are conflicting aims. Furthermore, the aims to be general-purpose and broad need more concrete consideration in terms of what sources are monitored.

For implementing the monitoring framework, other practical considerations include politeness, such that remote servers are not unnecessarily overburdened; stability, such that the monitoring experiment can function even in the case of hardware failure; and resource overhead, such that the computation can be run on a single, dedicated commodity machine.

Taking these design criteria into account, herein we motivate and initially propose a framework for taking frequent, continuous snapshots of a subset of Linked Data, which we call DyLDO: the Dynamic Linked Data Observatory. We currently focus on defining the size and scope of the monitoring experiment, discussing the rationale behind our choice of sources to observe, touching upon various issues relating to sampling the Web of Data. Later, we also sketch the framework itself, outlining the crawling to be performed for each snapshot, providing rationale for different parameters used in the crawl, as well as proposing an adaptive filtering scheme for monitoring more dynamic sources more frequently. Our primary goal here is to inform the community of our intentions, outline our rationale, collect use-cases and potential consumers of the snapshot collection, and to gather feedback and requirements prior to starting the monitoring experiments. In particular, we currently focus on defining the scope of the monitoring experiment.
To begin, we motivate our work in terms of envisaged use-cases and research questions our corpus could help with (§ 2), subsequently presenting some related work (§ 3). Next, in order to understand what we are monitoring/sampling – to ascertain its borders – we ask the question What is the Web of Data?, and compare two prominent "views" thereof: (1) the Billion Triple Challenge dataset view, and (2) the CKAN/LOD cloud view (§ 4). Thereafter, we describe the sampling methodology we have used to derive a "seed list" of URIs that form the core of the monitoring experiment (§ 5). Finally, we outline the proposed monitoring framework, detailing setup parameters and adaptive extensions (§ 6). We conclude with future directions and a call for feedback and potential use-cases from the community (§ 7).

2. USE CASES AND OPEN QUESTIONS

We now discuss the potential benefit and impact of our proposed observatory for Linked Data based on (1) some envisaged use-cases, and (2) some open research questions that our data could help to empirically investigate.

In previous work [31], we gave an overview of Web and Linked Data dynamics, presenting four community use cases that require technical solutions to deal with dynamic data. We first extend these four example scenarios to motivate our ongoing work on the Dynamic Linked Data Observatory.

UC-1: Synchronisation.
Synchronisation addresses the problem of keeping an offline sample/replica of the Web of Data up-to-date.
The most common scenario is the maintenance of locally cached LOD indexes. To the best of our knowledge, none of the popular Semantic Web caches (such as those hosted by Sindice and OpenLink) have investigated index-update strategies based on the current knowledge of the dynamicity of Linked Data, but rather rely on publisher-reported information about update frequencies where available (e.g., sitemaps: http://sitemaps.org/).

UC-2: Smart Caching.
This use-case tries to find efficient methods to optimise systems operating live over Web data by minimising network traffic wasted on unnecessary HTTP lookups.
Versus synchronisation, this use-case targets systems that operate live and directly over the Web of Data. An exemplary use-case would be the integration of smart caching for live querying (e.g., [18]) or live browsing (e.g., [1]) over Linked Data, avoiding re-accessing a document or dereferencing a URI if it is unlikely to have changed according to knowledge of dynamicity patterns.

UC-3: Hybrid Architectures.
A large index of Linked Data can implement a hybrid architecture based on dynamicity statistics, where one processing pipeline is optimised for static knowledge, and another for dynamic knowledge.
In various Linked Data search and query engines, there is an inherent trade-off between running computation live during query-time or pre-computing answers offline. Abstractly, pre-computation suits static data and runtime computation suits dynamic data. Example trade-offs include dynamic insertions vs. batch loading, lightweight indexing vs. heavyweight indexing, runtime joins vs. live joins, backward-chaining reasoning vs. forward-chaining reasoning, window-based stream operators vs. global operators, etc. Knowledge of dynamicity can help decide which methods are appropriate for which data. Furthermore, smart hybrid queries may become possible: consider the query "Give me (current) temperatures of European capitals", where knowledge of dynamicity would reveal that temperatures are dynamic and should be fetched live, whereas European capitals are static and can be run (efficiently) over the local index.
UC-4: External-Link Maintenance.
The link maintenance use-case addresses the challenge of preserving referential integrity and the correct type of links in the presence of dynamic external data.
Popitsch and Haslhofer [26] investigate this use case, which involves monitoring external Web datasets for changes to help ensure the integrity of links targeting them from local data. They propose DSNotify: a solution and an evaluation framework which is able to replay changes in a dataset; however, the authors only have knowledge of DBpedia dynamics to leverage. Such works help ensure the quality of links between datasets, and we hope that our corpora will help extend their application to the broader Web of Data.

UC-5: Vocabulary Evolution and Versioning.
Knowledge of the dynamics of Linked Data vocabularies could lead to better versioning methods.
Changes in the semantics of terms in Linked Data vocabularies can have a dramatic influence on the interpretation of remote datasets using them. For example, the FOAF vocabulary is extremely widely used on the Web of Data [12, 20], but often resorts to informal versioning mechanisms: aside from the monotonic addition of terms, for example, FOAF recently removed disjointness constraints between popular classes like foaf:Person and foaf:Document, made foaf:logo inverse-functional, etc., changes that may alter inferencing over (or even break) external data. Studying and understanding the dynamicity of vocabularies may motivate and suggest better methodologies for versioning.

We foresee that research and tools relating to these use-cases will benefit directly from having our data collection available for analysis. However, aside from these concrete use-cases and more towards a Web science viewpoint, we also see some rather more fundamental – and possibly related – empirical questions that our collection should help answer:

Change frequency: Can we model the change frequency of documents with mathematical models and thus predict future changes?
Change patterns: Can we mine patterns that help to categorise change behaviour?
Degree of change: If a document changes, how much of its content is updated?
Lifespan: What is the lifespan of Linked Data documents?
Stability: How stable are Linked Data documents in terms of HTTP accessibility?
Growth rate: How fast is the Web of Data evolving?
Structural changes: Do we observe any changes in the structure of the network formed by links?
Change triggers: Can we find graph patterns that trigger or propagate changes through the network?
Domain-dependent changes: Do we observe a variation or clustering in dynamicity across different domains?
Vocabulary-dependent changes: Do we observe different change patterns for data using certain vocabularies, classes or properties?
Vocabulary changes: How do the semantics of vocabulary terms evolve over time?
3. BACKGROUND

Various papers have addressed similar research questions for the Web. Most works thus far have focused on the dynamicity of the traditional HTML Web, which (mostly) centres around dynamicity on the level of document changes. For the purposes of our use-cases, our notion of Linked Data dynamics goes deeper and looks to analyse dynamic patterns within the structured data itself: i.e., dynamicity should also be considered on the level of resources and of statements (as noted previously [29, 26]). Still, studies of the dynamicity of the HTML Web can yield interesting insights for our purposes. In fact, in previous works, we initially established a link between the frequency of change of Linked Data documents and HTML documents [28].

The study of the evolution of the Web (of Documents) and its implicit dynamics reaches back to the proposal of the first generation of autonomous World Wide Web spiders (aka. crawlers) around 1995. Bray [4] published one of the first studies about the characteristics of the Web and estimated its size in 1996. Around the same time, Web indexes such as AltaVista or Yahoo! began to offer one of the first concrete use-cases for understanding the change frequency of Web pages: the efficient maintenance of search engine indexes. In 1997, Coffman et al. [9] proposed a revisiting strategy for Web crawlers to improve the "freshness" of an index. This work was continuously improved over the subsequent years with additional experimental and theoretical results provided by Brewington [5, 6], Lim et al. [21], Cho et al. [8, 7], Fetterly et al. [14] and Ntoulas et al. [22], amongst others.

Based on large data collections, these papers presented theory and/or empirical analyses of the HTML Web that relate closely to the dynamicity questions we highlight. For example, various authors discovered that the change behaviour of Web pages corresponds closely with – and can be predicted using – a Poisson distribution [5, 6, 8, 7]; in previous work [28], we presented initial evidence that changes in Linked Data documents also follow a Poisson distribution, though our data was insufficient to be conclusive. Relating to high-level temporal change patterns, Ntoulas et al. [22] analysed the different frequency of updates for individual weekdays and working hours. The same paper also empirically estimated the growth rate of the Web to be ∼8 % new content every week, and regarding structural changes, found that the link structure of the Web changes faster than the textual content by a factor of 3. Various authors found that with respect to the degree of change, the majority of changes in HTML documents are minor [21, 14, 22]. Loosely related to change triggers, Fetterly et al. [14] found that certain parties simulate content changes to draw the attention of search engines. Regarding domain-dependent changes, various authors also showed that change frequencies vary widely across top-level domains [14, 8, 5, 6].
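To make the Poisson model concrete, the following sketch estimates a per-document change rate from equally spaced snapshots and predicts the chance of seeing at least one change before the next visit. It is an illustrative sketch under our own assumptions (hash-based change detection, a naive rate estimator, invented names); Cho and Garcia-Molina [8] derive less biased estimators for exactly this setting.

```python
import math

def estimate_change_rate(digests, interval_hours):
    """Naive Poisson rate estimate (changes per hour) for one document,
    given content digests from equally spaced snapshots."""
    changes = sum(1 for a, b in zip(digests, digests[1:]) if a != b)
    observed_hours = (len(digests) - 1) * interval_hours
    return changes / observed_hours if observed_hours else 0.0

def change_probability(rate, horizon_hours):
    """P(at least one change within the horizon) under a Poisson process."""
    return 1.0 - math.exp(-rate * horizon_hours)

# Hypothetical weekly snapshots of a single document (content hashes).
digests = ["a1", "a1", "b7", "b7", "c9", "c9", "c9"]
lam = estimate_change_rate(digests, interval_hours=168)   # 168 h = one week
print(change_probability(lam, horizon_hours=168))          # ~0.28 for this toy series
```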
Relating to use-cases for studying the dynamics of Web documents, a variety have been introduced down through the years, including (i) the improvement of Web proxies or caches looked at by, e.g., Douglis et al. [13], and (ii) efficient handling of continuous queries over documents [24] or, returning to RDF, over SPARQL endpoints [25]. We refer interested readers to the excellent survey by Oita and Senellart [23], which provides a comprehensive overview of existing methodologies to detect Web page changes, and also surveys general studies about Web dynamics.

4. WHAT IS THE WEB OF DATA?

Our data collection should allow researchers to study various aspects and characteristics of data dynamics across a broad selection of Linked Data domains. However, it is not clear which data providers should be considered as "in scope", which are of interest for the Linked Data community who we target, and how we should define our target population. Linked Data itself is a set of principles and an associated methodology for publishing structured data on the Web in accordance with Semantic Web standards and Web Architecture tenets [19]. Various data providers are compliant with Linked Data principles to varying degrees: there is no one yardstick by which a dataset can be unambiguously labelled as "Linked Data".

For the purposes of the seminal Linking Open Data (LOD) project, Cyganiak and Jentzsch use a variety of minimal requirements a dataset should meet to be included in the LOD Cloud diagram [10], which serves as an overview of connections between Linked Data corpora. However, the LOD Cloud is biased towards large monolithic datasets published on one domain, and does not cover low-volume cross-domain publishing as is common for vocabularies such as FOAF, SIOC, etc. For example, platforms like identi.ca/status.net, Drupal, Wordpress, etc., can export compliant, decentralised Linked Data – using vocabularies such as FOAF and SIOC – from the various domains where they are deployed, but their exports are not in the LOD Cloud. Furthermore, it gives no explicit mention to the vocabularies themselves, which are of high relevance to our requirements.

A broader notion to consider is the Web of Data, which would cover these latter exporters and vocabularies, but which is somewhat ambiguous and with ill-defined borders. For our purposes, we define the Web of Data as being comprised of interlinked RDF data published on the Web. (This definition is perhaps more restrictive than some interpretations where, e.g., Sindice incorporates Microformats into their Web of Data index [11].) No clear road-map is available for the Web of Data per this definition; the LOD cloud only covers prominent subsets thereof. Perhaps the clearest picture of the Web of Data comes from crawlers that harvest RDF from the Web. A prominent example is the Billion Triple Challenge Dataset, which is made available every year and comprises data collected during a deep crawl of RDF/XML documents on the Web of Data. However, the precise composition of such datasets is unclear, and requires further study.

As such, the core question of what it is we want to monitor – i.e., what population of domains the Linked Data community is most interested in, and thus what population we should sample from – is non-trivial, and probably has no definitive answer. To get a better insight, in this section we contrast two such perspectives of the Web of Data, the:

Billion Triple Challenge 2011 dataset [27], which is collected from a Web crawl of over seven million RDF/XML documents; and the
Comprehensive Knowledge Archive Network (CKAN) repository – specifically the lodcloud group therein – containing high-level metrics reported by Linked Data publishers, used in the creation of the LOD Cloud.

We want to see how these two road-maps of the Web of Data can be used as the basis of defining a population to sample.

4.1 The BTC 2011 dataset

The BTC dataset is crawled from the Web of Data using the MultiCrawler framework [17] for the annual Billion Triple Challenge [3] at the International Semantic Web Conference (ISWC). The dataset empirically captures a deep, broad sample of the Web of Data in situ.

However, the details of how the Billion Triple Challenge dataset is collected are somewhat opaque. The seed list is sampled from the previous year's dataset [27], where one of the initial seed-lists in past years was gathered from various semantic search engines. The crawl is for RDF/XML content, and follows URIs extracted from all triple positions. Scheduling (i.e., prioritising URIs to crawl) is random, where URIs are shuffled at the end of each round. As such, any RDF/XML document reachable through other RDF/XML documents from the seed list is within scope; otherwise, what content is (or is not) in the BTC – and how "representative" the dataset is of the Web of Data – is difficult to ascertain purely from the collection mechanisms.

As such, it is more pertinent to look at what the dataset actually contains. The most recent BTC dataset – BTC 2011 – was crawled in May/June 2011. The final dataset contains 2.145 billion quadruples, extracted from 7.411 million RDF/XML documents. The dataset contains RDF documents sourced from 791 pay-level domains (PLDs): a pay-level domain is a direct sub-domain of a top-level domain (TLD) or a second-level country domain (ccSLD), e.g., dbpedia.org, bbc.co.uk. We prefer the notion of a pay-level domain since fully qualified domain names (FQDNs) exaggerate the diversity of the data: for example, sites such as livejournal.com assign different subdomains to individual users (e.g., danbri.livejournal.com), leading to millions of FQDNs on one site, all under the control of one publisher. The BTC 2011 dataset contained documents from 240,845 FQDNs, 233,553 of which were from the livejournal.com PLD. Henceforth, when we mention domain, we thus refer to a PLD (unless otherwise stated).
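For illustration, the following sketch groups URIs by pay-level domain. The tiny suffix set is a stand-in assumption; a real implementation would consult a complete public-suffix list to recognise all TLDs and ccSLDs.

```python
from urllib.parse import urlparse

# Stand-in suffix set; a production system would load the full Public Suffix List.
SUFFIXES = {"org", "com", "net", "ca", "fr", "co.uk", "gov.uk"}

def pay_level_domain(uri):
    """Direct sub-domain of a TLD or ccSLD, e.g.
    danbri.livejournal.com -> livejournal.com, www.bbc.co.uk -> bbc.co.uk."""
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    for size in (2, 1):  # try the longest matching suffix first
        if len(labels) > size and ".".join(labels[-size:]) in SUFFIXES:
            return ".".join(labels[-(size + 1):])
    return host

print(pay_level_domain("http://danbri.livejournal.com/foaf.rdf"))  # livejournal.com
print(pay_level_domain("http://www.bbc.co.uk/music"))              # bbc.co.uk
```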
On the left-hand side of Table 1 we enumerate the top-25 PLDs in terms of quadruples contributed to the BTC 2011 dataset. Notably, a large chunk of the dataset (∼64 %) is provided by the hi5.com domain: a social gaming site that exports a FOAF file for each user. As observed for similar corpora (cf. [20, Table A.1]), hi5.com has many documents, each with an average of over two thousand statements – an order of magnitude higher than most other domains – leading to it dominating the overall volume of BTC statements.

Figure 1: Distribution of the number of statements in documents for the BTC 2011 dataset (a) overall and (b) for hi5.com; as well as (c) the periodicity of the distribution of statements-per-document for hi5.com that causes the split tail in (a) & (b). [Axes: number of statements (x) vs. number of documents (y).]

The dominance of hi5.com – and to a lesser extent similar sites like livejournal.com – shapes the overall characteristics of the BTC 2011 dataset. To illustrate one prominent such example, Figure 1a gives the distribution of statements per document in the BTC dataset on log/log scale, where one can observe a rough power-law(-esque) characteristic. However, there is an evident three-way split in the tail emerging at about 120 statements, and ending in an outlier spike at around 4,000 statements. By isolating the distribution of statements-per-document for hi5.com in Figure 1b, we see that it contributes to the large discrepancies in that interval. The stripes are caused by periodic patterns in the data, due to its uniform creation: on the hi5.com domain, RDF documents with a statement count of 10 + 4f are heavily favoured, where ten triples form the base of a user's description and four triples are assigned to each of f friends.
Other lines are formed due to two optional fields (foaf:surname/foaf:birthday) in the user profile, giving a 9 + 4f and an 8 + 4f periodicity line. An enforced ceiling of f ≤ 1,000 friends explains the spike at (and around) 4,010.

The core message here is that although the BTC offers a broad view of the Web of Data, covering 791 domains, in absolute statement-count terms the dataset is skewed by a few high-volume exporters of FOAF, and in particular hi5.com. When deriving global statistics and views from the BTC, the results say more about the code used to generate hi5.com profiles than the efforts of thousands of publishers. (Furthermore, hi5.com is not even a prominent domain on the Web of Data in terms of being linked, and was ranked 179/778 domains in a PageRank analysis of a similar corpus; http://aidanhogan.com/ldstudy/table21.html) This is also a naturally-occurring phenomenon in other corpora (e.g., [12, 20]) crawled from the Web of Data – not just isolated to the BTC dataset(s) – and is not easily fixed. One option to derive meaningful statistics about the Web of Data from such datasets is to apply (aggregated) statistics over individual domains, and never over the corpus as a whole.
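As a minimal illustration of such per-domain aggregation (the numbers below are made up), a statistic is first computed within each PLD and then averaged across PLDs, so that hi5.com counts once rather than in proportion to its document volume:

```python
from statistics import mean

# statements-per-document samples, keyed by PLD (hypothetical values)
doc_sizes = {
    "hi5.com":     [4010, 3986, 2054, 2298],
    "dbpedia.org": [312, 75, 140],
    "bbc.co.uk":   [48, 52],
}

corpus_mean = mean(n for sizes in doc_sizes.values() for n in sizes)  # skewed by hi5.com
macro_mean = mean(mean(sizes) for sizes in doc_sizes.values())        # each PLD counts once

print(round(corpus_mean, 1), round(macro_mean, 1))
```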
4.2 CKAN/LOD cloud metadata

In contrast to the crawled view of the Web of Data, the CKAN repository indexes publisher-reported statistics about their datasets. These CKAN metadata are then used to decide eligibility for entry into the LOD cloud [10]: a highly prominent depiction of Linked Open Datasets and their interlinkage. A CKAN-reported dataset is listed in the LOD cloud iff it fulfils the following requirements: the dataset has to (1) be published according to core Linked Data principles, (2) contain at least one thousand statements, and (3) provide at least 50 links to other LOD cloud datasets (see http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation).

Given the shortcomings of the crawled perspective on the Web of Data, we explore these self-reported metadata to get an alternative view on what we should be sampling. On September 29, 2011, we downloaded the meta-information for the datasets listed in the lodcloud group on CKAN (http://thedatahub.org/group/lodcloud). The data contain example URIs for the dataset and statistics such as the number of statements. We discovered meta-data for 297 datasets, spanning 206 FQDNs and 133 PLDs. On the right-hand side of Table 1, we enumerate the top-25 largest reported datasets in the lodcloud group on CKAN. Note that where multiple datasets are defined on the same domain, the triple count is presented as the summation of said datasets. In this table, we see a variety of domains claiming to host between 9.8 billion and 94 million triples.

Regarding the data formats present in the LOD cloud, most of the datasets claim to serve RDF/XML data (85 %), and 4 % claim to serve RDFa (of which 50 % did not also offer RDF/XML). This shows the popularity of RDF/XML, but only supporting RDF/XML will still miss out on 15 % of datasets. However, the syntax metadata are somewhat unreliable, where improper mime-types are often reported.
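Read as a predicate over the self-reported metadata, the three LOD cloud admission criteria listed above amount to the following check; the record layout here is an assumption for illustration and not the actual CKAN schema.

```python
def eligible_for_lod_cloud(dataset):
    """Apply the three LOD cloud admission criteria to a
    (hypothetical) publisher-reported metadata record."""
    return (dataset.get("follows_linked_data_principles", False)
            and dataset.get("statements", 0) >= 1000
            and dataset.get("links_to_lod_datasets", 0) >= 50)

example = {"name": "example-dataset",
           "follows_linked_data_principles": True,
           "statements": 120000,
           "links_to_lod_datasets": 75}
print(eligible_for_lod_cloud(example))  # True
```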
4.3 BTC vs. CKAN/LOD

Finally, we contrast the two different perspectives of the Web of Data. Between both, there are 854 PLDs mentioned, with BTC covering 791 domains (∼92.6 %), CKAN/LOD covering 133 domains (∼15.6 %), and the intersection of both covering 70 domains (∼8.2 % overall; ∼8.8 % of BTC; ∼52.6 % of CKAN/LOD). CKAN/LOD reports a total of 28.4 billion triples, whereas the BTC (an incomplete crawl) accounts for 2.1 billion quadruples (∼7.4 %). However, only 384.3 million quadruples in the BTC dataset (∼17.9 %) come from PLDs mentioned in the extracted CKAN/LOD metadata.

In Table 1, we present the BTC and CKAN/LOD statement counts side-by-side. We can observe that a large number of high-volume BTC domains are not mentioned on CKAN/LOD, where the datasets in question may not publish enough RDF data to be eligible for CKAN/LOD, may not follow Linked Data principles or have enough external links, or may not have self-reported.

Perhaps more surprisingly however, we note major discrepancies in terms of the catchment of BTC statements versus CKAN/LOD metadata. Given that BTC can only sample larger domains, a lower statement count is to be expected in many cases: however, some of the largest CKAN/LOD domains do not appear at all. Reasons can be found through analysis of the BTC 2011's publicly available access log [27]. In Table 2, we present reasons for the top-10 highest-volume CKAN/LOD data providers not appearing in the BTC 2011 dataset (i.e., providers appearing with "—" on the right-hand side of Table 1). Robots indicates that crawling was prohibited by robots.txt exclusions; Http-401 and Http-502 indicate the result of lookups for URIs on that domain; Mime indicates that the content on the domain did not return application/rdf+xml, used as a heuristic in the BTC crawl to filter non-RDF/XML content; Unreachable indicates that no lookups were attempted on URIs from that domain; Other refers solely to europeana.eu, which redirected all requests to their home page.

Table 1: Statement counts for top-25 PLDs in the BTC with corresponding reported triple count in CKAN (left), and top-25 PLDs in CKAN with BTC quad count (right).

№  | BTC PLD             | BTC           | LOD           || CKAN/LOD PLD        | LOD           | BTC
1  | hi5.com             | 1,371,854,358 | —             || rpi.edu             | 9,803,140,000 | 900,464
2  | livejournal.com     | 169,863,721   | —             || linkedgeodata.org   | 3,000,000,000 | —
3  | tfri.gov.tw         | 153,300,321   | 23,015,257    || legislation.gov.uk  | 1,900,000,000 | 31,990,934
4  | scinets.org         | 56,075,080    | —             || wright.edu          | 1,730,284,735 | 5
5  | ontologycentral.com | 55,124,003    | 122,000,000   || concordia.ca        | 1,500,000,000 | —
6  | rdfize.com          | 36,154,381    | —             || data.gov.uk         | 1,336,594,576 | 13,302,277
7  | legislation.gov.uk  | 31,990,934    | 1,900,000,000 || dbpedia.org         | 1,204,000,000 | 25,776,027
8  | identi.ca           | 30,429,795    | —             || rdfabout.com        | 1,017,648,918 | —
9  | bibsonomy.org       | 28,670,581    | —             || dbtune.org          | 888,089,845   | 1,634,891
10 | dbpedia.org         | 25,776,027    | 1,204,000,000 || uniprot.org         | 786,342,579   | 4,004,440
11 | freebase.com        | 25,488,720    | 337,203,427   || unime.it            | 586,000,000   | —
12 | opera.com           | 23,994,423    | —             || uriburner.com       | 486,089,121   | —
13 | bio2rdf.org         | 20,168,230    | 72,585,132    || openlibrary.org     | 400,000,000   | 25,396
14 | archiplanet.org     | 13,394,199    | —             || sudoc.fr            | 350,000,000   | —
15 | data.gov.uk         | 13,302,277    | 1,336,594,576 || freebase.com        | 337,203,427   | 25,488,720
16 | loc.gov             | 7,176,812     | 24,151,586    || fu-berlin.de        | 247,527,498   | 5,658,444
17 | vu.nl               | 6,106,366     | 14,948,788    || dataincubator.org   | 205,880,247   | 3,695,950
18 | bbc.co.uk           | 5,984,102     | 80,023,861    || viaf.org            | 200,000,000   | —
19 | rambler.ru          | 5,773,293     | —             || europeana.eu        | 185,000,000   | —
20 | fu-berlin.de        | 5,658,444     | 247,527,498   || moreways.net        | 160,000,000   | —
21 | uniprot.org         | 4,004,440     | 786,342,579   || rkbexplorer.com     | 134,543,526   | 220
22 | dataincubator.org   | 3,695,950     | 205,880,247   || ontologycentral.com | 122,000,000   | 55,124,003
23 | zitgist.com         | 3,446,077     | 60,000,000    || opencorporates.com  | 100,000,000   | —
24 | daml.org            | 3,135,225     | —             || uberblic.org        | 100,000,000   | —
25 | mybloglog.com       | 2,952,925     | —             || geonames.org        | 93,896,732    | 458,490

Table 2: Reasons for the largest ten PLDs in CKAN/LOD not appearing in BTC 2011 (reason categories: Robots, Mime, Http-401, Http-502, Unreachable, Other; PLDs: linkedgeodata.org, concordia.ca, rdfabout.com, unime.it, uriburner.com, sudoc.fr, viaf.org, europeana.eu, moreways.net, uberblic.org).

Table 3: Advantages and disadvantages for both perspectives of the Web of Data.
BTC – Pros: covers more domains (791); empirically validated; includes vocabularies; includes decentralised datasets. Cons: influence of high-volume domains; misses 47.4 % of LOD/CKAN domains.
LOD/CKAN – Pros: domains pass "quality control"; community validated. Cons: covers fewer domains (133); self-reported statistics; misses vocabularies; misses decentralised datasets.

In summary, we see two quite divergent perspectives on the Web of Data, given by the BTC 2011 dataset and the CKAN/LOD metadata. Towards getting a better picture of what population we wish to sample for the monitoring experiments, and towards deciding on a sampling methodology, the pertinent question is then: Which perspective best suits the needs of our monitoring experiment?

As enumerated in Table 3, both perspectives have inherent strengths and weaknesses. As such, we believe that our sampling method should try to take the best of both perspectives, towards a hybrid view of the Web of Data.

5. SAMPLING METHODOLOGY

Due to the size of the Web and the need for frequent snapshots, sampling is necessary to create an appropriate collection of URIs that can be processed and monitored under the given time, hardware and bandwidth constraints. The goal of our sampling method is thus two-fold: to select a set of URIs that (i) capture a wide cross-section of domains and (ii) can be monitored in a reasonable time given our resources and in a polite fashion. Given the previous discussion, we wish to combine the BTC/crawled and CKAN/metadata perspectives when defining our seed-list.

Before we continue to describe the sampling methodology we choose, it is worthwhile to first remark upon sampling methods used in other dynamicity studies of the Web.
Published Sampling Techniques. There are several published works that present sampling techniques in order to create a corpus of Web documents that can be monitored over time. Having studied a variety of major works, we could not find common agreement on a suitable sampling technique for such purposes. The targeted use-cases and research questions directly affect the domain and number of URIs, as well as the monitoring frequency and time frame. Grimes and O'Brien [16] studied the dynamics of very frequently changing pages and prepared their seed list accordingly: initially starting from a list of URIs provided by a Google crawl, they performed two crawls a day and selected the most dynamic URIs (in terms of content changes) that could also be successfully accessed 500 times in a row. As such, the authors focus on monitoring stable and dynamic Web documents. Fetterly et al. [14] randomly sampled URIs from a crawl seeded from the Yahoo! homepage to cover many different topics and providers, giving broad coverage for a general-purpose study; however, the surveyed documents would be sensitive to the underlying distribution of documents in the original Yahoo! corpus. Cho and Garcia-Molina [8] studied the change frequency of pages using a dataset which was sampled by selecting the root URIs of the 270 most popular domains from a 25 million web page crawl and then crawling three thousand pages for each of the domains; this method provides a good, broad balance of documents across different domains.

Our sampling method. We can conclude that existing sampling methods select URIs from crawled documents, either randomly, because of specific characteristics (e.g., dynamic or highly ranked), or to ensure an even spread across different hosts. Thus, we decide to use a combination of these three methods to generate our list of URIs.

Given the previous discussion of Section 4, we start with an initial seed list of URIs taken from: (1) the registered example URIs for the datasets in the LOD cloud; and (2) the most popular URIs in the BTC dataset of 2011. The most popular URIs are selected based on a PageRank analysis of the documents in the BTC 2011 dataset, where we select the top-k ranked documents from this analysis (please see [15] for details); note that many of the top-ranked documents refer to commonly instantiated vocabularies on the Web of Data, which are amongst the most linked/central Linked Data documents. At the time of access, we found 220 example URIs in the CKAN/LOD registry, and we complement them with the top-220 document URIs from the BTC 2011 to generate a list of 440 core URIs for monitoring. The core URIs contain 137 PLDs, 120 from the CKAN/LOD examples and 37 from the most popular BTC URIs. This selection guarantees us to cover all relevant domains (similar to [14]) and to also consider the most popular and interlinked URIs on the Web of Data (similar to [8]).
Obviously, 440 seed URIs are insufficient to resolve a meaningful corpus for observation over time. Thus, we decide to use a crawl and expand outwards from these core URIs to find other documents to monitor in their vicinity. Importantly, we wish to stay in the close locale of the 440 core URIs; if we go further, we will encounter the same problems as observed for the BTC 2011 dataset, where the data are skewed by a few high-volume exporters. To avoid being diluted by, e.g., hi5.com data and the like, we thus stay within a 2-hop crawl radius from the core URIs. From the data thus extracted, we then sample a final set of extended seed URIs to monitor. The result is then our best-effort compromise to achieve representative snapshots of Linked Data that (i) take into account both views on Linked Data by including CKAN and BTC URIs in the core seed list, (ii) extend beyond the core seed list in a defined manner (2 hops), and (iii) do not exceed our crawling resources.

Crawling setup. The crawling setup will have a significant effect on the selection of URIs to monitor, and so we provide some detail thereupon. Our implementation is based on two open source Java projects: (1) LDSpider (http://ldspider.googlecode.com/), a multi-threaded crawling framework for Linked Data; and (2) any23 (http://any23.org/), a parsing framework for various RDF syntaxes. The experiments are intended to run on a dedicated, single-core 2.2 GHz Opteron x86-64 machine with 4 GB RAM on a university network. Thereafter, we use the following configuration:

All RDF syntaxes: unlike BTC crawls, which only consider RDF/XML, we wish to consider all standard serialisation formats for RDF (including Turtle, which is soon to be standardised), as supported by any23. Further, we do not pre-filter content based on Content-type reporting or file extensions. RDFa is becoming a preferred format for many publishers: when parsing RDFa, we monitor the output statements and exclude the content of HTML documents for which we find only "accidental" triples as extracted from titles, stylesheets, icons, etc., and instead only consider documents that intend to publish non-trivial RDFa.
Threads: multithreading keeps the hardware busy while slow HTTP requests are being processed; in previous work [20], we found 64 threads to offer the best tradeoff between parallelism and CPU/disk contention.
Timeouts: we terminate unresponsive connections and sockets after 120 seconds. Timeouts are kept deliberately high to help ensure stable crawls.
Links: we consider all URIs contained in the RDF data as potential links to follow (and, e.g., not only the value of rdfs:seeAlso or such).
Breadth-first: we crawl documents in a round-based, URI-scheduling approach, which should result in a broader set of diverse data (assuming a diverse seed-list) [20].
Redirects: 301, 302, 303 and 307 HTTP response codes are not treated as links, but are instead followed directly in the same round until we reach a non-redirect response (or hit a cycle/path-limit).
Per-domain Queue: our crawler queue is divided into individual per-domain queues, which are polled round-robin: this helps ensure a maximal delay between accessing the same domain again.
Priority Queue: within each individual per-domain queue, we prioritise URIs for which we have found the most links thus far. (This only affects the crawl if an incomplete round is performed.)
Politeness Policy: we implement a politeness delay of two seconds, meaning that we do not access the same PLD twice within that interval; further, for each domain, we retrieve and respect standard robots.txt exclusions.

The minimum amount of time taken to complete a round is thus the maximum number of active URIs for a single domain, multiplied by the politeness delay. The combination of this per-PLD delay and the distribution of documents per domain on the Web [20] introduces the problem of PLD starvation [20]: the nature of the Web of Data means that after numerous low-volume domains have been crawled, the few remaining domains are not enough to keep the crawler resources busy between the politeness delay. In the worst case, when one PLD is left in the queue, only one URI can be crawled every two seconds. However, the frontier – the list of URIs for the next round – may contain a diverse set of domains that can overcome this problem, and allow for higher crawling throughput. Hence, we also add a termination condition for each round: once (1) the seed URIs have been processed, (2) all active redirects have been processed, and (3) the number of active PLDs remaining in the per-domain queue drops under the number of threads (in our setup 64), we end the current round and move to the next round (shifting remaining URIs to the frontier).
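The sketch below distils the queueing behaviour described above: per-PLD queues, a two-second politeness window, and the round-termination test on the number of active PLDs. It is a simplified, single-threaded illustration under our own naming, not the actual LDSpider implementation. It also makes the round-time bound tangible: with a two-second delay, a PLD holding, say, the 7,850 kernel URIs later reported for gesis.org (Table 4) would by itself bound a round at roughly 7,850 × 2 s ≈ 4.4 hours.

```python
import time
from collections import deque, defaultdict

POLITENESS_DELAY = 2.0  # seconds between two lookups on the same PLD
THREADS = 64            # active-PLD threshold used by termination condition (3)

class PerDomainFrontier:
    """Per-PLD queues polled in turn, skipping PLDs still inside their politeness window."""

    def __init__(self):
        self.queues = defaultdict(deque)  # PLD -> URIs still to fetch in this round
        self.last_access = {}             # PLD -> time of the last lookup

    def add(self, pld, uri):
        self.queues[pld].append(uri)

    def next_uri(self):
        """Return a polite (pld, uri) pair, or None if every PLD must still wait."""
        now = time.monotonic()
        for pld in list(self.queues):
            if now - self.last_access.get(pld, 0.0) >= POLITENESS_DELAY:
                uri = self.queues[pld].popleft()
                if not self.queues[pld]:
                    del self.queues[pld]
                self.last_access[pld] = now
                return pld, uri
        return None

    def starving(self):
        """True once fewer active PLDs remain than crawler threads (end the round)."""
        return len(self.queues) < THREADS
```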
Seed list. Starting from our list of 440 core URIs, we wish to expand a 2-hop crawl using the outlined framework, from which we will extract the final seed list of URIs to monitor in our observation framework. However, due to the unpredictability/non-determinism of remote connections, we want to ensure a maximal coverage of the documents in this neighbourhood. Along these lines, we repeated ten complete 2-hop crawls from our core URI list.

With respect to the non-determinism, Figure 2 shows for each crawl the number of documents (left bars) and the number of statements (right bars). We can observe that two crawls (crawl number 1 and 10) have a significantly higher number of statements compared to the other crawls. One reason for this large discrepancy relates to the identi.ca domain, where a URI (referring to Tim Berners-Lee's account; a highly ranked document in the BTC dataset) in the seed round of crawls 1 and 10 offered a rich set of links within that domain, whereas the lookup failed in the other crawls, cutting off the crawler's path in that domain: for example, in the first crawl, identi.ca accounted for 1.5 million statements, whereas in crawl 2, the domain accounted for 17 thousand statements. Such examples illustrate the inherent non-determinism of crawling.

Figure 2: Number of statements and documents per crawl experiment. [x-axis: Crawl №; documents in the left (red) bars, statements in the right (blue) bars.]

In Figure 3, we show for each crawl the number of visited PLDs per round together with the number of new PLDs per round with respect to the previous round. The left bar for each crawl represents Round 0, the middle bar Round 1, and the right bar Round 2. We can observe that the relative level of domains across the crawls is much more stable when compared with the number of documents (cf. Figure 2). Across rounds, the graph shows an average ∼1.3× increase of active PLDs between Rounds 0–1, and an increase of ∼3.4× between Rounds 1–2. Further, we observe that ∼30 % of the PLDs in Round 1 are new compared to the previous round, and roughly 70 % of the PLDs in Round 2 were not visited by the crawler in the rounds before.

Figure 3: Number of PLDs per round per crawl experiment. [x-axis: Crawl №; three bars per crawl for Rounds 0, 1 and 2; new PLDs highlighted.]

Given the non-deterministic coverage of documents, to ensure comprehensive coverage of URIs in the 2-hop neighbourhood, we take the union of URIs that dereferenced to RDF content, resulting in a total set of 95,737 URIs spanning 652 domains, giving an average of 146.8 dereferenceable URIs per domain. Figure 4 shows in log/log scale the distribution of the number of PLDs (y-axis) against the number of URIs in the union list (x-axis); we see that 379 PLDs (∼58.1 %) have one URI in the list, 78 PLDs (∼12.0 %) have two URIs, and so forth. In addition, Table 4 lists the number of URIs for the top-10 PLDs in the set (represented by the ten rightmost dots in Figure 4).

Figure 4: Distribution of the number of documents per PLD. [log/log scale; x-axis: number of documents; y-axis: number of PLDs.]

Table 4: Top 10 PLDs based on the number of URIs.
№  | PLD               | URIs
1  | gesis.org         | 7,850
2  | chem2bio2rdf.org  | 5,180
3  | dbpedia.org       | 3,643
4  | freebase.com      | 3,026
5  | fer.hr            | 2,902
6  | loc.gov           | 2,784
7  | concordia.ca      | 2,784
8  | dbtune.org        | 2,767
9  | fu-berlin.de      | 2,689
10 | semantictweet.com | 2,681
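A minimal sketch of this union-and-count step (names and data handling are assumptions; pay_level_domain is the helper sketched in Section 4.1): the kernel is the union of URIs that dereferenced to RDF in any of the ten crawls, and the per-PLD counts behind Figure 4 and Table 4 follow directly.

```python
from collections import Counter

def build_kernel(crawl_results):
    """crawl_results: one set of successfully dereferenced RDF URIs per crawl run."""
    kernel = set().union(*crawl_results)                    # union over the ten runs
    per_pld = Counter(pay_level_domain(u) for u in kernel)  # distribution behind Fig. 4
    return kernel, per_pld

# kernel, per_pld = build_kernel(ten_crawl_results)
# per_pld.most_common(10)  # cf. Table 4: gesis.org, chem2bio2rdf.org, dbpedia.org, ...
```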
6. MONITORING CONFIGURATION

The next step in the setup of our observatory is to select the monitoring techniques and intervals we apply. Note that we have yet to start the monitoring experiments; we now instead present some initial results and outline our proposed methodology for feedback from the community.

In general, there exist two fundamental monitoring techniques. The first technique is to periodically download the content of a fixed list of URIs, as widely used in the literature [5, 6, 7, 8]; this technique allows one to study the evolution of sources over time in a contiguous fashion. The second technique is to periodically perform a crawl from a defined set of URIs [22]; this technique is more suitable if one wants to study the evolution of the neighbourhood network of the seed URIs in an adaptive fashion, but it can also introduce a factor of randomness based on the crawling methods.

We decided to again apply a hybrid approach: primarily, we do not want to limit our observations to URIs online at the start of the experiment, although they will still be our focus. We thus take the set of 95,737 sampling URIs extracted in the previous section as a kernel of contiguous URIs accessed consistently in each snapshot. From the kernel, we propose to crawl as many URIs again using the crawler configuration outlined in the previous section, forming the adaptive segment of the snapshot. Roughly half of our snapshot would comprise the contiguous kernel, reliably providing data about said URIs; and the other half of our snapshot would comprise the adaptive crawl, reflecting changes in the neighbourhood of the kernel. We do not limit PLDs in the adaptive crawl so as to not exclude data providers that come online during the course of the experiment. This setup allows for studying (i) dynamics within datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).

Next, we must decide on the monitoring intervals for our platform: how frequently we wish to perform our crawl. In the literature, it is common to take the data snapshots in either a daily [22, 14] or weekly [5, 6, 8, 7] fashion. Again, in a practical sense, the intervals are highly dependent on the available resources and the size of the seed list. Given our resources and the monitoring requirements, we decided to perform a full snapshot every week.

In addition, to get more granular data in a temporal sense, we propose to apply an adaptive scheduling policy that takes more frequent snapshots for more dynamic data. As such, we propose to set up different monitoring intervals within the full weekly snapshots, where we increase the monitoring frequency for a kernel document if it changes within two consecutive snapshots at the previous interval. Figure 5 depicts the core idea. Using this adaptive approach, we can avoid the local and remote computational overhead involved in regularly polling documents that are observed to be very static. At the moment, we consider fixing the maximum number of intervals per week to 16, which resolves to a time interval width of ∼10 hours. However, the intervals will take a minimum of a week to "warm up", and will probably take longer to stabilise; thus, we can manually add further intervals on-the-fly at a later stage once the experiments are underway (if deemed useful from the results).

Figure 5: Adaptive scheduling based on change events. [Crawl frequency steps from 1× to 2×, 4×, 8× and 16× per week as successive document content changes are observed.]
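The doubling scheme of Figure 5 can be written down in a few lines. This is an illustrative sketch of our reading of the policy (how a "change" between snapshots is detected is left abstract), not a final implementation.

```python
FREQUENCIES = (1, 2, 4, 8, 16)  # snapshots per week; 16/week is one every ~10.5 hours

def next_frequency(current, changed_in_two_consecutive_snapshots):
    """Double a kernel document's crawl frequency after it changed in two
    consecutive snapshots at its current rate, capped at 16x per week."""
    if changed_in_two_consecutive_snapshots and current < FREQUENCIES[-1]:
        return FREQUENCIES[FREQUENCIES.index(current) + 1]
    return current

print(next_frequency(1, True))    # 2: weekly document observed to change twice in a row
print(next_frequency(8, False))   # 8: no recent change, keep the current rate
print(next_frequency(16, True))   # 16: already at the cap
```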
Initial experiment. We conducted an initial experiment, performing a crawl for the 95,737 URI kernel. Our framework downloaded 16 million statements from 80,000 documents, taking a total of 6 hours and 40 minutes. Figure 6 shows the number of documents processed per crawl hour. The average download rate increases from 20k to 70k documents per hour after 60 minutes, but tails off severely after hour 3 due to PLD starvation. We see that processing the kernel is thus feasible within a 10-hour interval.

Figure 6: Number of downloaded documents over time. [x-axis: hours since the start of the crawl; y-axis: number of documents.]

7. CONCLUSION & OUTLOOK

In this paper, we have presented ongoing work towards building DyLDO: the Dynamic Linked Data Observatory. This observatory aims to continuously gather a high-quality collection of Linked Data snapshots for the community to gain a better insight into the underlying principles of dynamicity on the Web of Data. We motivated our proposals based on several concrete use-cases and research questions that could be tackled using such a collection, and presented related works that treat various aspects of dynamicity for HTML documents on the traditional Web. Next we looked at the non-trivial question of what view we should adopt for the Web of Data, introducing and comparing the BTC and CKAN/LOD perspectives, showing how and where they diverge, and weighing up their respective pros and cons. We proposed selecting a kernel of sources to monitor around a core set of URIs taken from the BTC and CKAN/LOD datasets, which are then extended by a 2-hop crawl. We also presented the detailed crawl configuration we plan to use for our monitoring experiments. Finally, we proposed our methodology for performing the continuous observation of sources in and around the kernel, as well as using adaptive intervals to monitor more dynamic sources more frequently.

We plan to begin the monitoring experiments in the next few weeks, and to run them continuously and indefinitely. We still have some practical issues to tackle in terms of creating a backup and archiving system, a site offering access to the community (planned for http://swse.deri.org/DyLDO/), as well as remote fallback procedures in the event of network or hardware failure. Thus, we are at a crucial stage, and are keen to gather final feedback and requirements from the community: we are particularly eager to hear from people who would have a specific interest or use-case for such data – be it in terms of research analysis, evaluation frameworks, etc. – what requirements they have, and whether or not the current proposals would be sufficient. Significant changes will invalidate the snapshots collected up to that point, so we want to gather comments and finalise and activate the framework as soon as possible. In the near future, the community can then begin to chart a new dimension for the Web of Data: time.
8. References

[1] T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, R. Dhanaraj, J. Hollenbach, A. Lerer, and D. Sheets. "Tabulator: Exploring and Analyzing linked data on the Semantic Web". In: SWUI. 2006.
[2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. "DBpedia - A crystallization point for the Web of Data". In: J. Web Sem. 7.3 (2009), pp. 154–165.
[3] C. Bizer and D. Maynard. "The Semantic Web Challenge, 2010". In: J. Web Sem. 9.3 (2011), p. 315.
[4] T. Bray. "Measuring the Web". In: Comput. Netw. ISDN Syst. 28.7–11 (1996), pp. 993–1005.
[5] B. Brewington and G. Cybenko. "How dynamic is the Web?" In: Computer Networks (2000).
[6] B. Brewington and G. Cybenko. "Keeping up with the changing web". In: Computer 33.5 (2000), pp. 52–58.
[7] J. Cho and H. Garcia-Molina. "Effective page refresh policies for Web crawlers". In: ACM Transactions on Database Systems 28.4 (Dec. 2003), pp. 390–426.
[8] J. Cho and H. Garcia-Molina. "Estimating frequency of change". In: ACM Transactions on Internet Technology 3.3 (Aug. 2003), pp. 256–290.
[9] E. G. Coffman Jr., Z. Liu, and R. R. Weber. "Optimal robot scheduling for web search engines". In: Journal of Scheduling 1 (1997), pp. 0–21.
[10] R. Cyganiak and A. Jentzsch. The Linking Open Data cloud diagram. URL: http://richard.cyganiak.de/2007/10/lod/ (visited on 02/06/2012).
[11] R. Delbru, N. Toupikov, M. Catasta, and G. Tummarello. "A Node Indexing Scheme for Web Entity Retrieval". In: ESWC (2). 2010, pp. 240–256.
[12] L. Ding and T. Finin. "Characterizing the Semantic Web on the Web". In: ISWC. 2006, pp. 242–257.
[13] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. Mogul. "Rate of Change and other Metrics: a Live Study of the World Wide Web". In: USENIX Symposium on Internetworking Technologies and Systems. Dec. 1997.
[14] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. "A large-scale study of the evolution of web pages". In: WWW. 2003, pp. 669–678.
[15] B. Glimm, A. Hogan, M. Krötzsch, and A. Polleres. OWL: Yet to arrive on the Web of Data? CoRR. URL: http://arxiv.org/pdf/1202.0984.pdf (visited on 02/06/2012).
[16] C. Grimes and S. O'Brien. "Microscale evolution of web pages". In: WWW. 2008, pp. 1149–1150.
[17] A. Harth, J. Umbrich, and S. Decker. "MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data". In: International Semantic Web Conference. 2006, pp. 258–271.
[18] O. Hartig, C. Bizer, and J. C. Freytag. "Executing SPARQL Queries over the Web of Linked Data". In: ISWC. 2009, pp. 293–309.
[19] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Vol. 1. Morgan & Claypool, 2011, pp. 1–136.
[20] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. "Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine". In: J. Web Sem. 9.4 (2011), pp. 365–401.
[21] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. "Characterizing web document change". In: Advances in Web-Age Information Management (2001), pp. 133–144.
[22] A. Ntoulas, J. Cho, and C. Olston. "What's new on the web? The evolution of the web from a search engine perspective". In: WWW. 2004, pp. 1–12.
[23] M. Oita and P. Senellart. Deriving Dynamics of Web Pages: A Survey. INRIA TR inria-00588715. 2011.
[24] S. Pandey, K. Ramamritham, and S. Chakrabarti. "Monitoring the dynamic web to respond to continuous queries". In: WWW. 2003, pp. 659–668.
[25] A. Passant and P. Mendes. "sparqlPuSH: Proactive notification of data updates in RDF stores using PubSubHubbub". In: Scripting and Development Workshop at ESWC. 2010.
[26] N. Popitsch and B. Haslhofer. "DSNotify - A solution for event detection and link maintenance in dynamic datasets". In: J. Web Sem. 9.3 (2011), pp. 266–283.
[27] The Billion Triple Challenge. URL: http://challenge.semanticweb.org/ (visited on 02/06/2012).
[28] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker. "Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources". In: Proc. of LDOW at WWW. 2010.
[29] J. Umbrich, M. Karnstedt, and S. Land. "Towards Understanding the Changing Web: Mining the Dynamics of Linked-Data Sources and Entities". In: Proc. of KDML at LWA. 2010.
[30] J. Umbrich, M. Karnstedt, J. X. Parreira, A. Polleres, and M. Hauswirth. "Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces". In: DESWEB at ICDE. 2012.
[31] J. Umbrich, B. Villazon-Terrazas, and M. Hausenblas. "Dataset Dynamics Compendium: Where we are so far!" In: Proc. of COLD at ISWC. Shanghai, 2010.