<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a Dynamic Linked Data Observatory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Käfer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jürgen Umbrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel Polleres</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Enterprise Research Institute, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Siemens AG Österreich</institution>
          ,
          <addr-line>Siemensstrasse 90, 1210, Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>16</volume>
      <issue>2012</issue>
      <abstract>
        <p>We describe work-in-progress on the design and methodology of the Dynamic Linked Data Observatory: a framework to monitor Linked Data over an extended period of time. The core goal of our work is to collect frequent, continuous snapshots of a subset of the Web of Data that is interesting for further study and experimentation, with an aim to capture raw data about the dynamics of Linked Data. The resulting corpora will be made openly and continuously available to the Linked Data research community. Herein, we (1) motivate the importance of such a corpus; (2) outline some of the use-cases and requirements for the resulting snapshots; (3) discuss different "views" of the Web of Data that affect how we define a sample to monitor; (4) detail how we select the scope of the monitoring experiment through sampling; and (5) discuss the final design of the monitoring framework that will gather regular snapshots of (subsets of) the Web of Data over the coming months and years.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Linked Data enjoys continued momentum in terms of
publishing, research and development; as a result, the Web of
Data continues to expand in size, scope and diversity.
However, we see a gap in terms of understanding how the Web
of Data evolves and changes over time. Establishing a more
fine-grained understanding of the nature of Linked Data
dynamics is of core importance to publishing, research and
development. With regard to publishing, a better
understanding of Linked Data dynamics would, for example,
inform the design of tools for keeping published data
consistent with changes in external data (e.g., [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]). With regard
to research, more granular data on Linked Data dynamics
would open up new paths for exploration, e.g., the design of
hybrid/live query approaches [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] that know when a query
relates to dynamic data, and that retrieve fresh results
directly from the source. With regard to development, the
results of such studies would inform crawling strategies,
local index refresh rates and update strategies, cache
invalidation techniques and tuning, and so forth.
      </p>
      <p>
        Towards a better understanding of Linked Data
dynamics, the community currently lacks a high-quality, broad,
granular corpus of raw data that provides frequent
snapshots of Linked Data documents over a sustained period of
time. Current results in the area rely on domain-specific
datasets (e.g., Popitsch and Haslhofer [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] focus on
DBpedia changes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), or infrequently updated snapshots of open-domain
data over short monitoring time-spans (e.g., our
previous work looked at 20 weekly snapshots collected in
2008 [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]). Thus, we believe that a first, important step is to
build from scratch a new monitoring framework to derive a
bespoke, continuously updated collection of snapshots. This
collection can then be freely used to study not only the
high-level dynamics of entities, but also to distil the fundamental
underlying patterns of change in the Web of Data across
different domains (e.g., studying whether certain graph patterns
are more dynamic than others [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]).
      </p>
      <p>Acknowledgements. This work has been funded in part by Science
Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).</p>
      <p>Clearly the design of such a framework, and gathering
the requirements for the resulting collection, is non-trivial:
many factors and actors have to be taken into
consideration in the context of the open Web and the Linked Data
community. Conceptually, we want the collection to be:</p>
      <list list-type="bullet">
        <list-item><p>general-purpose: suitable for study by a wide range of interested parties;</p></list-item>
        <list-item><p>broad: capturing a wide selection of Linked Data domains;</p></list-item>
        <list-item><p>substantial: the number of documents monitored should allow for deriving confident statistical measures;</p></list-item>
        <list-item><p>granular &amp; frequent: offering detailed data on sources;</p></list-item>
        <list-item><p>contiguous: allowing comparison of sources over time; and</p></list-item>
        <list-item><p>adaptive: able to discover the arrival of new sources, and to monitor more dynamic sources more frequently.</p></list-item>
      </list>
      <p>However, some of these targets are antagonistic and demand
a trade-off. Monitoring a substantial number of sources in
a granular &amp; frequent fashion requires a practical
compromise to be made. Similarly, contiguous data and adaptive
monitoring are conflicting aims. Furthermore, the aims to
be general-purpose and broad need more concrete
consideration in terms of which sources are monitored.</p>
      <p>For implementing the monitoring framework, other
practical considerations include politeness such that remote servers
are not unnecessarily overburdened, stability such that the
monitoring experiment can function even in the case of
hardware failure, and resource overhead such that the
computation can be run on a single, dedicated commodity machine.</p>
      <p>Taking these design criteria into account, herein we
motivate and initially propose a framework for taking frequent,
continuous snapshots of a subset of Linked Data, which we
call DyLDO: the Dynamic Linked Data Observatory. We
currently focus on defining the size and scope of the
monitoring experiment, discussing the rationale behind our choice
of sources to observe, and touching upon various issues relating
to sampling the Web of Data. Later, we also sketch the
framework itself, outlining the crawling to be performed for
each snapshot, providing rationale for the different parameters
used in the crawl, and proposing an adaptive
filtering scheme for monitoring more dynamic sources more
frequently. Our primary goal here is to inform the community
of our intentions, outline our rationale, collect use-cases and
potential consumers of the snapshot collection, and gather
feedback and requirements prior to starting the monitoring
experiments.</p>
      <p>To begin, we motivate our work in terms of envisaged
use-cases and research questions our corpus could help with
(§2), subsequently presenting some related work (§3). Next,
in order to understand what we are monitoring/sampling
(to ascertain its borders) we ask the question What is
the Web of Data?, and compare two prominent "views"
thereof: (1) the Billion Triple Challenge dataset view, and
(2) the CKAN/LOD cloud view (§4). Thereafter, we
describe the sampling methodology we have used to derive a
"seed list" of URIs that form the core of the monitoring
experiment (§5). Finally, we outline the proposed monitoring
framework, detailing setup parameters and adaptive
extensions (§6). We conclude with future directions and a call for
feedback and potential use-cases from the community (§7).</p>
    </sec>
    <sec id="sec-2">
      <title>2. USE CASES AND OPEN QUESTIONS</title>
      <p>We now discuss the potential benefit and impact of our
proposed observatory for Linked Data based on (1) some
envisaged use-cases, and (2) some open research questions
that our data could help to empirically investigate.</p>
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], we gave an overview of Web and
Linked Data dynamics, presenting four community use cases
that require technical solutions to deal with dynamic data.
We first extend these four example scenarios to motivate our
ongoing work on the Dynamic Linked Data Observatory.
      </p>
      <sec id="sec-2-1">
        <title>UC-1: Synchronisation.</title>
        <p>Synchronisation addresses the problem of keeping an
offline sample/replica of the Web of Data up-to-date.</p>
        <p>The most common scenario is the maintenance of locally
cached LOD indexes. To the best of our knowledge, none of
the popular Semantic Web caches (such as those hosted by Sindice
and OpenLink) investigated index-update strategies based
on current knowledge of the dynamicity of Linked Data,
but rather rely on publisher-reported information about
update frequencies where available (e.g., sitemaps: http://sitemaps.org/).</p>
      </sec>
      <sec id="sec-2-2">
        <title>UC-2: Smart Caching.</title>
        <p>This use-case tries to find efficient methods to optimise
systems operating live over Web data by minimising network
traffic wasted on unnecessary HTTP lookups.</p>
        <p>
          In contrast to synchronisation, this use-case targets systems
that operate live and directly over the Web of Data. An
exemplary use-case would be the integration of smart caching
for live querying (e.g., [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]) or live browsing (e.g., [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]) over
Linked Data, avoiding re-accessing a document or
dereferencing a URI if it is unlikely to have changed according to
knowledge of dynamicity patterns.
        </p>
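        <p>The revalidation idea above can be sketched in code. The following is a minimal, illustrative model (the class name and the simulated ETag handling are our own, not part of any cited system): the cache keeps the validator returned with each document so that subsequent lookups can be made conditional, and a 304 Not Modified answer is served from the cached copy without re-transferring the document.</p>

```python
class DereferenceCache:
    """Toy cache of Linked Data documents keyed by URI.

    Alongside each body we store the validator (an ETag) returned by
    the server, so the next lookup can be made conditional and a
    304 Not Modified answer avoids re-transferring the document.
    """

    def __init__(self):
        self._store = {}  # uri -> (etag, body)

    def request_headers(self, uri):
        """Headers for the next lookup; conditional if we hold a copy."""
        if uri in self._store:
            etag, _ = self._store[uri]
            return {"If-None-Match": etag}
        return {}

    def handle_response(self, uri, status, etag=None, body=None):
        """Store a 200 response; on a 304, serve the cached body."""
        if status == 304:
            return self._store[uri][1]  # unchanged: reuse cached copy
        self._store[uri] = (etag, body)
        return body
```

        <p>In a real deployment the headers would be attached to an HTTP GET and the status code taken from the response; knowledge of dynamicity patterns would additionally decide whether to issue the revalidation request at all.</p>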
      </sec>
      <sec id="sec-2-3">
        <title>UC-3: Hybrid Architectures.</title>
        <p>A large index of Linked Data can implement a hybrid
architecture based on dynamicity statistics, where one
processing pipeline is optimised for static knowledge, and another
for dynamic knowledge.</p>
        <p>In various Linked Data search and query engines, there is
an inherent trade-off between running computation live
during query-time or pre-computing answers offline. Abstractly,
pre-computation suits static data and runtime computation
suits dynamic data. Example trade-offs include dynamic
insertions vs. batch loading, lightweight indexing vs.
heavyweight indexing, runtime joins vs. live joins,
backward-chaining reasoning vs. forward-chaining reasoning,
window-based stream operators vs. global operators, etc. Knowledge
of dynamicity can help decide which methods are
appropriate for which data. Furthermore, smart hybrid queries may
become possible: consider the query "Give me (current)
temperatures of European capitals", where knowledge
of dynamicity would reveal that temperatures are dynamic
and should be fetched live, whereas European capitals are
static and can be run (efficiently) over the local index.</p>
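        <p>The hybrid routing just described can be illustrated with a small sketch (the dynamicity statistics, predicate names and threshold below are invented purely for illustration): triple patterns whose predicates are known to change frequently are sent to live lookup, while the rest are answered from the local index.</p>

```python
# Hypothetical dynamicity statistics: predicate -> expected changes per day.
# These names and numbers are illustrative only.
DYNAMICITY = {
    "ex:temperature": 24.0,    # a frequently updated sensor reading
    "ex:capitalOf": 0.0001,    # an essentially static fact
}

LIVE_THRESHOLD = 1.0  # changes per day above which we dereference live

def route_patterns(patterns, stats=DYNAMICITY, threshold=LIVE_THRESHOLD):
    """Partition triple patterns into (static, dynamic) lists.

    Static patterns are answered from the local index; dynamic ones
    are fetched live at query time.
    """
    static, dynamic = [], []
    for s, p, o in patterns:
        target = dynamic if stats.get(p, 0.0) > threshold else static
        target.append((s, p, o))
    return static, dynamic
```

        <p>For the example query, the capital-of patterns would be routed to the local index and the temperature patterns fetched live.</p>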
      </sec>
      <sec id="sec-2-4">
        <title>UC-4: External-Link Maintenance.</title>
        <p>The link-maintenance use-case addresses the challenge of
preserving referential integrity and the correct type of links in
the presence of dynamic external data.</p>
        <p>
          Popitsch and Haslhofer [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] investigate this use case, which
involves monitoring external Web datasets for changes to
help ensure the integrity of links targeting them from local
data. They propose DSNotify: a solution and an
evaluation framework which is able to replay changes in a dataset;
however, the authors only have knowledge of DBpedia
dynamics to leverage. Such works help ensure the quality of
links between datasets, and we hope that our corpora will
help extend such applications to the broader Web of Data.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>UC-5: Vocabulary Evolution and Versioning.</title>
        <p>Knowledge of the dynamics of Linked Data vocabularies
could lead to better versioning methods.</p>
        <p>
          Changes in the semantics of terms in Linked Data
vocabularies can have a dramatic influence on the
interpretation of remote datasets using them. For example, the
FOAF vocabulary is extremely widely used on the Web of
Data [
          <xref ref-type="bibr" rid="ref12 ref20">12, 20</xref>
          ], but relies on informal versioning
mechanisms: aside from the monotonic addition of terms, for
example, FOAF recently removed disjointness constraints
between popular classes like foaf:Person and foaf:Document,
made foaf:logo inverse-functional, etc., which may change
inferencing over (or even break) external data. Studying
and understanding the dynamicity of vocabularies may
motivate and suggest better methodologies for versioning.
        </p>
        <p>We foresee that research and tools relating to these
use-cases will benefit directly from having our data collection
available for analysis. However, aside from these concrete
use-cases and more towards a Web Science viewpoint, we also
see some rather more fundamental (and possibly related)
empirical questions that our collection should help answer:</p>
        <list list-type="bullet">
          <list-item><p>Change frequency: Can we model the change frequency of documents with mathematical models and thus predict future changes?</p></list-item>
          <list-item><p>Change patterns: Can we mine patterns that help to categorise change behaviour?</p></list-item>
          <list-item><p>Degree of change: If a document changes, how much of its content is updated?</p></list-item>
          <list-item><p>Lifespan: What is the lifespan of Linked Data documents?</p></list-item>
          <list-item><p>Stability: How stable are Linked Data documents in terms of HTTP accessibility?</p></list-item>
          <list-item><p>Growth rate: How fast is the Web of Data evolving?</p></list-item>
          <list-item><p>Structural changes: Do we observe any changes in the structure of the network formed by links?</p></list-item>
          <list-item><p>Change triggers: Can we find graph patterns that trigger or propagate changes through the network?</p></list-item>
          <list-item><p>Domain-dependent changes: Do we observe a variation or clustering in dynamicity across different domains?</p></list-item>
          <list-item><p>Vocabulary-dependent changes: Do we observe different change patterns for data using certain vocabularies, classes or properties?</p></list-item>
          <list-item><p>Vocabulary changes: How do the semantics of vocabulary terms evolve over time?</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. BACKGROUND</title>
      <p>
        Various papers have addressed similar research questions
for the Web. Most works thus far have focused on the
dynamicity of the traditional HTML Web, which (mostly)
centres around dynamicity on the level of document changes.
For the purposes of our use-cases, our notion of Linked Data
dynamics goes deeper and looks to analyse dynamic patterns
within the structured data itself: i.e., dynamicity should also
be considered on the level of resources and of statements (as
noted previously [
        <xref ref-type="bibr" rid="ref26 ref29">29, 26</xref>
        ]). Still, studies of the dynamicity
of the HTML Web can yield interesting insights for our
purposes. In fact, in previous works, we initially established a
link between the frequency of change of Linked Data
documents and HTML documents [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        The study of the evolution of the Web (of Documents) and
its implicit dynamics reaches back to the proposal of the first
generation of autonomous World Wide Web spiders (aka.
crawlers) around 1995. Bray [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] published one of the first
studies about the characteristics of the Web and estimated
its size in 1996. Around the same time, Web indexes such as
AltaVista or Yahoo! began to offer one of the first concrete
use-cases for understanding the change frequency of Web
pages: the efficient maintenance of search-engine indexes.
In 1997, Coffman et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed a revisiting strategy
for Web crawlers to improve the "freshness" of an index.
This work was continuously improved over the subsequent
years with additional experimental and theoretical results
provided by Brewington [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], Lim et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Cho et al. [
        <xref ref-type="bibr" rid="ref7 ref8">8,
7</xref>
        ], Fetterly et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Ntoulas et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], amongst others.
      </p>
      <p>
        Based on large data collections, these papers presented
theory and/or empirical analyses of the HTML Web that
relate closely to the dynamicity questions we highlight. For
example, various authors discovered that the change
behaviour of Web pages corresponds closely with, and can
be predicted using, a Poisson distribution [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 8, 7</xref>
        ]; in
previous work [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], we presented initial evidence that changes
in Linked Data documents also follow a Poisson
distribution, though our data were insufficient to be conclusive.
Relating to high-level temporal change patterns, Ntoulas et
al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] analysed the different frequency of updates for
individual weekdays and working hours. The same paper also
empirically estimated the growth rate of the Web to be
8 % new content every week and, regarding structural
changes, found that the link structure of the Web changes
faster than the textual content by a factor of 3. Various
authors found that, with respect to the degree of change, the
majority of changes in HTML documents are minor [
        <xref ref-type="bibr" rid="ref14 ref21 ref22">21, 14,
22</xref>
        ]. Loosely related to change triggers, Fetterly et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
found that certain parties simulate content changes to draw
the attention of search engines. Regarding domain-dependent
changes, various authors also showed that change
frequencies vary widely across top-level domains [
        <xref ref-type="bibr" rid="ref14 ref5 ref6 ref8">14, 8, 5, 6</xref>
        ].
      </p>
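      <p>To make the Poisson change model concrete, the following sketch (our own illustration, not code from the cited studies) estimates a per-document change rate from observed snapshots and predicts the probability of a change within a given interval: for a homogeneous Poisson process with rate λ, the probability of at least one change in Δ days is 1 − exp(−λΔ).</p>

```python
import math

def estimate_change_rate(changes_observed, observation_days):
    """Maximum-likelihood estimate of the Poisson rate (changes per day)."""
    return changes_observed / observation_days

def prob_changed_within(rate, days):
    """Probability of at least one change within `days`:
    1 - exp(-rate * days) under a homogeneous Poisson process."""
    return 1.0 - math.exp(-rate * days)
```

      <p>For example, a document that changed 10 times over 20 days of monitoring has an estimated rate of 0.5 changes/day, giving roughly a 39 % chance of having changed by the next daily snapshot.</p>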
      <p>
        Relating to use-cases for studying the dynamics of Web
documents, a variety have been introduced over
the years, including (i) the improvement of Web proxies
or caches looked at by, e.g., Douglis et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and (ii) the efficient
handling of continuous queries over documents [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] or,
returning to RDF, over SPARQL endpoints [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. We refer
interested readers to the excellent survey by Oita and
Senellart [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], which provides a comprehensive overview of
existing methodologies to detect Web page changes, and also
surveys general studies about Web dynamics.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. WHAT IS THE WEB OF DATA?</title>
      <p>
        Our data collection should allow researchers to study
various aspects and characteristics of data dynamics across a
broad selection of Linked Data domains. However, it is
not clear which data providers should be considered as "in
scope", which are of interest for the Linked Data community
whom we target, and how we should define our target
population. Linked Data itself is a set of principles and an
associated methodology for publishing structured data on the
Web in accordance with Semantic Web standards and Web
Architecture tenets [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Various data providers are
compliant with Linked Data principles to varying degrees: there is
no one yardstick by which a dataset can be unambiguously
labelled as "Linked Data".
      </p>
      <p>
        For the purposes of the seminal Linking Open Data (LOD)
project, Cyganiak and Jentzsch use a set of minimal
requirements a dataset should meet to be included in the
LOD Cloud diagram [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which serves as an overview of
connections between Linked Data corpora. However, the LOD
Cloud is biased towards large monolithic datasets published
on one domain, and does not cover low-volume cross-domain
publishing as is common for vocabularies such as FOAF, SIOC,
etc. For example, platforms like identi.ca/status.net,
Drupal, WordPress, etc., can export compliant, decentralised
Linked Data (using vocabularies such as FOAF and SIOC)
from the various domains where they are deployed, but their
exports are not in the LOD Cloud. Furthermore, the diagram gives no
explicit mention to the vocabularies themselves, which are
of high relevance to our requirements.
      </p>
      <p>
        A broader notion to consider is the Web of Data, which
would cover these latter exporters and vocabularies, but
which is somewhat ambiguous and with ill-defined borders.
For our purposes, we define the Web of Data as being
comprised of interlinked RDF data published on the Web
(a definition that is perhaps more restrictive than some
interpretations where, e.g., Sindice incorporates Microformats
into their Web of Data index [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). No clear road-map is available for the Web of Data per this
definition; the LOD cloud only covers prominent subsets thereof.
Perhaps the clearest picture of the Web of Data comes from
crawlers that harvest RDF from the Web. A prominent
example is the Billion Triple Challenge dataset, which is made
available every year and comprises data collected during
a deep crawl of RDF/XML documents on the Web of Data.
However, the precise composition of such datasets is unclear,
and requires further study.
      </p>
      <p>
        As such, the core question of what it is we want to
monitor (i.e., what population of domains the Linked Data
community is most interested in, and thus what population we
should sample from) is non-trivial, and probably has no
definitive answer. To get a better insight, in this section we
contrast two such perspectives of the Web of Data:
      </p>
      <list list-type="bullet">
        <list-item><p>the Billion Triple Challenge 2011 dataset [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], which is collected from a Web crawl of over seven million RDF/XML documents; and</p></list-item>
        <list-item><p>the Comprehensive Knowledge Archive Network (CKAN) repository (specifically the lodcloud group therein), containing high-level metrics reported by Linked Data publishers, used in the creation of the LOD Cloud.</p></list-item>
      </list>
      <p>We want to see how these two road-maps of the Web of Data
can be used as the basis of defining a population to sample.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 The BTC 2011 dataset</title>
      <p>
        The BTC dataset is crawled from the Web of Data
using the MultiCrawler framework [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for the annual Billion
Triple Challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] at the International Semantic Web
Conference (ISWC). The dataset empirically captures a deep,
broad sample of the Web of Data in situ.
      </p>
      <p>
        However, the details of how the Billion Triple Challenge
dataset is collected are somewhat opaque. The seed list is
sampled from the previous year's dataset [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], where one of
the initial seed lists in past years was gathered from
various semantic search engines. The crawl is for RDF/XML
content, and follows URIs extracted from all triple
positions. Scheduling (i.e., prioritising URIs to crawl) is
random, where URIs are shuffled at the end of each round.
As such, any RDF/XML document reachable through other
RDF/XML documents from the seed list is within scope;
otherwise, what content is (or is not) in the BTC, and
how "representative" the dataset is of the Web of Data, is
difficult to ascertain purely from the collection mechanisms.
      </p>
      <p>As such, it is more pertinent to look at what the dataset
actually contains. The most recent BTC dataset (BTC
2011) was crawled in May/June 2011. The final dataset
contains 2.145 billion quadruples, extracted from 7.411
million RDF/XML documents. The dataset contains RDF
documents sourced from 791 pay-level domains (PLDs): a
pay-level domain is a direct sub-domain of a top-level domain
(TLD) or a second-level country domain (ccSLD), e.g.,
dbpedia.org, bbc.co.uk. We prefer the notion of a pay-level
domain since fully qualified domain names (FQDNs)
exaggerate the diversity of the data: for example, sites such
as livejournal.com assign different subdomains to
individual users (e.g., danbri.livejournal.com), leading to
millions of FQDNs on one site, all under the control of one
publisher. The BTC 2011 dataset contained documents from
240,845 FQDNs, 233,553 of which were from the
livejournal.com PLD. Henceforth, when we mention a domain, we
thus refer to a PLD (unless otherwise stated).</p>
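      <p>Collapsing an FQDN to its PLD can be sketched as follows; a production implementation would consult the full Public Suffix List (publicsuffix.org), whereas this toy version hard-codes a handful of suffixes purely for illustration.</p>

```python
# Toy public-suffix set; a real implementation would load the full
# Public Suffix List rather than hard-coding entries like these.
SUFFIXES = {"org", "com", "net", "co.uk"}

def pay_level_domain(fqdn):
    """Collapse a fully qualified domain name to its pay-level domain:
    the longest known public suffix plus one additional label."""
    labels = fqdn.lower().split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return fqdn
```

      <p>Under this scheme, danbri.livejournal.com collapses to livejournal.com, and www.bbc.co.uk to bbc.co.uk.</p>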
      <p>On the left-hand side of Table 1 we enumerate the top-25
PLDs in terms of quadruples contributed to the BTC 2011
dataset. Notably, a large chunk of the dataset (∼64 %) is
provided by the hi5.com domain: a social gaming site that
exports a FOAF file for each user. As observed for similar
corpora (cf. [20, Table A.1]), hi5.com has many documents,
each with an average of over two thousand statements (an
order of magnitude higher than most other domains),
leading to it dominating the overall volume of BTC statements.</p>
      <p>The dominance of hi5.com, and to a lesser extent of
similar sites like livejournal.com, shapes the overall
characteristics of the BTC 2011 dataset. To illustrate one
prominent such example, Figure 1a gives the distribution of
statements per document in the BTC dataset on log/log scale,
where one can observe a rough power-law(-esque)
characteristic. However, there is an evident three-way split in the
tail emerging at about 120 statements, and ending in an
outlier spike at around 4,000 statements. By isolating the
distribution of statements-per-document for hi5.com in
Figure 1b, we see that it contributes to the large discrepancies
in that interval. The stripes are caused by periodic patterns
in the data, due to its uniform creation: on the hi5.com
domain, RDF documents with a statement count of 10 + 4f
are heavily favoured, where ten triples form the base of a
user's description and four triples are assigned to each of f
friends. Other lines are formed due to two optional fields
(foaf:surname/foaf:birthday) in the user profile, giving a
9 + 4f and 8 + 4f periodicity line. An enforced ceiling of
f ≤ 1,000 friends explains the spike at (and around) 4,010.</p>
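      <p>The periodicity just described can be captured in a toy model of hi5.com document sizes (the function below is our own reconstruction from the observed pattern, not the actual export code): a base description of ten triples, four triples per friend capped at 1,000 friends, and one triple fewer for each absent optional field.</p>

```python
def hi5_statement_count(friends, has_surname=True, has_birthday=True):
    """Toy model of statements per hi5.com FOAF profile: a ten-triple
    base description, four triples per friend (the site caps profiles
    at 1,000 friends), and one triple fewer per absent optional field
    (foaf:surname, foaf:birthday)."""
    base = 10 - (0 if has_surname else 1) - (0 if has_birthday else 1)
    return base + 4 * min(friends, 1000)
```

      <p>This reproduces the 10 + 4f, 9 + 4f and 8 + 4f lines, and the ceiling yields the spike at 10 + 4 · 1,000 = 4,010 statements.</p>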
      <p>
        The core message here is that although the BTC offers a
broad view of the Web of Data, covering 791 domains, in
absolute statement-count terms the dataset is skewed by a few
high-volume exporters of FOAF, and in particular hi5.com.
When deriving global statistics and views from the BTC, the
results say more about the code used to generate hi5.com
profiles than about the efforts of thousands of publishers.
(Furthermore, hi5.com is not even a prominent domain on
the Web of Data in terms of being linked, and was ranked
179 of 778 domains in a PageRank analysis of a similar corpus;
see http://aidanhogan.com/ldstudy/table21.html.) This
is also a naturally occurring phenomenon in other corpora
(e.g., [
        <xref ref-type="bibr" rid="ref12 ref20">12, 20</xref>
        ]) crawled from the Web of Data, not just
isolated to the BTC dataset(s), and is not easily fixed. One
option to derive meaningful statistics about the Web of Data
from such datasets is to apply (aggregated) statistics over
individual domains, and never over the corpus as a whole.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4.2 CKAN/LOD cloud metadata</title>
      <p>
        In contrast to the crawled view of the Web of Data, the
CKAN repository indexes publisher-reported statistics about
their datasets. These CKAN metadata are then used to
decide eligibility for entry into the LOD cloud [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: a highly
prominent depiction of Linked Open Datasets and their
interlinkage. A CKAN-reported dataset is listed in the LOD
cloud if and only if it fulfils the following requirements: the dataset has
to (1) be published according to core Linked Data principles,
(2) contain at least one thousand statements, and (3) provide
at least 50 links to other LOD cloud datasets (see
http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation).
      </p>
      <p>Given the shortcomings of the crawled perspective on the
Web of Data, we explore these self-reported metadata to
get an alternative view on what we should be sampling. On
September 29, 2011, we downloaded the meta-information
for the datasets listed in the lodcloud group on CKAN
(http://thedatahub.org/group/lodcloud).
The data contain example URIs for each dataset and statistics
such as the number of statements. We discovered metadata
for 297 datasets, spanning 206 FQDNs and 133 PLDs. On
the right-hand side of Table 1, we enumerate the top-25
largest reported datasets in the lodcloud group on CKAN.
Note that where multiple datasets are defined on the same
domain, the triple count is presented as the summation of
said datasets. In this table, we see a variety of domains
claiming to host between 9.8 billion and 94 million triples.</p>
      <p>Regarding the data formats present in the LOD cloud,
most of the datasets claim to serve RDF/XML data (85 %),
while 4 % claim to serve RDFa (of which 50 % did not also
offer RDF/XML). This shows the popularity of RDF/XML,
but supporting only RDF/XML will still miss out on 15 %
of datasets. However, the syntax metadata are somewhat
unreliable, where improper MIME types are often reported.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3 BTC vs. CKAN/LOD</title>
      <p>Finally, we contrast the two different perspectives of the
Web of Data. Between both, there are 854 PLDs mentioned,
with BTC covering 791 domains (92.6 %), CKAN/LOD
covering 133 domains (15.6 %), and the intersection of
both covering 70 domains (8.2 % overall; 8.8 % of BTC;
52.6 % of CKAN/LOD). CKAN/LOD reports a total of
28.4 billion triples, whereas the BTC (an incomplete crawl)
accounts for 2.1 billion quadruples (7.4 %). However, only
384.3 million quadruples in the BTC dataset (17.9 %) come
from PLDs mentioned in the extracted CKAN/LOD
metadata.
[4: Furthermore, hi5.com is not even a prominent domain on
the Web of Data in terms of being linked, and was ranked
179/778 domains in a PageRank analysis of a similar corpus;
http://aidanhogan.com/ldstudy/table21.html]
[5: http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation]
[6: http://thedatahub.org/group/lodcloud]</p>
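These proportions follow directly from the raw counts; as a quick check, the percentages can be recomputed from the set sizes reported above:

```python
# Reproduce the BTC vs. CKAN/LOD overlap percentages from the raw
# counts reported in the text: 854 PLDs mentioned in total, 791 in
# the BTC crawl, 133 in the CKAN/LOD metadata, and 70 in both.
total_plds = 854
btc_plds = 791
ckan_plds = 133
both_plds = 70

print(f"BTC coverage:      {100 * btc_plds / total_plds:.1f} %")   # 92.6 %
print(f"CKAN/LOD coverage: {100 * ckan_plds / total_plds:.1f} %")  # 15.6 %
print(f"Intersection:      {100 * both_plds / total_plds:.1f} %")  # 8.2 %
print(f"... of BTC:        {100 * both_plds / btc_plds:.1f} %")    # 8.8 %
print(f"... of CKAN/LOD:   {100 * both_plds / ckan_plds:.1f} %")   # 52.6 %
```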
      <p>In Table 1, we present the BTC and CKAN/LOD
statement counts side-by-side. We can observe that a large
number of high-volume BTC domains are not mentioned on
CKAN/LOD: the datasets in question may not publish
enough RDF data to be eligible for CKAN/LOD, may
not follow Linked Data principles or have enough external
links, or may not have self-reported.</p>
      <p>
        Perhaps more surprisingly however, we note major
discrepancies in terms of the catchment of BTC statements
versus CKAN/LOD metadata. Given that BTC can only
sample larger domains, a lower statement count is to be expected
in many cases: however, some of the largest CKAN/LOD
domains do not appear at all. Reasons can be found through
analysis of the BTC 2011's publicly available access log [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
In Table 2, we present reasons for the top-10 highest-volume
CKAN/LOD data providers not appearing in the BTC 2011
dataset (i.e., providers marked as absent on the right-hand
side of Table 1). Robots indicates that crawling was
prohibited by robots.txt exclusions; Http-401 and Http-502
indicate the result of lookups for URIs on that domain;
Mime indicates that the content on the domain did not
return application/rdf+xml, used as a heuristic in the BTC
crawl to filter non-RDF/XML content; Unreachable
indicates that no lookups were attempted on URIs from that
domain; Other refers solely to europeana.eu, which
redirected all requests to their home page.
      </p>
      <p>In summary, we see two quite divergent perspectives on
the Web of Data, given by the BTC 2011 dataset and the
CKAN/LOD metadata. Towards getting a better picture
of what population we wish to sample for the monitoring
experiments, and towards deciding on a sampling
methodology, the pertinent question is then: Which perspective
best suits the needs of our monitoring experiment?
As enumerated in Table 3, both perspectives have
inherent strengths and weaknesses. As such, we believe that our
sampling method should try to take the best of both
perspectives, towards a hybrid view of the Web of Data.</p>
    </sec>
    <sec id="sec-8">
      <title>5. SAMPLING METHODOLOGY</title>
      <p>Due to the size of the Web and the need for frequent
snapshots, sampling is necessary to create an appropriate
collection of URIs that can be processed and monitored under the
given time, hardware and bandwidth constraints. The goal
of our sampling method is thus two-fold: to select a set of
URIs that (i) capture a wide cross-section of domains and (ii)
can be monitored in a reasonable time given our resources
and in a polite fashion. Given the previous discussion, we
wish to combine the BTC/crawled and CKAN/metadata
perspectives when defining our seed-list.</p>
      <p>Before we continue to describe the sampling methodology
we choose, it is worthwhile to first remark upon sampling
methods used in other dynamicity studies of the Web.
</p>
      <p>Published Sampling Techniques. There are several
published works that present sampling techniques in order to
create a corpus of Web documents that can be monitored
over time. Having studied a variety of major works, we
could not find common agreement on a suitable sampling
technique for such purposes. The targeted use-cases and
research questions directly affect the domain and number of
URIs, as well as the monitoring frequency and time frame.</p>
      <p>[Table 3 summary] BTC pros: covers more domains (791);
empirically validated; includes vocabularies; includes
decentralised datasets. BTC cons: influence of high-volume
domains; misses 47.4 % of LOD/CKAN domains. LOD/CKAN
pros: domains pass "quality control"; community validated.
LOD/CKAN cons: covers fewer domains (133); self-reported
statistics; misses vocabularies; misses decentralised
datasets.</p>
      <p>
        Grimes and O'Brien [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] studied the dynamics of very
frequently changing pages and prepared their seed list
accordingly: initially starting from a list of URIs provided by a
Google crawl, they performed two crawls a day and selected
the most dynamic URIs (in terms of content changes) that
could also be successfully accessed 500 times in a row. As
such, the authors focus on monitoring stable and dynamic
Web documents. Fetterly et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] randomly sampled URIs
from a crawl seeded from the Yahoo! homepage to cover
many different topics and providers, giving broad coverage
for a general-purpose study; however, the surveyed
documents would be sensitive to the underlying distribution of
documents in the original Yahoo! corpus. Cho and
Garcia-Molina [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] studied the change frequency of pages using a
dataset which was sampled by selecting the root URIs of
the 270 most popular domains from a 25 million web page
crawl and then crawling three thousand pages for each of
the domains; this method provides a good, broad balance of
documents across different domains.
      </p>
      <p>Our sampling method. We can conclude that existing
sampling methods select URIs from crawled documents, either
randomly, because of specific characteristics (e.g., dynamic
or highly ranked), or to ensure an even spread across
different hosts. Thus, we decide to use a combination of these
three methods to generate our list of URIs.</p>
      <p>
        Given the previous discussion of Section 4, we start with
an initial seed list of URIs taken from: (1) the registered
example URIs for the datasets in the LOD cloud and (2) the
most popular URIs in the BTC dataset of 2011. The most
popular URIs are selected based on a PageRank analysis of
the documents in the BTC 2011 dataset, where we select the
top-k ranked documents from this analysis (please see [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
for details); note that many of the top ranked documents
refer to commonly instantiated vocabularies on the Web of
Data, which are amongst the most linked/central Linked
Data documents. At the time of access, we found 220
example URIs in the CKAN/LOD registry, and we complement
them with the top-220 document URIs from the BTC 2011
to generate a list of 440 core URIs for monitoring. The core
URIs contain 137 PLDs, 120 from the CKAN/LOD
examples and 37 from the most popular BTC URIs. This selection
guarantees that we cover all relevant domains (similar to [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ])
and to also consider the most popular and interlinked URIs
on the Web of Data (similar to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
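The construction of the core list can be sketched as follows; the function and the toy input lists are illustrative assumptions, not part of the actual tooling:

```python
def build_core_seed_list(ckan_example_uris, btc_uris_by_rank, k=220):
    """Merge the registered CKAN/LOD example URIs with the top-k
    PageRank-ranked BTC document URIs into a single core seed list
    (order-preserving, duplicates removed). A sketch of the selection
    described in the text; the input structures are assumptions."""
    core = list(ckan_example_uris)        # the 220 registered example URIs
    core += list(btc_uris_by_rank)[:k]    # the top-220 BTC documents
    seen, merged = set(), []
    for uri in core:
        if uri not in seen:
            seen.add(uri)
            merged.append(uri)
    return merged

# Toy usage: two overlapping lists yield a deduplicated core list.
ckan = ["http://example.org/dataset/a", "http://example.org/dataset/b"]
btc = ["http://example.org/dataset/b", "http://example.org/vocab/foaf"]
print(build_core_seed_list(ckan, btc, k=220))
```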
      <p>Obviously, 440 seed URIs are insufficient to resolve a
meaningful corpus for observation over time. Thus, we decide to
use a crawl and expand outwards from these core URIs to
find other documents to monitor in their vicinity.
Importantly, we wish to stay in the close locale of the 440 core
URIs; if we go further, we will encounter the same
problems as observed for the BTC 2011 dataset, where the data
are skewed by a few high-volume exporters. To avoid being
diluted by, e.g., hi5.com data and the likes, we thus stay
within a 2-hop crawl radius from the core URIs. From the
data thus extracted, we then sample a final set of extended
seed URIs to monitor. The result is then our best-effort
compromise to achieve representative snapshots of Linked
Data that (i) take into account both views on Linked Data
by including CKAN and BTC URIs in the core seed list,
(ii) extend beyond the core seed list in a defined manner (2
hops), and (iii) do not exceed our crawling resources.
Crawling setup. The crawling setup will have a significant
effect on the selection of URIs to monitor, and so
we provide some detail thereupon. Our implementation is
based on two open-source Java projects: (1) LDSpider
(http://ldspider.googlecode.com/), a multi-threaded
crawling framework for Linked Data, and (2) any23
(http://any23.org/), a parsing framework for various RDF
syntaxes. The experiments are intended to run on a
dedicated, single-core</p>
      <sec id="sec-8-1">
        <p>
          2.2GHz Opteron x86-64, 4GB RAM on a university network.
Thereafter, we use the following configuration:
All RDF syntaxes: unlike BTC crawls, which only
consider RDF/XML, we wish to consider all standard
serialisation formats for RDF (including Turtle, which is
soon to be standardised), as supported by any23.
Further, we do not pre-filter content based on
Content-type reporting or file extensions. RDFa is becoming
a preferred format for many publishers: when
parsing RDFa, we monitor the output statements and
exclude the content of HTML documents for which we
find only "accidental" triples as extracted from titles,
stylesheets, icons, etc., and instead only consider
documents that intend to publish non-trivial RDFa.
Threads: multithreading keeps the hardware busy while
slow HTTP requests are being processed; in previous
work [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], we found 64 threads to offer the best tradeoff
between parallelism and CPU/disk contention.
        </p>
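The "accidental triples" heuristic for RDFa documents can be sketched roughly as below; the set of trivial predicates here is an illustrative assumption, not the exact filter used in the described setup:

```python
# Sketch of the RDFa filtering heuristic: keep only documents whose
# triples go beyond what is trivially extracted from every HTML page
# (titles, stylesheet links, icons, etc.). The predicate list is an
# illustrative assumption, not the actual filter configuration.
TRIVIAL_PREDICATES = {
    "http://purl.org/dc/terms/title",
    "http://www.w3.org/1999/xhtml/vocab#stylesheet",
    "http://www.w3.org/1999/xhtml/vocab#icon",
}

def publishes_nontrivial_rdfa(triples):
    """triples: iterable of (subject, predicate, object) tuples.
    True if at least one triple uses a non-trivial predicate."""
    return any(p not in TRIVIAL_PREDICATES for (_, p, _) in triples)

# An HTML page whose only triples come from its <title> and
# stylesheet links would be excluded from the corpus:
accidental = [
    ("http://example.org/", "http://purl.org/dc/terms/title", "Home"),
    ("http://example.org/", "http://www.w3.org/1999/xhtml/vocab#stylesheet",
     "http://example.org/style.css"),
]
print(publishes_nontrivial_rdfa(accidental))  # False
```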
        <p>Timeouts: we terminate unresponsive connections and
sockets after 120 seconds. Timeouts are kept deliberately
high to help ensure stable crawls.</p>
        <p>Links: we consider all URIs contained in the RDF data as
potential links to follow (and not, e.g., only the values
of rdfs:seeAlso or the like).</p>
        <p>
          Breadth-first: we crawl documents in a round-based URI
scheduling approach, which should result in a broader
set of diverse data (assuming a diverse seed-list) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
Redirects: 301, 302, 303 and 307 HTTP response codes are
not treated as links, but are instead followed directly
in the same round until we reach a non-redirect response
(or hit a cycle/path-limit).
        </p>
        <p>Per-domain Queue: our crawler queue is divided into
individual per-domain queues, which are polled
round-robin: this helps ensure a maximal delay between
successive accesses to the same domain.</p>
        <p>Priority Queue: within each individual per-domain queue,
we prioritise URIs for which we have found the most
links thus far. (This only affects the crawl if an
incomplete round is performed.)
Politeness Policy: we implement a politeness delay of two
seconds, meaning that we do not access the same PLD
twice within that interval; further, for each domain, we
retrieve and respect standard robots.txt exclusions.</p>
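A minimal sketch of such a per-domain, round-robin frontier with a politeness delay could look as follows; the naive host extraction stands in for a proper public-suffix-based PLD function, and the class itself is illustrative, not LDSpider's implementation:

```python
import collections
import time

class PoliteFrontier:
    """Sketch of a per-domain queue with a round-robin politeness
    policy: URIs are grouped by domain, domains are polled in turn,
    and the same domain is never accessed twice within the delay."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.queues = collections.OrderedDict()  # domain -> deque of URIs
        self.last_access = {}                    # domain -> last poll time

    def add(self, uri):
        domain = uri.split("/")[2]  # naive host extraction (placeholder)
        self.queues.setdefault(domain, collections.deque()).append(uri)

    def next_uri(self, now=None):
        """Return the next URI whose domain respects the politeness
        delay, rotating that domain to the back of the round-robin
        order; return None if every queued domain is still 'hot'."""
        now = time.monotonic() if now is None else now
        for domain in list(self.queues):
            if now - self.last_access.get(domain, -self.delay) >= self.delay:
                uri = self.queues[domain].popleft()
                if not self.queues[domain]:
                    del self.queues[domain]
                else:
                    self.queues.move_to_end(domain)  # round-robin rotation
                self.last_access[domain] = now
                return uri
        return None

# Toy usage: two domains alternate; a second hit on the same domain
# within the two-second delay yields None.
f = PoliteFrontier(delay=2.0)
f.add("http://a.example/x")
f.add("http://a.example/y")
f.add("http://b.example/z")
print(f.next_uri(now=0.0))  # http://a.example/x
print(f.next_uri(now=0.0))  # http://b.example/z
print(f.next_uri(now=0.0))  # None (both domains within delay)
print(f.next_uri(now=2.0))  # http://a.example/y
```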
        <p>
          The minimum amount of time taken to complete a round
becomes the maximum number of active URIs for a single
domain, multiplied by the politeness delay. The
combination of this per-PLD delay and the distribution of documents
per domain on the Web [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] introduces the problem of PLD
starvation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]: the nature of the Web of Data means that
after numerous low-volume domains have been crawled, the
few remaining domains are not enough to keep the crawler
resources busy during the politeness delay. In the worst
case, when one PLD is left in the queue, only one URI can
be crawled every two seconds. However, the frontier (the
list of URIs for the next round) may contain a diverse set
of domains that can overcome this problem, and allow for
higher crawling throughput. Hence, we also add a
termination condition for each round: once (1) the seed URIs have
been processed, (2) all active redirects have been processed
and (3) the number of active PLDs remaining in the
per-domain queue drops under the number of threads (in our
setup 64), we end the current round and move to the next
round (shifting remaining URIs to the frontier).
        </p>
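The lower bound on round duration implied above can be stated as a one-line computation; the example queue size is hypothetical:

```python
def min_round_seconds(max_uris_per_pld, politeness_delay=2.0):
    """Lower bound on the duration of a crawl round: the busiest
    PLD's URIs can only be fetched one per politeness interval,
    regardless of how many threads are available."""
    return max_uris_per_pld * politeness_delay

# With the two-second delay used in the described setup, a round whose
# largest per-domain queue holds 1,000 URIs cannot finish in under
# roughly 33 minutes, however many threads are running:
print(min_round_seconds(1000) / 60)  # ~33.3 minutes
```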
        <p>Seed list. Starting from our list of 440 core URIs, we wish
to expand a 2-hop crawl using the outlined framework, from
which we will extract the final seed list of URIs to
monitor in our observation framework. However, due to the
unpredictability/non-determinism of remote connections, we
want to ensure a maximal coverage of the documents in this
neighbourhood. Along these lines, we repeated ten complete
2-hop crawls from our core URI list.</p>
        <p>With respect to the non-determinism, Figure 2 shows for
each crawl the number of documents (left bars/left y-axis)
and the number of statements (right bars/right y-axis). We
can observe that two crawls (crawl number 1 and 10) have
a significantly higher number of statements compared to
the other crawls. One reason for this large discrepancy
relates to the identi.ca domain, where a URI (referring to
Tim Berners-Lee's account; a highly ranked document in the
BTC dataset) in the seed round of crawls 1 and 10 offered
a rich set of links within that domain, whereas the lookup
failed in the other crawls, cutting off the crawler's path in
that domain: for example, in the first crawl, identi.ca
accounted for 1.5 million statements, whereas in crawl 2, the
domain accounted for 17 thousand statements. Such
examples illustrate the inherent non-determinism of crawling.</p>
        <p>[Figure 2: distribution of the number of documents per crawl.]</p>
        <p>In Figure 3, we show for each crawl the number of visited
PLDs per round together with the number of new PLDs
per round with respect to the previous round. The left bar
for each crawl represents Round 0, the middle bar Round 1,
and the right bar Round 2. We can observe that the relative
level of domains across the crawls is much more stable when
compared with the number of documents (cf. Figure 2).
Across rounds, the graph shows an average 1.3× increase of
active PLDs between Rounds 0–1, and a 3.4× increase
between Rounds 1–2. Further, we observe that 30 % of the
PLDs in Round 1 are new compared to the previous round
and roughly 70 % of the PLDs in Round 2 are not visited by
the crawler in the rounds before.</p>
        <p>Given the non-deterministic coverage of documents, to
ensure comprehensive coverage of URIs in the 2-hop
neighbourhood, we take the union of URIs that dereferenced to
RDF content, resulting in a total set of 95,737 URIs
spanning 652 domains, giving an average of 146.8 dereferenceable
URIs per domain. Figure 4 shows in log/log scale the
distribution of the number of PLDs (y-axis) against the number
of URIs in the union list (x-axis); we see that 379 PLDs
(58.1 %) have one URI in the list, 78 PLDs (12.0 %) have
two URIs, and so forth. In addition, Table 4 lists the
number of URIs for the top-10 PLDs in the set (represented by
the ten rightmost dots in Figure 4).</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>6. MONITORING CONFIGURATION</title>
      <p>The next step in the setup of our observatory is to
select the monitoring techniques and intervals we apply. Note
that we have yet to start the monitoring experiments, where
we now instead present some initial results and outline our
proposed methodology for feedback from the community.</p>
      <p>
        In general, there exist two fundamental monitoring
techniques. The first technique is to periodically download the
content of a fixed list of URIs, as widely used in the
literature [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ]; this technique allows one to study the evolution
of sources over time in a contiguous fashion. The second
technique is to periodically perform a crawl from a defined
set of URIs [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]; this technique is more suitable if one wants
to study the evolution of the neighbourhood network of the
seed URIs in an adaptive fashion, but can also introduce a
factor of randomness based on the crawling methods.
      </p>
      <p>We decided to again apply a hybrid approach:
primarily, we do not want to limit our observations to URIs online
at the start of the experiment, although they will still be
our focus. We thus take the set of 95,737 sampling URIs
extracted in the previous section as a kernel of
contiguous URIs accessed consistently in each snapshot. From the
kernel, we propose to crawl as many URIs again using the
crawler configuration outlined in the previous section,
forming the adaptive segment of the snapshot. Roughly half of
our snapshot would comprise the contiguous kernel,
reliably providing data about said URIs; the other half of
our snapshot would comprise the adaptive crawl,
reflecting changes in the neighbourhood of the kernel. We do not
limit PLDs in the adaptive crawl so as not to exclude data
providers that come online during the course of the
experiment. This setup allows for studying (i) dynamics within
the datasets, (ii) dynamics between datasets (esp. links),
and (iii) the growth of Linked Data and the arrival of new
sources (although to a lesser extent).</p>
      <p>
        Next, we must decide on the monitoring intervals for our
platform: how frequently we wish to perform our crawl. In
the literature, it is common to take the data snapshots in
either a daily [
        <xref ref-type="bibr" rid="ref14 ref22">22, 14</xref>
        ] or weekly [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 8, 7</xref>
        ] fashion. Again,
in a practical sense, the intervals are highly dependent on
the available resources, and the size of the seed list. Given
our resources and the monitoring requirements, we decided
to perform a full snapshot every week.
      </p>
      <p>In addition, to get more granular data in a temporal sense,
we propose to apply an adaptive scheduling policy that takes
more frequent snapshots for more dynamic data. As such,
we propose to set up different monitoring intervals within
the full weekly snapshots, where we increase the
monitoring frequency for a kernel document if it changes within two
consecutive snapshots of the previous interval. Figure 5
depicts the core idea. Using this adaptive approach, we can
avoid the local and remote computational overhead involved
in regularly polling documents that are observed to be very
static. At the moment, we consider fixing the maximum
number of intervals per week to 16, which resolves to a time
interval width of roughly 10 hours. However, the intervals will take
a minimum of a week to "warm up", and will probably take
longer to stabilise; thus, we can manually add further
intervals on-the-fly at a later stage once the experiments are
underway (if deemed useful from the results).</p>
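The doubling behaviour depicted in Figure 5 can be sketched as below; the function name and the halving step for documents that stop changing are illustrative assumptions:

```python
MAX_PER_WEEK = 16  # cap on snapshots per week (intervals of ~10 hours)

def next_frequency(per_week, changed):
    """Adaptive scheduling sketch: double a kernel document's weekly
    crawl frequency when its content changed between two consecutive
    snapshots at the current interval, up to the weekly cap. The
    halving step for documents that prove static is an illustrative
    assumption, not part of the described policy."""
    if changed:
        return min(per_week * 2, MAX_PER_WEEK)
    return max(per_week // 2, 1)

# A document that changes in three consecutive snapshots climbs from
# 1x to 8x per week, then drops back once it stops changing:
freq = 1
for changed in [True, True, True, False]:
    freq = next_frequency(freq, changed)
print(freq)  # 4
```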
      <p>Initial experiment. We conducted an initial experiment,
performing a crawl for the 95,737 URI kernel. Our
framework downloaded 16 million statements from 80,000
documents, taking a total of 6 hours and 40 minutes. Figure 6
shows the number of documents processed per crawl hour.</p>
      <p>[Figure 5: crawl frequency (1×, 2×, 4×, 8×, or 16× per week),
increased upon successive document content changes.]</p>
      <p>In this paper, we have presented ongoing work towards
building DyLDO: the Dynamic Linked Data Observatory.
This observatory aims to continuously gather a high-quality
collection of Linked Data snapshots for the community to
gain a better insight into the underlying principles of
dynamicity on the Web of Data. We motivated our proposals
based on several concrete use-cases and research questions
that could be tackled using such a collection, and presented
related works that treat various aspects of dynamicity for
HTML documents on the traditional Web. Next we looked
at the non-trivial question of what view we should adopt for
the Web of Data, introducing and comparing the BTC and
CKAN/LOD perspectives, showing how and where they
diverge, and weighing up their respective pros and cons. We
proposed selecting a kernel of sources to monitor around a
core set of URIs taken from BTC and CKAN/LOD datasets,
which are then extended by a 2-hop crawl. We also presented
the detailed crawl con guration we planned to use for our
monitoring experiments. Finally, we proposed our
methodology for performing the continuous observation of sources
in and around the kernel, as well as using adaptive intervals
to monitor more dynamic sources more frequently.</p>
      <p>We plan to begin the monitoring experiments in the next
few weeks, and to run them continuously and indefinitely.
We still have some practical issues to tackle in terms of
creating a backup and archiving system, a site offering access
to the community9, as well as remote fallback procedures
in the event of network or hardware failure. Thus, we are
at a crucial stage, and are keen to gather final feedback
and requirements from the community: we are particularly
anxious to hear from people who would have a specific
interest or use-case for such data (be it in terms of research
analysis, evaluation frameworks, etc.): what requirements
they have, and whether or not current proposals would be
sufficient. Significant changes will invalidate the snapshots
collected up to that point, so we want to gather comments
and finalise and activate the framework as soon as possible.
In the near future, the community can then begin to chart
a new dimension for the Web of Data: time.
[9: Planned for http://swse.deri.org/DyLDO/]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Connolly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dhanaraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hollenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sheets</surname>
          </string-name>
          . "Tabulator:
          <article-title>Exploring and Analyzing linked data on the Semantic Web"</article-title>
          .
          <source>In: SWUI</source>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          . "
          <article-title>DBpedia - A crystallization point for the Web of Data"</article-title>
          .
          <source>In: J. Web Sem. 7</source>
          .
          <issue>3</issue>
          (
          <issue>2009</issue>
          ), pp.
          <volume>154</volume>
          –
          <fpage>165</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          . "
          <source>The Semantic Web Challenge</source>
          ,
          <year>2010</year>
          ". In: J.
          <source>Web Sem. 9</source>
          .
          <issue>3</issue>
          (
          <issue>2011</issue>
          ), p.
          <fpage>315</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bray</surname>
          </string-name>
          . "
          <article-title>Measuring the Web"</article-title>
          .
          <source>In: Comput. Netw. ISDN Syst</source>
          .
          <volume>28</volume>
          (
          <issue>7-11</issue>
          <year>1996</year>
          ), pp.
          <volume>993</volume>
          –
          <fpage>1005</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brewington</surname>
          </string-name>
          and
          <string-name>
            <surname>G. Cybenko.</surname>
          </string-name>
          "
          <article-title>How dynamic is the Web?" In: Computer Networks (</article-title>
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brewington</surname>
          </string-name>
          and
          <string-name>
            <surname>G. Cybenko.</surname>
          </string-name>
          "
          <article-title>Keeping up with the changing web"</article-title>
          .
          <source>In: Computer 33.5</source>
          (
          <issue>2000</issue>
          ), pp.
          <volume>52</volume>
          –
          <fpage>58</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina.</surname>
          </string-name>
          "
          <article-title>Effective page refresh policies for Web crawlers"</article-title>
          .
          <source>In: ACM Transactions on Database Systems 28.4</source>
          (
          <issue>Dec</issue>
          .
          <year>2003</year>
          ), pp.
          <volume>390</volume>
          –
          <fpage>426</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>"Estimating frequency of change"</article-title>
          .
          <source>In: ACM Transactions on Internet Technology 3.3 (Aug</source>
          .
          <year>2003</year>
          ), pp.
          <volume>256</volume>
          –
          <fpage>290</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Coffman Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Weber</surname>
          </string-name>
          . "
          <article-title>Optimal robot scheduling for web search engines"</article-title>
          .
          <source>In: Journal of scheduling 1</source>
          (
          <year>1997</year>
          ), pp.
          <volume>0</volume>
          –
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          .
          <article-title>The Linking Open Data cloud diagram</article-title>
          . url: http://richard.cyganiak.de/2007/10/lod/ (visited on 02/06/
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Toupikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catasta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Tummarello.</surname>
          </string-name>
          "
          <article-title>A Node Indexing Scheme for Web Entity Retrieval"</article-title>
          .
          <source>In: ESWC (2)</source>
          .
          <year>2010</year>
          , pp.
          <volume>240</volume>
          –
          <fpage>256</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          .
          <article-title>"Characterizing the Semantic Web on the Web"</article-title>
          .
          <source>In: ISWC</source>
          .
          <year>2006</year>
          , pp.
          <volume>242</volume>
          –
          <fpage>257</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Douglis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mogul</surname>
          </string-name>
          . "
          <article-title>Rate of Change and other Metrics: a Live Study of the World Wide Web"</article-title>
          .
          <source>In: USENIX Symposium on Internetworking Technologies and Systems</source>
          (December
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fetterly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Manasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>"A large-scale study of the evolution of web pages"</article-title>
          .
          <source>In: WWW</source>
          .
          <year>2003</year>
          , pp.
          <fpage>669</fpage>
          –
          <lpage>678</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>"OWL: Yet to arrive on the Web of Data?"</article-title>
          .
          <source>CoRR</source>
          . url: http://arxiv.org/pdf/1202.0984.pdf (visited on 02/06/
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Grimes</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          .
          <article-title>"Microscale evolution of web pages"</article-title>
          .
          <source>In: WWW</source>
          .
          <year>2008</year>
          , pp.
          <fpage>1149</fpage>
          –
          <lpage>1150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>"MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data"</article-title>
          .
          <source>In: ISWC</source>
          .
          <year>2006</year>
          , pp.
          <fpage>258</fpage>
          –
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>"Executing SPARQL Queries over the Web of Linked Data"</article-title>
          .
          <source>In: ISWC</source>
          .
          <year>2009</year>
          , pp.
          <fpage>293</fpage>
          –
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          . Vol.
          <volume>1</volume>
          . Morgan &amp; Claypool,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kinsella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>"Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine"</article-title>
          .
          <source>In: J. Web Sem.</source>
          <volume>9</volume>
          .
          <issue>4</issue>
          (
          <year>2011</year>
          ), pp.
          <fpage>365</fpage>
          –
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vitter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          .
          <article-title>"Characterizing web document change"</article-title>
          .
          <source>In: Advances in Web-Age Information Management</source>
          (
          <year>2001</year>
          ), pp.
          <fpage>133</fpage>
          –
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          .
          <article-title>"What's new on the web? The evolution of the web from a search engine perspective"</article-title>
          .
          <source>In: WWW</source>
          .
          <year>2004</year>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oita</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          .
          <article-title>Deriving Dynamics of Web Pages: A Survey</article-title>
          .
          <source>INRIA TR.: inria-00588715</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamritham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <article-title>"Monitoring the dynamic web to respond to continuous queries"</article-title>
          .
          <source>In: WWW</source>
          .
          <year>2003</year>
          , pp.
          <fpage>659</fpage>
          –
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Passant</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          .
          <article-title>"sparqlPuSH: Proactive notification of data updates in RDF stores using PubSubHubbub"</article-title>
          .
          <source>In: Scripting and Development Workshop at ESWC</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Popitsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Haslhofer</surname>
          </string-name>
          .
          <article-title>"DSNotify - A solution for event detection and link maintenance in dynamic datasets"</article-title>
          .
          <source>In: J. Web Sem.</source>
          <volume>9</volume>
          .
          <issue>3</issue>
          (
          <year>2011</year>
          ), pp.
          <fpage>266</fpage>
          –
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <article-title>The Billion Triple Challenge</article-title>
          . url: http://challenge.semanticweb.org/ (visited on 02/06/
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>"Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources"</article-title>
          .
          <source>In: Proc. of LDOW at WWW</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Land</surname>
          </string-name>
          .
          <article-title>"Towards Understanding the Changing Web: Mining the Dynamics of Linked-Data Sources and Entities"</article-title>
          .
          <source>In: Proc. of KDML at LWA</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Parreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauswirth</surname>
          </string-name>
          .
          <article-title>"Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces"</article-title>
          .
          <source>In: DESWEB at ICDE</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Villazon-Terrazas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          .
          <article-title>"Dataset Dynamics Compendium: Where we are so far!"</article-title>
          .
          <source>In: Proc. of COLD at ISWC</source>
          . Shanghai,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>