Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources

Jürgen Umbrich, Michael Hausenblas, Aidan Hogan, Axel Polleres, Stefan Decker
Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
IDA Business Park, Lower Dangan, Ireland
firstname.lastname@deri.org

ABSTRACT

Datasets in the LOD cloud are far from being static in their nature and in how they are exposed. As resources are added and new links are set, applications consuming the data should be able to deal with these changes. In this paper we investigate how LOD datasets change and what sensible measures there are to accommodate dataset dynamics. We compare our findings with traditional, document-centric studies concerning the "freshness" of document collections, and propose metrics for LOD datasets.

Copyright is held by the author/owner(s). LDOW2010, April 27, 2010, Raleigh, USA.

1. INTRODUCTION

The Linked Open Data (LOD) movement has gained remarkable momentum over the past years. At the time of writing, well over one hundred datasets – including UK governmental data, the New York Times dataset, and LinkedGeoData – have been published, providing several billion RDF triples interlinked by hundreds of millions of RDF links. Some datasets, such as DBpedia, have been available from the very beginning of the LOD movement and regularly undergo changes on both the instance level and the schema level. New resources are added and old resources are removed; new links are set to other datasets, and old links are removed as the target has vanished. We should hence assume that datasets in the LOD cloud are dynamic in their very nature. Dataset dynamics is a term we recently coined [1], essentially addressing content and interlinking changes in Linked Data sources.

Our main contributions herein are: (i) we define dataset dynamics characteristics and how to measure them, and (ii) we compare the dataset dynamics of the LOD cloud to the traditional Web (the Web of HTML documents). The motivating use-case for our study of dataset dynamics is to gain insights into – and hopefully improve – concurrent work on an efficient system for performing live queries over the Linked Open Data Web [13]. However, aside from this use-case, having knowledge about dataset dynamics is essential for a number of tasks:

• web crawling and caching [9];

• distributed query optimisation [13];

• maintaining link integrity [16];

• servicing of continuous queries [22];

• replication and synchronisation [24].

We begin in Section 2 by reviewing existing work, and continue in Section 3 by discussing and contrasting document- vs. entity-centric perspectives concerning dynamics. Thereafter, in Section 4 we present the background of our analysis, in Section 5 we describe our methodology for analysing dataset dynamics, and in Section 6 we discuss the results of our analysis. Finally, in Section 7, we conclude and outline future work.

2. RELATED WORK

As motivated above, the study of changes in documents and data sets is very relevant for a broad range of application domains. Earlier work discussed analysis of the dynamics of the Web circa 2008, leveraging their findings for the optimisation of re-indexing techniques [6]. The work of Cho et al. provides a comprehensive study regarding the change frequency of Web documents: earlier work focussed on how to integrate this knowledge into an incremental crawler [8]; further work provided a detailed discussion of estimators of the frequency of changes given incomplete history [9]. Other research has focused on, for example, investigating the dynamics of Wikipedia articles [3] and the evolution of database schema over time [21].

With respect to the Semantic Web, some research regarding dynamics has been conducted with respect to analysing the evolution of ontologies in the life science community [15]. In [16] the authors reported on their work concerning DSNotify, a system for detecting and fixing broken links in LOD datasets.

However – and to the best of our knowledge – we are not aware of any published studies more generally regarding the change frequency of resources on the Linked Open Data Web, and thus deem the work herein to be novel.

3. DOCUMENTS VS. ENTITIES: DIFFERENT PERSPECTIVES ON LINKED DATA

There are various aspects of dataset dynamics which must be considered in order to achieve a comprehensive overview of how Linked Open Data changes and evolves on the Web. Firstly, the change frequency of data on the Web can vary significantly across datasets, from rather static sources – such as archives – to frequently-changing sources – for example, in the micro-blogging domain. Also, the change volume can range from small-scale updates – in our case, updates involving a low number of triples – to bulk updates, which potentially affect many resources. One must also pay attention to the perspective one takes on resources: that is, whether we are interested in local changes of particular datasets, or are interested in global changes with respect to what is said about a URI in all accessible linked datasets.

Before we continue, however, we must first provide some preliminaries. Firstly, our notion of a 'document' refers to an atomic Web 'container' in which Linked Data is typically exposed: these include RDF/XML documents, (X)HTML+RDFa documents, etc. Secondly, we often refer to an 'entity', by which we intuitively mean anything identified by a URI in Linked Data, including classes, properties, and the "real-world artefacts" described. (Note that in this paper we currently overlook entities 'identified' by blank nodes; concretely, blank-node entities do not have consistent naming, which has adverse consequences for the analysis presented in Section 5.3.) Following from both, we can now distinguish the following perspectives in dataset dynamics:
1. A document-centric perspective, which focuses on datasets and is motivated by the "traditional" Web as well as the REST community [12, 2];

2. An entity-centric perspective, which focuses on entities as described in the Linked Open Data Web [5] – we further separate the entity-centric perspective into:

(a) an entity-per-document perspective, which takes into account occurrences of an entity with respect to a specific document;

(b) a global entity perspective, which takes into account all appearances of an entity across the Web.

In particular, the entity-centric perspectives are more LOD-specific than the document-centric perspective prevalent in more traditional views on dataset dynamics. Many applications operating on the LOD cloud assume an entity-centric view where entities become the unit of knowledge, and data on such entities are aggregated from multiple documents. Also, LOD documents may be dynamically served from an entity-centric index (e.g., a SPARQL endpoint), whereby a change in one entity may entail changes in many documents. Thus, we believe the distinction between the document- and entity-centric perspectives to be important for our purposes herein.

In fact, the global entity perspective may be infeasible to monitor, as arbitrary new sources can publish data about any entities. For this reason – and despite formally discussing perspective 2b herein – note that in the present work we will focus on the analysis of perspectives 1 and 2a, and leave approximative techniques for the analysis of 2b as part of our future research (discussed in Section 7).

Despite the two distinct perspectives, both are related: there is naturally a relation between entities and their appearances in different containers. Along these lines, Figure 1 depicts a typical distribution of entities per document in the LOD cloud. As we have already shown elsewhere [17], this distribution follows a power law.

Figure 1: A typical distribution of entities in documents.

In order to formalise what we mean by these different perspectives, let R = {r1, ..., rn} be the set of all resources as per the Architecture of the World Wide Web [19]: that is, HTTP entities and documents. Further, we define D = {d1, ..., dn}, D ⊂ R, as the set of all documents (i.e., dereferenceable entities that point to RDF data), and E = {e1, ..., en}, E ⊂ R, as the set of all entities. A document d can mention various entities; thus, we denote the set of entities mentioned in document d as E(d) ⊆ E, and likewise the set of all documents mentioning e as D(e) ⊆ D. Further, let ver(d, t) be the state of document d at time-point t – i.e., the RDF graph served by d at time t. It is clear that different use cases require specific state functions ver(d, t) and equality measures; e.g., a state function could be the hash value of the RDF graph, a set of RDF statements, or the set of inferable new statements.

Then, the document change function of document d from time t to t′ (where t < t′) is defined as follows:

Definition 1 (Document Change Function).

    C_d(t, t′) = 0 if ver(d, t) = ver(d, t′); 1 otherwise.

Likewise, we define the entity-per-document change function as follows:

Definition 2 (Entity-per-document Change Function).

    C_d^e(t, t′) = 0 if ver(d, t) ∩ e = ver(d, t′) ∩ e; 1 otherwise,

where by G ∩ u we denote all triples in graph G mentioning u. Finally, the entity change function can be defined as follows:

Definition 3 (Entity Change Function).

    C^e(t, t′) = max_{d ∈ D(e)} (C_d(t, t′)).

Please note that we pursue a purely 'syntactic' notion of change, and do not consider more advanced notions relating to 'semantic' change: for example, we would consider a change in a datatype literal if the syntax of that literal changes even though the semantic interpretation does not – this change would then propagate to the respective entity/document despite no real change on the semantic level. Further, we do not consider any forms of reasoning over the changes – e.g., we do not propagate changes in a class definition as changes to its member entities. We leave further discussion and related analysis of 'semantic vs. syntactic change' for future work.

It may also be interesting to consider more closely the relationship between documents and the entities they contain, examining separately the change function of entities which are considered 'local' with respect to the document they appear in. To this end, we introduce the term local entity, meaning an entity in a document whose pay-level domain (PLD) is the same as the document's PLD: here, a PLD is defined as any domain that requires payment at a top-level-domain (TLD) or country-code TLD registrar [20]. Taking an example, let PLD(uri) be the PLD extraction function; then:

    PLD(http://www.deri.ie/) = deri.ie

We can now define a local entity as follows:

Definition 4 (Local Entity). We define the set of local entities E_local(d) of document d as

    E_local(d) = {e ∈ E(d) | PLD(e) = PLD(d)}.

Definition 4 is closely related to a similar notion defined in [7], which defines locality based on the correspondence of hostnames. Note that, according to this definition, an entity may be local to several documents, which may not always be desirable. Alternatively, one could focus on the authoritative relationship between entities and documents, whereby the document an entity redirects to is the authoritative document for that entity [18]. In this paper, we currently only consider the locality relationship between documents and entities, and plan to investigate stronger notions such as authoritativeness in future work.

4. CHANGE DETECTION MECHANISMS

So far, we have focused on identifying and formalising different notions of change – particularly change functions – as a foundational aspect of dataset dynamics. We now discuss how such changes can be detected; one can group change detection mechanisms as follows:

• HTTP-metadata monitoring: analysis of HTTP response headers – including datestamp and ETag [11] – to detect whether something has changed;

• content monitoring: fetching the entire content and determining locally what has changed;

• notification: active notification by a data source that something has changed (ideally what has changed) [16].

                Content    HTTP       Notification
  availability  +          ± [10]     ± [16]
  reliability   +          ± [10]     unknown
  costs         high       low        unknown
  scalability   high       high       unknown
  documents     yes        yes        yes
  entities      no         partially  yes

Table 1: Change detection mechanisms' aspect matrix.

Table 1 summarises aspects of the aforementioned change detection mechanisms. The aspects – motivated by [10] – are as follows: (i) availability, meaning whether the respective solution is available out-of-the-box in currently deployed systems on the Web; (ii) reliability, referring to the ability to correctly capture all changes; (iii) costs, referring to the resources needed for the approach (in terms of bandwidth, storage, etc.); and (iv) scalability with respect to the number of involved data publishers (in terms of infrastructure) and consumers (concerning, for example, the number of concurrent "subscribers" in a notification system). Further, we have included two Linked Data specific aspects in Table 1: (v) support for document-centric change detection, and (vi) support for entity-centric change detection.

Both content and HTTP metadata monitoring mechanisms are well studied, and discussion about those is available elsewhere (cf. [10, 11]). The characteristics of Web-scale notification mechanisms – especially concerning reliability, costs, and scalability – are subject to research at the time of writing. However, there are some remarkable implementation and standardisation efforts ongoing, including but not limited to:

• online services (e.g., http://www.changedetection.com/);

• earlier efforts for a lightweight notification standard: for instance the Event Notification Protocol (ESN) (see "Requirements for Event Notification Protocol" [23]);

• pubsubhubbub: a simple, open, server-to-server webhook-based pubsub (publish/subscribe) protocol as an extension to Atom and RSS (http://code.google.com/p/pubsubhubbub/).

5. METHODOLOGY

To the best of our knowledge, this is the first study regarding the dynamics of documents and entities of the Linked Open Data Web. Hence, the methodologies used in our evaluation are inspired by related work for Web documents. Specifically, we applied similar evaluation methods – and indeed try to answer similar questions – as presented in [8]. The experiments require a large data set which is constantly monitored over a long timespan in order to draw significant conclusions: we are not aware of any significant, heterogeneous, and publicly available dataset of Linked Open Data resources which includes a complete history of changes.
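As a reference point for the evaluation that follows, the change functions of Definitions 1–3 and the local-entity filter of Definition 4 can be sketched concretely. The sketch below is a minimal Python illustration of our own devising (all helper names are ours): an RDF graph is modelled as a set of (subject, predicate, object) string triples, and the simplified pld() merely keeps the last two host labels rather than consulting a registrar-based public-suffix list.

```python
from urllib.parse import urlparse

def doc_change(graph_t, graph_t2):
    """Definition 1: 0 if the graph served by d is unchanged, else 1."""
    return 0 if graph_t == graph_t2 else 1

def restrict(graph, e):
    """G ∩ e: all triples in graph G mentioning the entity URI e."""
    return {t for t in graph if e in t}

def entity_per_doc_change(graph_t, graph_t2, e):
    """Definition 2: compare only the triples mentioning e."""
    return 0 if restrict(graph_t, e) == restrict(graph_t2, e) else 1

def entity_change(snaps_t, snaps_t2, docs_of_e):
    """Definition 3: e changed if any document mentioning e changed."""
    return max(doc_change(snaps_t[d], snaps_t2[d]) for d in docs_of_e)

def pld(uri):
    """Simplified pay-level-domain extraction: keeps the last two host
    labels, so pld('http://www.deri.ie/') = 'deri.ie'. A faithful
    implementation would consult the public suffix list."""
    return ".".join(urlparse(uri).netloc.split(".")[-2:])

def local_entities(doc_uri, entities):
    """Definition 4: entities sharing the document's PLD."""
    return {e for e in entities if pld(e) == pld(doc_uri)}
```

For instance, a snapshot pair that only alters triples not mentioning an entity e yields a document change (C_d = 1) but no entity-per-document change for e (C_d^e = 0).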
Nevertheless, we have access to such a dataset collected for an extended period in early 2009; although the dataset was originally collected for a different purpose – and thus, as we will see, is not as suitable for our analysis as a bespoke corpus might be – we can derive some illustrative statistics which give some early insights into the dynamic nature of Linked Data on the Web. (Notably, this dataset was already studied by Biessmann et al. [4] with respect to dependency dynamics between the people described in the data set.) Next, we describe how this dataset was monitored and which methods we use for our evaluation.

5.1 Monitoring

To gain first insights about the dynamics of resources of the Linked Open Data Web, we analyse 24 data dumps collected by weekly snapshots of the 7-hop neighborhood of Tim Berners-Lee's FOAF file (http://www.w3.org/People/Berners-Lee/card). The weekly snapshots were collected using the MultiCrawler framework [14], with the following steps applied in each crawl cycle:

1. gathering the content of a list of URIs;

2. parsing of RDF/XML content;

3. extracting all URIs at the subject and object position of a triple;

4. shuffling the list of extracted URIs;

5. applying a per-domain limit for the URIs (5,000 URIs per PLD).

Please note that steps 4 and 5 were done for politeness reasons, to prevent too many parallel HTTP requests to one server: these steps introduce a non-deterministic element into our crawl, and thus we did not monitor a fixed list of URIs every week. Indeed, this passive monitoring makes change frequency analysis more challenging [9]. We have to deal with an incomplete history of sources, wherein it is very likely that many sources appear only once in the snapshots – thus, we sometimes present statistics which use only a small subset of the total dataset: the subset derived from sources that were available in more than 20 of the 24 snapshots.

5.2 Data Corpus

The data collection was performed over 24 weeks starting from the 2nd of November 2008, and contains 550K RDF/XML documents with a total of 3.3M unique subjects (∼6 entities appearing in the subject position per source), with 2.8M locally defined entities as per our Definition 4.

5.3 Change Detection Function

The change detection of a document C_d(t, t′) or entity C_d^e(t, t′) between two snapshots t, t′ is a trivial task as long as the statements of the resource do not contain blank nodes [24]. For our preliminary evaluation, we used a simple change detection algorithm – based on a merge-sort scan over the weekly snapshots – as follows:

1. skolemise blank nodes within a document;

2. sort all statements relevant for the change detection of a document or entity by their syntactic natural order (subject–predicate–object–[context]);

3. perform pairwise comparison of the statements by scanning two snapshots in linear time;

4. trigger a detection of change (either w.r.t. a document or an entity) as soon as the order of the statements differs between two snapshots (e.g., new statements were added or removed).

5.4 Evaluation

In this subsection, we describe in detail the evaluation we performed on the data set.

Document-centric evaluation. Firstly – and as a baseline – we performed a document-centric evaluation, which allows us to compare our results with earlier studies about HTML documents. For this study, we compute the changes of a document as defined in Definition 1.

Entity-centric evaluation. Secondly, we studied the change frequency of entities from an entity-per-document perspective as defined in Definition 2. In fact, more accurately, we analysed the change frequency from a local-entity-per-document perspective – a notion which follows intuitively from Definitions 2 and 4: to detect a change in an entity C_d^{e_local}(t, t′), we compare only the statements which 1) are contained in documents whose URIs match on the PLD level with the entity URI, and 2) in which the entity URI appears. Thus, we consider only the changes from documents in the locality of the entity as defined in Definition 4.

5.5 Change Process – A Poisson Process

Finally, for the purposes of comparison, we use an established model for changes of Web documents. Previously published studies [8] report that changes in Web documents can be modeled as a Poisson process (Equation 1). Poisson processes are used – for example – to model arrival times of customers, the times of radioactive emissions, or the number of sharks appearing on a beach in a given year. The model allows one to calculate the probability of a number of events occurring in a fixed period of time, given that (i) the events are independent of the time elapsed since the last event, and (ii) the events occur with a known average frequency rate λ. The parameter λ is the expected number of 'events' or 'arrivals' per the required unit of time (in our case, a week). Further, let N(t + τ) − N(t) be the number of changes in an interval (t, t + τ], with τ given as a number of weeks – to take an example in our scenario, if an entity e changed five times in a window of the last 10 weeks, then N_e(14 + 10) − N_e(14) = 5. Finally, let k be the number of occurrences of a document or entity in the total monitoring time (24 weeks in our case). Then, according to the Poisson process, the probability of k events occurring within a given interval (t, t + τ] is given as:

    P[N(t + τ) − N(t) = k] = ((λτ)^k / k!) · exp(−λτ),  for k = 0, 1, 2, ...    (1)
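To make Equation 1 concrete, a small sketch (the function and variable names are ours):

```python
import math

def poisson_change_prob(k, lam, tau):
    """Equation 1: P[N(t + tau) - N(t) = k] for a resource with an
    average change rate `lam` (changes per week) observed over an
    interval of `tau` weeks."""
    return (lam * tau) ** k / math.factorial(k) * math.exp(-lam * tau)

# A document with an average change frequency of 4 weeks has
# lam = 1/4 changes per week; the probability of observing exactly
# one change in a 4-week window is then e^-1 (about 0.37).
p = poisson_change_prob(1, 0.25, 4)
```

Summing the probabilities over all k for a fixed λ and τ yields 1, as expected for a probability distribution.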
6. FINDINGS

In this section we present several early findings about the change frequency of resources on the Linked Data Web. Firstly, we examine the usage of the Etag and Last-Modified HTTP header fields, followed by an analysis of various dynamic aspects which are aligned with the studies of the traditional Web in [8].

6.1 Usage of Etag and Last-Modified

One way to detect changes is to use the information contained in HTTP response headers, as discussed in Table 1. The HTTP protocol offers two header fields to indicate a change of a document, viz. the Etag and Last-Modified fields. Using such methods of change detection is more economical in that it avoids the need for content sniffing. We verified the usage (or lack thereof) of these two fields for all the documents in our corpus; Table 2 summarises the findings.

  Header field        Fraction
  only Etag           7.12%
  only Last-Modified  8.18%
  Both                16.75%
  None                67.95%

Table 2: Usage of Etag and Last-Modified HTTP header fields.

Similarly to studies about the usage of these two fields for HTML documents [10], we found that 67.95% of the 550K documents did not report either of these two fields. Both fields were available for 16.75% of all the documents. Thus, we have to rely on actively monitoring documents to detect their changes.

6.2 Access and lifespan distribution

We move now to analysis involving the content of data in our corpus. Firstly, we are interested in characterising the distribution of the number of accesses (i.e., appearances) and the lifespan (i.e., the time interval between the first and last appearance) of documents and entities respectively. This is a slightly different computation from [8], where, for example, the authors estimated the lifespan of a document by doubling the time the document was seen in the monitoring window if the document occurred at the beginning of the experiment but not at the end. Figure 2 contains the plots of the frequency and lifespan distributions for the documents (crosses) and entities (circles); we observe that the distributions follow approximately an "80-20" law.

Figure 2: Access and lifespan distribution of entities and documents (y-axis logscale).

From this figure, we can also conclude that only a fraction of the documents appeared frequently in the different snapshots. Considering the importance of having as much information as possible to apply and verify our change frequency model (Section 5.5) – and thus to gain a good overview of the resources' dynamics – going forward we will give special consideration to the subset of our corpus derived from documents that appear in at least 20 weekly snapshots, and ignore missing observations when considering changes. Again, this is necessitated by the non-deterministic factor in our incidentally crawled snapshots.

6.3 How often do the resources change?

Next, we analyse the average change frequency of a resource. For the purposes of this analysis, we only consider the subset of the corpus which features resources that appear in more than 20 snapshots. Let us assume a document d changed 12 times during our monitoring interval of 24 weeks: in this case we can estimate the average change frequency of d to be 24 weeks / 12 = 2 weeks. Following this example, the results for the average change frequency of documents and entities are summarised in Figure 3. The left side of the diagram shows the percentage of all resources that were not observed to change (static resources); the right side shows the percentage of non-static resources that were observed to have an average change frequency within the given interval. An interesting finding is that 62% of the total documents did not change at all, along with 68% of the entities. Further, we see that the fraction of documents increases with bigger change intervals, whereas for entities it is quite the opposite: by inspecting the data more closely, we figured out that 51% of the entities with a change frequency of less than 1 week appear in more than one 'local' document. Thus, for example, one document may change the description of many local entities: along these lines, Figure 4 shows the distribution of the number of entities appearing in a given number of documents, where again we can observe a power-law distribution.

Figure 3: Fraction of documents with given average change frequency.

Figure 4: Distribution of reuse of entities among documents (log/log scale).

6.4 What fraction of the Web changed?

Continuing, we now study how quickly and what fraction of the documents and entities changes over time. Along these lines, we count how many documents – and respectively entities – changed after a certain time period. Figure 5 presents the cumulative change function for documents (circles) and entities (squares): the graph cumulatively shows how many documents and entities had changed after X weeks. The plot contains the cumulative change function for all resources (appearing at least once), and for resources that appeared in at least 20 snapshots. Again, the plot correlates with Figure 3 in that, for the subset of the corpus with more than 20 observations, we can also see a large amount of entities changing after the first week, with a more gradual increase in observed document changes. An interesting observation is that the entities with more than 20 observations show a higher propensity to change; one could assume that such entities are better linked (and thus appear more often in our crawl) and so are reused in more documents (cf. Figure 4).
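The per-resource statistics above can be sketched from boolean change histories (one flag per weekly snapshot interval; a minimal illustration with helper names of our own choosing):

```python
def avg_change_frequency(change_flags, total_weeks=24):
    """Average change frequency as estimated in Section 6.3: the
    monitoring interval divided by the number of observed changes,
    e.g. 12 changes in 24 weeks -> one change every 2 weeks.
    Returns None for static resources (no observed change)."""
    changes = sum(change_flags)
    return total_weeks / changes if changes else None

def cumulative_changed_fraction(histories, weeks):
    """Fraction of resources that changed at least once within the
    first `weeks` intervals (the cumulative change function)."""
    changed = sum(1 for flags in histories if any(flags[:weeks]))
    return changed / len(histories)
```

For instance, a resource with 12 change flags set over the 24-week interval yields an average change frequency of 2.0 weeks.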
Figure 5: Cumulative change function.

6.5 Change process – a mathematical model?

Next, we analyse whether we can apply the Poisson model presented in Section 5.5 to the changes of documents and entities detected in our analysis. To do so, we must compute the average change rate λ for each document d and entity e. We group the documents and entities with the same change rate and plot their distributions of successive change intervals; e.g., a document which changed in weeks 2 and 6 has a successive change interval of 4. If the changes can be modeled as a Poisson process, the resulting graph should be distributed exponentially.

For illustration, we selectively present the graph for documents with an average change frequency of 4 weeks (Figure 6) and the graph for entities with an average change frequency of 4 weeks (Figure 7). We performed a Poisson regression (log-linear regression) and used the maximum likelihood method to estimate the parameters. The predicted Poisson process is plotted in the graphs as the line, and describes the observed data quite well, despite some small variations. Similar effects are observed for around half of the other plots. However, we also spotted several graphs for documents and entities in which the Poisson model does not describe the observed data points well: the main reason for this observation is that there are not enough available sample points. As a conclusion of these findings, we currently cannot accept or reject the described change model with statistical significance; further studies with more data samples are required.

Figure 6: Documents with an average change frequency of 4 weeks (#occ > 20 weeks, y-axis logscale).

Figure 7: Entities with an average change frequency of 4 weeks (#occ > 20 weeks, y-axis logscale).

6.6 Discussion of the results

We found that in 90% of all documents less than 10% of the entities changed, as depicted in Figure 8, which shows the distribution of the average fraction of entities that changed per document. It is hence safe to assume that – in the context of Linked Data – the finer-grained entity-centric perspective on changes is superior to the more traditional document-centric point of view.

Figure 8: Average fraction of entity changes per document (y-axis logscale).

Drawing towards a conclusion to our analysis, we now discuss the observed changes for the documents over time. To do so, we defined the following main change categories:

• Update (U) – that is, between two snapshots of a document, the entities described were the same but the information about the entities changed: new statements were added and/or removed;

• Add (A) – that is, between two snapshots of a document, new entities were added;

• Del (D) – that is, between two snapshots of a document, entities were deleted;

• combinations of the three categories mentioned above: UA, UD, AD, UAD.

Table 3 lists the fraction of documents which encountered such a change (or combination thereof) for each of the seven categories. We can see that 76.88% of the documents have only entity updates as changes, whereas in 9.46% of the documents new entities were added.

       U       A      D      (UA|UD|AD)  total
  U    76.88%  9.46%  7.08%  3.87%       97.29%
  A    9.46%   0.19%  2.29%  3.87%       15.81%
  D    7.08%   2.29%  0.23%  3.87%       13.5%

Table 3: Results of document change categories.

7. CONCLUSION

We motivated this work by highlighting the importance of a fundamental understanding of dataset dynamics with respect to Linked Open Data sources; we further claim that such knowledge can be leveraged to optimise existing systems and algorithms, such as making incremental index update techniques more efficient. Further, we discussed in detail the differences between document-centric and entity-centric dynamics, together with possible approaches for change detection: content monitoring, HTTP header monitoring, and active notifications.

The findings we gained from weekly snapshots of the neighborhood graph of Tim Berners-Lee's FOAF file are the following:

• less than 35% of the monitored documents contained Etag and Last-Modified HTTP header fields in the response;

• a surprisingly small amount (∼35%) of the monitored resources changed over the time interval of 24 weeks;

• half of the documents that changed had a change frequency of more than 3 months – in contrast, on an entity-centric level, half of the entities had a change frequency of less than a week, applying our definition of local entities (based on PLD correspondence between document and entity);

• comparing our results to previously published studies, we cannot verify that the change frequency of the documents and entities entirely follows the change model of a Poisson process.

We should perhaps look at these early findings with a critical eye in that we did not actively monitor a fixed set of sources. This work is very much an early attempt in this field, and needs further exploration and research to fully understand and exploit the change frequency of resources in the Linked Data Web.

7.1 Future Work

Large-scale experiment. To verify our early findings and derive statistically significant results, we plan to extend our evaluation and run it on a larger dataset which is monitored over a longer time period. Further, we plan to study the dynamics on an entity-centric level in more detail; e.g., studying the dynamics of only authoritative entities as defined in [18], or the dynamics of global entities as defined in Section 3.

Active monitoring. A major drawback of the current study is the monitoring method used for our data set. To overcome the problem of an incomplete change history, we will actively survey a selected set of documents over a long time period, thus creating a tailored corpus for our analysis. In addition to active monitoring, we plan to study how we can dynamically adapt the monitoring interval based on the estimated change frequency of a resource.

Fine-grained analysis of changes on an entity-centric level. Finally, the findings of this work will be integrated into an existing system which aims to execute live queries over the LOD Web using efficient data summary approaches [13]. Thus, using our analytics, we would hope to discern documents which are highly dynamic from those which are more static: highly dynamic documents would be better suited to direct-lookup approaches, whereas static data would be more suited to index summaries (or indeed, full-blown data warehousing approaches) for query answering. Similarly, we could also investigate what kinds of statements for an entity in a document change; e.g., an rdf:type statement should be rather static, whereas a statement describing the values of sensor data is rather dynamic.

8. ACKNOWLEDGEMENTS

Our work has been partly supported by the European Commission under Grant No. SFI/08/CE/I1380 (Lion-2), and under Grant No. 231335, FP7/ICT-2007.4.4 (iMP project). Further, this work has greatly benefited from the feedback of Bernhard Haslhofer and Nico Popitsch, as well as from discussions with Andreas Harth.

9. REFERENCES

[1] Dataset dynamics (ESW wiki). http://esw.w3.org/topic/DatasetDynamics.

[2] REST and RDF Granularity. http://dret.typepad.com/dretblog/2009/05/rest-and-rdf-granularity.html, May 2009.

[3] R. Almeida, B.

[18] A. Hogan, A. Harth, and A. Polleres. Scalable Authoritative OWL Reasoning for the Web. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
Mozafari, and J. Cho. On the [19] I. Jacobs and N. Walsh. Architecture of the World evolution of wikipedia. In Int. Conf. on Weblogs and Wide Web, Volume One. W3C Recommendation 15 Social Media, 2007. December 2004, W3C Technical Architecture Group [4] F. Biessmann and A. Harth. Analysing dependency (TAG), 2004. dynamics in web data. In Linked AI: AAAI Spring [20] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov. Symposium ”Linked Data Meets Artificial Irlbot: Scaling to 6 billion pages and beyond. ACM Intelligence”, 2010. Trans. Web, 3(3):1–34, 2009. [5] C. Bizer, T. Heath, and T. Berners-Lee. Linked [21] B. S. Lerner and A. N. Habermann. Beyond schema Data—The Story So Far. Special Issue on Linked evolution to database reorganization. In Data, International Journal on Semantic Web and OOPSLA/ECOOP ’90: Proceedings of the European Information Systems (IJSWIS), 5(3):1–22, 2009. conference on object-oriented programming on [6] B. E. Brewington and G. Cybenko. How dynamic is Object-oriented programming systems, languages, and the web? Comput. Netw., 33(1-6):257–276, 2000. applications, pages 67–76, New York, NY, USA, 1990. [7] G. Cheng and Y. Qu. Term dependence on the ACM. semantic web. In ISWC ’08: Proceedings of the 7th [22] S. Pandey, K. Ramamritham, and S. Chakrabarti. International Conference on The Semantic Web, pages Monitoring the dynamic web to respond to continuous 665–680, Berlin, Heidelberg, 2008. Springer-Verlag. queries. In WWW ’03: Proceedings of the 12th [8] J. Cho and H. Garcia-Molina. The evolution of the international conference on World Wide Web, pages web and implications for an incremental crawler. In 659–668, New York, NY, USA, 2003. ACM. VLDB, pages 200–209, 2000. [23] S. Reddy and M. Fisher. Requirements for Event [9] J. Cho and H. Garcia-Molina. Estimating frequency of Notification Protocol. Internet Draft, May 1, 1998, change. ACM Trans. Internet Techn., 3(3):256–290, IETF WEBDAV Working Group, 1998. 2003. [24] G. Tummarello, C. 
Morbidoni, R. Bachmann-Gmür, [10] L. R. Clausen. Concerning Etags and Datestamps. In and O. Erling. Rdfsync: Efficient remote Proceedings of the 4th International Web Archiving synchronization of rdf models. In ISWC/ASWC, pages Workshop, 2004. 537–551, 2007. [11] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, Y. Lafon, M. Nottingham, and J. Reschke. HTTP/1.1, part 6: Caching. Internet Draft, Expires: April 29, 2010, IETF HTTPbis Working Group, 2009. [12] R. Fielding and R. Taylor. Principled design of the modern Web architecture. ACM Trans. Internet Technol., 2(2):115–150, 2002. [13] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th World Wide Web Conference (WWW2010), Raleigh, NC, USA, Apr. 2010. ACM Press. accepted for publication. [14] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In International Semantic Web Conference, pages 258–271, 2006. [15] M. Hartung, T. Kirsten, and E. Rahm. Analyzing the evolution of life science ontologies and mappings. In DILS ’08: Proceedings of the 5th international workshop on Data Integration in the Life Sciences, pages 11–27, Berlin, Heidelberg, 2008. Springer-Verlag. [16] B. Haslhofer and N. Popitsch. DSNnotify - detecting and fixing broken links in linked data sets. In Proceedings of the 8th International Workshop on Web Semantics (WebS 09), co-located with DEXA 2009, 2009. [17] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the Size of the Semantic Web? In I-Semantics 2008: International Conference on Semantic Systems, Graz, Austria, 2008.