Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources

Jürgen Umbrich, Michael Hausenblas, Aidan Hogan, Axel Polleres, Stefan Decker
Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
IDA Business Park, Lower Dangan, Ireland
firstname.lastname@deri.org

ABSTRACT

Datasets in the LOD cloud are far from being static in their nature and in how they are exposed. As resources are added and new links are set, applications consuming the data should be able to deal with these changes. In this paper we investigate how LOD datasets change and what sensible measures there are to accommodate dataset dynamics. We compare our findings with traditional, document-centric studies concerning the "freshness" of document collections, and propose metrics for LOD datasets.

Copyright is held by the author/owner(s). LDOW2010, April 27, 2010, Raleigh, USA.

1. INTRODUCTION

The Linked Open Data (LOD) movement has gained remarkable momentum over the past years. At the time of writing, well over one hundred datasets – including UK governmental data, the New York Times dataset, and LinkedGeoData – have been published, providing several billion RDF triples interlinked by hundreds of millions of RDF links. Some datasets, such as DBpedia, have been available from the very beginning of the LOD movement and regularly undergo changes on both the instance level and the schema level. New resources are added and old resources are removed; new links are set to other datasets, and old links are removed as the target has vanished. We should hence assume that datasets in the LOD cloud are dynamic in their very nature. Dataset dynamics is a term we recently coined [1], essentially addressing content and interlinking changes in Linked Data sources.

Our main contributions herein are: (i) we define dataset dynamics characteristics and how to measure them, and (ii) we compare the dataset dynamics of the LOD cloud to the traditional Web (the Web of HTML documents). The motivating use-case for our study of dataset dynamics is to gain insights into – and hopefully improve – concurrent work on an efficient system for performing live queries over the Linked Open Data Web [13]. However, aside from this use-case, having knowledge about dataset dynamics is essential for a number of tasks:

• web crawling and caching [9];

• distributed query optimisation [13];

• maintaining link integrity [16];

• servicing of continuous queries [22];

• replication and synchronisation [24].

We begin in Section 2 by reviewing existing work, and continue in Section 3 by discussing and contrasting document- vs. entity-centric perspectives concerning dynamics. Thereafter, in Section 4 we present the background of our analysis, in Section 5 we describe our methodology for analysing dataset dynamics, and in Section 6 we discuss the results of our analysis. Finally, in Section 7, we conclude and outline future work.

2. RELATED WORK

As motivated above, the study of changes in documents and data sets is very relevant for a broad range of application domains. Earlier work discussed analysis of the dynamics of the Web circa 2008, leveraging their findings for the optimisation of re-indexing techniques [6]. The work of Cho et al. provides a comprehensive study regarding the change frequency of Web documents: earlier work focussed on how to integrate this knowledge into an incremental crawler [8]; further work provided a detailed discussion of estimators of the frequency of changes given incomplete history [9]. Other research has focused on, for example, investigating the dynamics of Wikipedia articles [3] and the evolution of database schema over time [21].

With respect to the Semantic Web, some research regarding dynamics has been conducted with respect to analysing the evolution of ontologies in the life science community [15]. In [16] the authors reported on their work concerning DSNotify, a system for detecting and fixing broken links in LOD datasets.

However – and to the best of our knowledge – we are not aware of any published studies more generally regarding the change frequency of resources on the Linked Open Data Web, and thus deem the work herein to be novel.

3. DOCUMENTS VS. ENTITIES: DIFFERENT PERSPECTIVES ON LINKED DATA

There are various aspects of dataset dynamics which must be considered in order to achieve a comprehensive overview of how Linked Open Data changes and evolves on the Web. Firstly, the change frequency of data on the Web can vary significantly across datasets, from rather static sources – such as archives – to frequently-changing sources – for example, in the micro-blogging domain. Also, the change volume can range from small-scale updates – in our case, updates involving a low number of triples – to bulk updates, which potentially affect many resources. One must also pay attention to the perspective one takes on resources: that is, whether we are interested in local changes of particular datasets, or are interested in global changes with respect to what is said about a URI in all accessible linked datasets.

Before we continue, however, we must first provide some preliminaries. Firstly, our notion of a 'document' refers to an atomic Web 'container' in which Linked Data is typically exposed: these include RDF/XML documents, (X)HTML+RDFa documents, etc. Secondly, we often refer to an 'entity', by which we intuitively mean anything identified by a URI in Linked Data, including classes, properties, and the "real-world artefacts" described. (Note that in this paper we currently overlook entities 'identified' by blank nodes; concretely, blank-node entities do not have consistent naming, which has adverse consequences for the analysis presented in Section 5.3.) Following from both, we can now distinguish the following perspectives in dataset dynamics:
1. A document-centric perspective, which focuses on datasets and is motivated by the "traditional" Web as well as the REST community [12, 2];

2. An entity-centric perspective, which focuses on entities as described in the Linked Open Data Web [5] – we further separate the entity-centric perspective into:

(a) an entity-per-document perspective, which takes into account occurrences of an entity with respect to a specific document;

(b) a global entity perspective, which takes into account all appearances of an entity across the Web.

In particular, the entity-centric perspectives are more LOD-specific than the document-centric perspective prevalent in more traditional views on dataset dynamics. Many applications operating on the LOD cloud assume an entity-centric view where entities become the unit of knowledge, and data on such entities are aggregated from multiple documents. Also, LOD documents may be dynamically served from an entity-centric index (e.g., a SPARQL endpoint), whereby a change in one entity may entail changes in many documents. Thus, we believe the distinction between the document- and entity-centric perspectives to be important for our purposes herein.

In fact, the global entity perspective may be infeasible to monitor, as arbitrary new sources can publish data about any entities. For this reason – and despite formally discussing perspective 2b herein – note that in the present work we will focus on the analysis of perspectives 1 and 2a, and leave approximative techniques for the analysis of 2b as part of our future research (discussed in Section 7).

Despite the two distinct perspectives, both are related: there is naturally a relation between entities and their appearances in different containers. Along these lines, Figure 1 depicts a typical distribution of entities per document in the LOD cloud. As we have already shown elsewhere [17], this distribution follows a power law.

Figure 1: A typical distribution of entities in documents.

In order to formalise what we mean by these different perspectives, let R = {r1, ..., rn} be the set of all resources as per the Architecture of the World Wide Web [19]: that is, HTTP entities and documents. Further, we define D = {d1, ..., dn}, D ⊂ R, as the set of all documents (i.e., dereferenceable entities that point to RDF data), and E = {e1, ..., en}, E ⊂ R, as the set of all entities. A document d can mention various entities; thus, we denote the set of entities mentioned in document d as E(d) ⊆ E, and likewise the set of all documents mentioning e as D(e) ⊆ D. Further, let ver(d, t) be the state of document d at time-point t – i.e., the RDF graph served by d at time t. It is clear that different use cases require specific state functions ver(d, t) and equality measures; e.g., a state function could be the hash value of the RDF graph, a set of RDF statements, or the set of inferable new statements.

Then, the document change function of document d from time t to t′ (where t < t′) is defined as follows:

Definition 1 (Document Change Function).

    C_d(t, t′) = 0 if ver(d, t) = ver(d, t′); 1 otherwise.

Likewise, we define the entity-per-document change function as follows:

Definition 2 (Entity-per-document Change Function).

    C_d^e(t, t′) = 0 if ver(d, t) ∩ e = ver(d, t′) ∩ e; 1 otherwise,

where by G ∩ u we denote all triples in graph G mentioning u. Finally, the entity change function can be defined as follows:

Definition 3 (Entity Change Function).

    C^e(t, t′) = max_{d ∈ D(e)} (C_d(t, t′)).

Please note that we pursue a purely 'syntactic' notion of change, and do not consider more advanced notions relating to 'semantic' change: for example, we would consider a change in a datatype literal if the syntax of that literal changes even though the semantic interpretation does not – this change would then propagate to the respective entity/document despite no real change on the semantic level. Further, we do not consider any forms of reasoning over the changes – e.g., we do not propagate changes in a class definition as changes to its member entities. We leave further discussion and related analysis of 'semantic vs. syntactic change' for future work.

It may also be interesting to consider more closely the relationship between documents and the entities they contain, examining separately the change function of entities which are considered 'local' with respect to the document they appear in. To this end, we introduce the term local entity, meaning an entity in a document whose pay-level domain (PLD) is the same as the document's PLD: here, a PLD is defined as any domain that requires payment at a top-level-domain (TLD) or country-code TLD registrar [20]. Taking an example, let PLD(uri) be the PLD extraction function; then:

    PLD(http://www.deri.ie/) = deri.ie

We can now define a local entity as follows:

Definition 4 (Local Entity). We define the set of local entities E_local(d) of document d as

    E_local(d) = {e ∈ E(d) | PLD(e) = PLD(d)}.

Definition 4 is closely related to a similar notion defined in [7], which defines locality based on the correspondence of hostnames. Note that, according to this definition, an entity may be local to several documents, which may not always be desirable. Alternatively, one could focus on the authoritative relationship between entities and documents, whereby the document an entity redirects to is the authoritative document for that entity [18]. In this paper, we currently only consider the locality relationship between documents and entities, and plan to investigate stronger notions such as authoritativeness in future work.

4. CHANGE DETECTION MECHANISMS

So far, we have focused on identifying and formalising different notions of change – particularly change functions – as a foundational aspect of dataset dynamics. We now discuss how such changes can be detected; one can group change detection mechanisms as follows:

• HTTP-metadata monitoring: analysis of HTTP response headers – including datestamp and ETag [11] – to detect whether something has changed;

• content monitoring: fetching the entire content and determining locally what has changed;

• notification: active notification by a data source that something has changed (ideally what has changed) [16].

                Content    HTTP       Notification
  availability  +          ± [10]     ± [16]
  reliability   +          ± [10]     unknown
  costs         high       low        unknown
  scalability   high       high       unknown
  documents     yes        yes        yes
  entities      no         partially  yes

Table 1: Change detection mechanisms' aspect matrix.

Table 1 summarises aspects of the aforementioned change detection mechanisms. The aspects – motivated by [10] – are as follows: (i) availability, meaning whether the respective solution is available out-of-the-box in currently deployed systems on the Web; (ii) reliability, referring to the ability to correctly capture all changes; (iii) costs, referring to the resources needed for the approach (in terms of bandwidth, storage, etc.); and (iv) scalability with respect to the number of involved data publishers (in terms of infrastructure) and consumers (concerning, for example, the number of concurrent "subscribers" in a notification system). Further, we have included two Linked Data specific aspects in Table 1: (v) support for document-centric change detection, and (vi) support for entity-centric change detection.

Both content and HTTP metadata monitoring mechanisms are well studied, and discussion about those is available elsewhere (cf. [10, 11]). The characteristics of Web-scale notification mechanisms – especially concerning reliability, costs, and scalability – are subject to research at the time of writing. However, there are some remarkable implementation and standardisation efforts ongoing, including but not limited to:

• online services (e.g., http://www.changedetection.com/);

• earlier efforts for a lightweight notification standard: for instance the Event Notification Protocol (ESN) (see "Requirements for Event Notification Protocol" [23]);

• pubsubhubbub: a simple, open, server-to-server webhook-based pubsub (publish/subscribe) protocol as an extension to Atom and RSS (http://code.google.com/p/pubsubhubbub/).

5. METHODOLOGY

To the best of our knowledge, this is the first study regarding the dynamics of documents and entities of the Linked Open Data Web. Hence, the methodologies used in our evaluation are inspired by related work for Web documents. Specifically, we applied similar evaluation methods – and indeed try to answer similar questions – as presented in [8]. The experiments require a large data set which is constantly monitored over a long timespan in order to draw significant conclusions: we are not aware of any significant, heterogeneous, and publicly available dataset of Linked Open Data resources which includes a complete history of changes.
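As a reference point for the evaluation that follows, the change functions of Definitions 1–3 and the local-entity filter of Definition 4 can be sketched concretely. The sketch below is a minimal Python illustration of our own devising (all helper names are ours): an RDF graph is modelled as a set of (subject, predicate, object) string triples, and the simplified pld() merely keeps the last two host labels rather than consulting a registrar-based public-suffix list.

```python
from urllib.parse import urlparse

def doc_change(graph_t, graph_t2):
    """Definition 1: 0 if the graph served by d is unchanged, else 1."""
    return 0 if graph_t == graph_t2 else 1

def restrict(graph, e):
    """G ∩ e: all triples in graph G mentioning the entity URI e."""
    return {t for t in graph if e in t}

def entity_per_doc_change(graph_t, graph_t2, e):
    """Definition 2: compare only the triples mentioning e."""
    return 0 if restrict(graph_t, e) == restrict(graph_t2, e) else 1

def entity_change(snaps_t, snaps_t2, docs_of_e):
    """Definition 3: e changed if any document mentioning e changed."""
    return max(doc_change(snaps_t[d], snaps_t2[d]) for d in docs_of_e)

def pld(uri):
    """Simplified pay-level-domain extraction: keeps the last two host
    labels, so pld('http://www.deri.ie/') = 'deri.ie'. A faithful
    implementation would consult the public suffix list."""
    return ".".join(urlparse(uri).netloc.split(".")[-2:])

def local_entities(doc_uri, entities):
    """Definition 4: entities sharing the document's PLD."""
    return {e for e in entities if pld(e) == pld(doc_uri)}
```

For instance, a snapshot pair that only alters triples not mentioning an entity e yields a document change (C_d = 1) but no entity-per-document change for e (C_d^e = 0).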
Nevertheless, we have access to such a dataset collected for an extended period in early 2009; although the dataset was originally collected for a different purpose – and thus, as we will see, is not as suitable for our analysis as a bespoke corpus might be – we can derive some illustrative statistics which give some early insights into the dynamic nature of Linked Data on the Web. (Notably, this dataset was already studied by Biessmann et al. [4] with respect to dependency dynamics between the people described in the data set.) Next, we describe how this dataset was monitored and which methods we use for our evaluation.

5.1 Monitoring

To gain first insights about the dynamics of resources of the Linked Open Data Web, we analyse 24 data dumps collected by weekly snapshots of the 7-hop neighborhood of Tim Berners-Lee's FOAF file (http://www.w3.org/People/Berners-Lee/card). The weekly snapshots were collected using the MultiCrawler framework [14], with the following steps applied in each crawl cycle:

1. gathering the content of a list of URIs;

2. parsing of RDF/XML content;

3. extracting all URIs at the subject and object position of a triple;

4. shuffling the list of extracted URIs;

5. applying a per-domain limit for the URIs (5,000 URIs per PLD).

Please note that steps 4 and 5 were done for politeness reasons, to prevent too many parallel HTTP requests to one server: these steps introduce a non-deterministic element into our crawl, and thus we did not monitor a fixed list of URIs every week. Indeed, this passive monitoring makes change frequency analysis more challenging [9]. We have to deal with an incomplete history of sources, wherein it is very likely that many sources appear only once in the snapshots – thus, we sometimes present statistics which use only a small subset of the total dataset: the subset derived from sources that were available in more than 20 of the 24 snapshots.

5.2 Data Corpus

The data collection was performed over 24 weeks starting from the 2nd of November 2008, and contains 550K RDF/XML documents with a total of 3.3M unique subjects (∼6 entities appearing in the subject position per source), with 2.8M locally defined entities as per our Definition 4.

5.3 Change Detection Function

The change detection of a document C_d(t, t′) or entity C_d^e(t, t′) between two snapshots t, t′ is a trivial task as long as the statements of the resource do not contain blank nodes [24]. For our preliminary evaluation, we used a simple change detection algorithm – based on a merge-sort scan over the weekly snapshots – as follows:

1. skolemise blank nodes within a document;

2. sort all statements relevant for the change detection of a document or entity by their syntactic natural order (subject–predicate–object–[context]);

3. perform pairwise comparison of the statements by scanning two snapshots in linear time;

4. trigger a detection of change (either w.r.t. a document or an entity) as soon as the order of the statements differs between two snapshots (e.g., new statements were added or removed).

5.4 Evaluation

In this subsection, we describe in detail the evaluation we performed on the data set.

Document-centric evaluation. Firstly – and as a baseline – we performed a document-centric evaluation, which allows us to compare our results with earlier studies about HTML documents. For this study, we compute the changes of a document as defined in Definition 1.

Entity-centric evaluation. Secondly, we studied the change frequency of entities from an entity-per-document perspective as defined in Definition 2. In fact, more accurately, we analysed the change frequency from a local-entity-per-document perspective – a notion which follows intuitively from Definitions 2 and 4: to detect a change in an entity C_d^{e_local}(t, t′), we compare only the statements which 1) are contained in documents whose URIs match on the PLD level with the entity URI, and 2) in which the entity URI appears. Thus, we consider only the changes from documents in the locality of the entity as defined in Definition 4.

5.5 Change Process – A Poisson Process

Finally, for the purposes of comparison, we use an established model for changes of Web documents. Previously published studies [8] report that changes in Web documents can be modeled as a Poisson process (Equation 1). Poisson processes are used – for example – to model arrival times of customers, the times of radioactive emissions, or the number of sharks appearing on a beach in a given year. The model allows one to calculate the probability of a number of events occurring in a fixed period of time, given that (i) the events are independent of the time elapsed since the last event, and (ii) the events occur with a known average frequency rate λ. The parameter λ is the expected number of 'events' or 'arrivals' per the required unit of time (in our case, a week). Further, let N(t + τ) − N(t) be the number of changes in an interval (t, t + τ], with τ given as a number of weeks – to take an example in our scenario, if an entity e changed five times in a window of the last 10 weeks, then N_e(14 + 10) − N_e(14) = 5. Finally, let k be the number of occurrences of a document or entity in the total monitoring time (24 weeks in our case). Then, according to the Poisson process, the probability of k events occurring within a given interval (t, t + τ] is given as:

    P[N(t + τ) − N(t) = k] = ((λτ)^k / k!) · exp(−λτ),  for k = 0, 1, 2, ...    (1)
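To make Equation 1 concrete, a small sketch (the function and variable names are ours):

```python
import math

def poisson_change_prob(k, lam, tau):
    """Equation 1: P[N(t + tau) - N(t) = k] for a resource with an
    average change rate `lam` (changes per week) observed over an
    interval of `tau` weeks."""
    return (lam * tau) ** k / math.factorial(k) * math.exp(-lam * tau)

# A document with an average change frequency of 4 weeks has
# lam = 1/4 changes per week; the probability of observing exactly
# one change in a 4-week window is then e^-1 (about 0.37).
p = poisson_change_prob(1, 0.25, 4)
```

Summing the probabilities over all k for a fixed λ and τ yields 1, as expected for a probability distribution.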
6. FINDINGS

In this section we present several early findings about the change frequency of resources on the Linked Data Web. Firstly, we examine the usage of the Etag and Last-Modified HTTP header fields, followed by an analysis of various dynamic aspects which are aligned with the studies of the traditional Web in [8].

6.1 Usage of Etag and Last-Modified

One way to detect changes is to use the information contained in HTTP response headers, as discussed in Table 1. The HTTP protocol offers two header fields to indicate a change of a document, viz. the Etag and Last-Modified fields. Using such methods of change detection is more economical in that it avoids the need for content sniffing. We verified the usage (or lack thereof) of these two fields for all the documents in our corpus; Table 2 summarises the findings.

  Header field        Fraction
  only Etag           7.12%
  only Last-Modified  8.18%
  Both                16.75%
  None                67.95%

Table 2: Usage of Etag and Last-Modified HTTP header fields.

Similarly to studies about the usage of these two fields for HTML documents [10], we found that 67.95% of the 550K documents did not report either of these two fields. Both fields were available for 16.75% of all the documents. Thus, we have to rely on actively monitoring documents to detect their changes.

6.2 Access and lifespan distribution

We move now to analysis involving the content of data in our corpus. Firstly, we are interested in characterising the distribution of the number of accesses (i.e., appearances) and the lifespan (i.e., the time interval between the first and last appearance) of documents and entities respectively. This is a slightly different computation from [8], where, for example, the authors estimated the lifespan of a document by doubling the time the document was seen in the monitoring window if the document occurred at the beginning of the experiment but not at the end. Figure 2 contains the plots of the frequency and lifespan distributions for the documents (crosses) and entities (circles); we observe that the distributions follow approximately an "80-20" law.

Figure 2: Access and lifespan distribution of entities and documents (y-axis logscale).

From this figure, we can also conclude that only a fraction of the documents appeared frequently in the different snapshots. Considering the importance of having as much information as possible to apply and verify our change frequency model (Section 5.5) – and thus to gain a good overview of the resources' dynamics – going forward we will give special consideration to the subset of our corpus derived from documents that appear in at least 20 weekly snapshots, and ignore missing observations when considering changes. Again, this is necessitated by the non-deterministic factor in our incidentally crawled snapshots.

6.3 How often do the resources change?

Next, we analyse the average change frequency of a resource. For the purposes of this analysis, we only consider the subset of the corpus which features resources that appear in more than 20 snapshots. Let us assume a document d changed 12 times during our monitoring interval of 24 weeks: in this case we can estimate the average change frequency of d to be 24 weeks / 12 = 2 weeks. Following this example, the results for the average change frequency of documents and entities are summarised in Figure 3. The left side of the diagram shows the percentage of all resources that were not observed to change (static resources); the right side shows the percentage of non-static resources that were observed to have an average change frequency within the given interval. An interesting finding is that 62% of the total documents did not change at all, along with 68% of the entities. Further, we see that the fraction of documents increases with bigger change intervals, whereas for entities it is quite the opposite: by inspecting the data more closely, we figured out that 51% of the entities with a change frequency of less than 1 week appear in more than one 'local' document. Thus, for example, one document may change the description of many local entities: along these lines, Figure 4 shows the distribution of the number of entities appearing in a given number of documents, where again we can observe a power-law distribution.

Figure 3: Fraction of documents with given average change frequency.

Figure 4: Distribution of reuse of entities among documents (log/log scale).

6.4 What fraction of the Web changed?

Continuing, we now study how quickly and what fraction of the documents and entities changes over time. Along these lines, we count how many documents – and respectively entities – changed after a certain time period. Figure 5 presents the cumulative change function for documents (circles) and entities (squares): the graph cumulatively shows how many documents and entities had changed after X weeks. The plot contains the cumulative change function for all resources (appearing at least once), and for resources that appeared in at least 20 snapshots. Again, the plot correlates with Figure 3 in that, for the subset of the corpus with more than 20 observations, we can also see a large amount of entities changing after the first week, with a more gradual increase in observed document changes. An interesting observation is that the entities with more than 20 observations show a higher propensity to change; one could assume that such entities are better linked (and thus appear more often in our crawl) and so are reused in more documents (cf. Figure 4).
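The per-resource statistics above can be sketched from boolean change histories (one flag per weekly snapshot interval; a minimal illustration with helper names of our own choosing):

```python
def avg_change_frequency(change_flags, total_weeks=24):
    """Average change frequency as estimated in Section 6.3: the
    monitoring interval divided by the number of observed changes,
    e.g. 12 changes in 24 weeks -> one change every 2 weeks.
    Returns None for static resources (no observed change)."""
    changes = sum(change_flags)
    return total_weeks / changes if changes else None

def cumulative_changed_fraction(histories, weeks):
    """Fraction of resources that changed at least once within the
    first `weeks` intervals (the cumulative change function)."""
    changed = sum(1 for flags in histories if any(flags[:weeks]))
    return changed / len(histories)
```

For instance, a resource with 12 change flags set over the 24-week interval yields an average change frequency of 2.0 weeks.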
Figure 5: Cumulative change function.

6.5 Change process – a mathematical model?

Next, we analyse whether we can apply the Poisson model presented in Section 5.5 to the changes of documents and entities detected in our analysis. To do so, we must compute the average change rate λ for each document d and entity e. We group the documents and entities with the same change rate and plot their distributions of successive change intervals; e.g., a document which changed in weeks 2 and 6 has a successive change interval of 4. If the changes can be modeled as a Poisson process, the resulting graph should be distributed exponentially.

For illustration, we selectively present the graph for documents with an average change frequency of 4 weeks (Figure 6) and the graph for entities with an average change frequency of 4 weeks (Figure 7). We performed a Poisson regression (log-linear regression) and used the maximum likelihood method to estimate the parameters. The predicted Poisson process is plotted in the graphs as the line, and describes the observed data quite well, despite some small variations. Similar effects are observed for around half of the other plots. However, we also spotted several graphs for documents and entities in which the Poisson model does not describe the observed data points well: the main reason for this observation is that there are not enough available sample points. As a conclusion of these findings, we currently cannot accept or reject the described change model with statistical significance; further studies with more data samples are required.

Figure 6: Documents with an average change frequency of 4 weeks (#occ > 20 weeks, y-axis logscale).

Figure 7: Entities with an average change frequency of 4 weeks (#occ > 20 weeks, y-axis logscale).

6.6 Discussion of the results

We found that in 90% of all documents less than 10% of the entities changed, as depicted in Figure 8, which shows the distribution of the average fraction of entities that changed per document. It is hence safe to assume that – in the context of Linked Data – the finer-grained entity-centric perspective on changes is superior to the more traditional document-centric point of view.

Figure 8: Average fraction of entity changes per document (y-axis logscale).

Drawing towards a conclusion to our analysis, we now discuss the observed changes for the documents over time. To do so, we defined the following main change categories:

• Update (U) – that is, between two snapshots of a document, the entities described were the same but the information about the entities changed: new statements were added and/or removed;

• Add (A) – that is, between two snapshots of a document, new entities were added;

• Del (D) – that is, between two snapshots of a document, entities were deleted;

• combinations of the three categories mentioned above: UA, UD, AD, UAD.

Table 3 lists the fraction of documents which encountered such a change (or combination thereof) for each of the seven categories. We can see that 76.88% of the documents have only entity updates as changes, whereas in 9.46% of the documents new entities were added.

       U       A      D      (UA|UD|AD)  total
  U    76.88%  9.46%  7.08%  3.87%       97.29%
  A    9.46%   0.19%  2.29%  3.87%       15.81%
  D    7.08%   2.29%  0.23%  3.87%       13.5%

Table 3: Results of document change categories.

7. CONCLUSION

We motivated this work by highlighting the importance of a fundamental understanding of dataset dynamics with respect to Linked Open Data sources; we further claim that such knowledge can be leveraged to optimise existing systems and algorithms, such as making incremental index update techniques more efficient. Further, we discussed in detail the differences between document-centric and entity-centric dynamics, together with possible approaches for change detection: content monitoring, HTTP header monitoring, and active notifications.

The findings we gained from weekly snapshots of the neighborhood graph of Tim Berners-Lee's FOAF file are the following:

• less than 35% of the monitored documents contained Etag and Last-Modified HTTP header fields in the response;

• a surprisingly small amount (∼35%) of the monitored resources changed over the time interval of 24 weeks;

• half of the documents that changed had a change frequency of more than 3 months – in contrast, on an entity-centric level, half of the entities had a change frequency of less than a week, applying our definition of local entities (based on PLD correspondence between document and entity);

• comparing our results to previously published studies, we cannot verify that the change frequency of the documents and entities entirely follows the change model of a Poisson process.

We should perhaps look at these early findings with a critical eye in that we did not actively monitor a fixed set of sources. This work is very much an early attempt in this field, and needs further exploration and research to fully understand and exploit the change frequency of resources in the Linked Data Web.

7.1 Future Work

Large-scale experiment. To verify our early findings and derive statistically significant results, we plan to extend our evaluation and run it on a larger dataset which is monitored over a longer time period. Further, we plan to study the dynamics on an entity-centric level in more detail; e.g., studying the dynamics of only authoritative entities as defined in [18], or the dynamics of global entities as defined in Section 3.

Active monitoring. A major drawback of the current study is the monitoring method used for our data set. To overcome the problem of an incomplete change history, we will actively survey a selected set of documents over a long time period, thus creating a tailored corpus for our analysis. In addition to active monitoring, we plan to study how we can dynamically adapt the monitoring interval based on the estimated change frequency of a resource.

Fine-grained analysis of changes on an entity-centric level. Finally, the findings of this work will be integrated into an existing system which aims to execute live queries over the LOD Web using efficient data summary approaches [13]. Thus, using our analytics, we would hope to discern documents which are highly dynamic from those which are more static: highly dynamic documents would be better suited to direct-lookup approaches, whereas static data would be more suited to index summaries (or indeed, full-blown data warehousing approaches) for query answering. Similarly, we could also investigate what kinds of statements for an entity in a document change; e.g., an rdf:type statement should be rather static, whereas a statement describing the values of sensor data is rather dynamic.

8. ACKNOWLEDGEMENTS

Our work has been partly supported by the European Commission under Grant No. SFI/08/CE/I1380 (Lion-2), and under Grant No. 231335, FP7/ICT-2007.4.4 (iMP project). Further, this work has greatly benefited from the feedback of Bernhard Haslhofer and Nico Popitsch, as well as from discussions with Andreas Harth.

9. REFERENCES

[1] Dataset dynamics (ESW wiki). http://esw.w3.org/topic/DatasetDynamics.

[2] REST and RDF Granularity. http://dret.typepad.com/dretblog/2009/05/rest-and-rdf-granularity.html, May 2009.

[3] R. Almeida, B.

[18] A. Hogan, A. Harth, and A. Polleres. Scalable Authoritative OWL Reasoning for the Web. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
Mozafari, and J. Cho. On the [19] I. Jacobs and N. Walsh. Architecture of the World evolution of wikipedia. In Int. Conf. on Weblogs and Wide Web, Volume One. W3C Recommendation 15 Social Media, 2007. December 2004, W3C Technical Architecture Group [4] F. Biessmann and A. Harth. Analysing dependency (TAG), 2004. dynamics in web data. In Linked AI: AAAI Spring [20] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov. Symposium ”Linked Data Meets Artificial Irlbot: Scaling to 6 billion pages and beyond. ACM Intelligence”, 2010. Trans. Web, 3(3):1–34, 2009. [5] C. Bizer, T. Heath, and T. Berners-Lee. Linked [21] B. S. Lerner and A. N. Habermann. Beyond schema Data—The Story So Far. Special Issue on Linked evolution to database reorganization. In Data, International Journal on Semantic Web and OOPSLA/ECOOP ’90: Proceedings of the European Information Systems (IJSWIS), 5(3):1–22, 2009. conference on object-oriented programming on [6] B. E. Brewington and G. Cybenko. How dynamic is Object-oriented programming systems, languages, and the web? Comput. Netw., 33(1-6):257–276, 2000. applications, pages 67–76, New York, NY, USA, 1990. [7] G. Cheng and Y. Qu. Term dependence on the ACM. semantic web. In ISWC ’08: Proceedings of the 7th [22] S. Pandey, K. Ramamritham, and S. Chakrabarti. International Conference on The Semantic Web, pages Monitoring the dynamic web to respond to continuous 665–680, Berlin, Heidelberg, 2008. Springer-Verlag. queries. In WWW ’03: Proceedings of the 12th [8] J. Cho and H. Garcia-Molina. The evolution of the international conference on World Wide Web, pages web and implications for an incremental crawler. In 659–668, New York, NY, USA, 2003. ACM. VLDB, pages 200–209, 2000. [23] S. Reddy and M. Fisher. Requirements for Event [9] J. Cho and H. Garcia-Molina. Estimating frequency of Notification Protocol. Internet Draft, May 1, 1998, change. ACM Trans. Internet Techn., 3(3):256–290, IETF WEBDAV Working Group, 1998. 2003. [24] G. Tummarello, C. 
Morbidoni, R. Bachmann-Gmür, [10] L. R. Clausen. Concerning Etags and Datestamps. In and O. Erling. Rdfsync: Efficient remote Proceedings of the 4th International Web Archiving synchronization of rdf models. In ISWC/ASWC, pages Workshop, 2004. 537–551, 2007. [11] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, Y. Lafon, M. Nottingham, and J. Reschke. HTTP/1.1, part 6: Caching. Internet Draft, Expires: April 29, 2010, IETF HTTPbis Working Group, 2009. [12] R. Fielding and R. Taylor. Principled design of the modern Web architecture. ACM Trans. Internet Technol., 2(2):115–150, 2002. [13] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th World Wide Web Conference (WWW2010), Raleigh, NC, USA, Apr. 2010. ACM Press. accepted for publication. [14] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In International Semantic Web Conference, pages 258–271, 2006. [15] M. Hartung, T. Kirsten, and E. Rahm. Analyzing the evolution of life science ontologies and mappings. In DILS ’08: Proceedings of the 5th international workshop on Data Integration in the Life Sciences, pages 11–27, Berlin, Heidelberg, 2008. Springer-Verlag. [16] B. Haslhofer and N. Popitsch. DSNnotify - detecting and fixing broken links in linked data sets. In Proceedings of the 8th International Workshop on Web Semantics (WebS 09), co-located with DEXA 2009, 2009. [17] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the Size of the Semantic Web? In I-Semantics 2008: International Conference on Semantic Systems, Graz, Austria, 2008.