Towards a Dynamic Linked Data Observatory

Tobias Käfer (Karlsruhe Institute of Technology, Germany)
Jürgen Umbrich (Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland)
Aidan Hogan (Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland)
Axel Polleres (Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria)

Acknowledgements. This work has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).
LDOW2012, April 16, 2012, Lyon, France. Copyright held by the author(s)/owner(s).

ABSTRACT

We describe work-in-progress on the design and methodology of the Dynamic Linked Data Observatory: a framework to monitor Linked Data over an extended period of time. The core goal of our work is to collect frequent, continuous snapshots of a subset of the Web of Data that is interesting for further study and experimentation, with an aim to capture raw data about the dynamics of Linked Data. The resulting corpora will be made openly and continuously available to the Linked Data research community. Herein, we (1) motivate the importance of such a corpus; (2) outline some of the use-cases and requirements for the resulting snapshots; (3) discuss different "views" of the Web of Data that affect how we define a sample to monitor; (4) detail how we select the scope of the monitoring experiment through sampling; and (5) discuss the final design of the monitoring framework that will gather regular snapshots of (subsets of) the Web of Data over the coming months and years.

1. INTRODUCTION

Linked Data enjoys continued momentum in terms of publishing, research and development; as a result, the Web of Data continues to expand in size, scope and diversity. However, we see a niche in terms of understanding how the Web of Data evolves and changes over time. Establishing a more fine-grained understanding of the nature of Linked Data dynamics is of core importance to publishing, research and development. With regards to publishing, a better understanding of Linked Data dynamics would, for example, inform the design of tools for keeping published data consistent with changes in external data (e.g., [26]). With regards to research, more granular data on Linked Data dynamics would open up new paths for exploration, e.g., the design of hybrid/live query approaches [30] that know when a query relates to dynamic data, and that retrieve fresh results directly from the source. With regards to development, for current systems, the results of such studies would inform crawling strategies, local index refresh rates and update strategies, cache invalidation techniques and tuning, and so forth.
Towards a better understanding of Linked Data dynamics, the community currently lacks a high-quality, broad, granular corpus of raw data that provides frequent snapshots of Linked Data documents over a sustained period of time. Current results in the area rely on domain-specific datasets (e.g., Popitsch and Haslhofer [26] focus on DBpedia changes [2]), or infrequently updated snapshots of open-domain data over short monitoring time-spans (e.g., our previous work looked at 20 weekly snapshots collected in 2008 [28]). Thus, we believe that a first, important step is to build from scratch a new monitoring framework to derive a bespoke, continuously updated collection of snapshots. This collection can then be freely used to study not only the high-level dynamics of entities, but also to distil the fundamental underlying patterns of changes in the Web of Data across different domains (e.g., studying if certain graph patterns are more dynamic than others [29]).

Clearly the design of such a framework, and gathering the requirements for the resulting collection, is non-trivial: many factors and actors have to be taken into consideration in the context of the open Web and the Linked Data community. Conceptually, we want the collection to be:

general-purpose: suitable to study for a wide range of interested parties;
broad: capturing a wide selection of Linked Data domains;
substantial: the number of documents monitored should allow for deriving confident statistical measures;
granular & frequent: offering detailed data on sources;
contiguous: allowing comparison of sources over time; and
adaptive: able to discover the arrival of new sources, and to monitor more dynamic sources more frequently.

However, some of these targets are antagonistic and demand a trade-off. Monitoring a substantial number of sources in a granular & frequent fashion requires a practical compromise to be made. Similarly, contiguous data and adaptive monitoring are conflicting aims. Furthermore, the aims to be general-purpose and broad need more concrete consideration in terms of what sources are monitored.

For implementing the monitoring framework, other practical considerations include politeness, such that remote servers are not unnecessarily overburdened; stability, such that the monitoring experiment can function even in the case of hardware failure; and resource overhead, such that the computation can be run on a single, dedicated commodity machine.

Taking these design criteria into account, herein we motivate and initially propose a framework for taking frequent, continuous snapshots of a subset of Linked Data, which we call DyLDO: the Dynamic Linked Data Observatory. We currently focus on defining the size and scope of the monitoring experiment, discussing the rationale behind our choice of sources to observe, touching upon various issues relating to sampling the Web of Data. Later, we also sketch the framework itself, outlining the crawling to be performed for each snapshot, providing rationale for different parameters used in the crawl, as well as proposing an adaptive filtering scheme for monitoring more dynamic sources more frequently. Our primary goal here is to inform the community of our intentions, outline our rationale, collect use-cases and potential consumers of the snapshot collection, and to gather feedback and requirements prior to starting the monitoring experiments. In particular, we currently focus on defining the scope of the monitoring experiment.
To begin, we motivate our work in terms of envisaged use-cases and research questions our corpus could help with (§ 2), subsequently presenting some related work (§ 3). Next, in order to understand what we are monitoring/sampling – to ascertain its borders – we ask the question What is the Web of Data?, and compare two prominent "views" thereof: (1) the Billion Triple Challenge dataset view, and (2) the CKAN/LOD cloud view (§ 4). Thereafter, we describe the sampling methodology we have used to derive a "seed list" of URIs that form the core of the monitoring experiment (§ 5). Finally, we outline the proposed monitoring framework, detailing setup parameters and adaptive extensions (§ 6). We conclude with future directions and a call for feedback and potential use-cases from the community (§ 7).

2. USE CASES AND OPEN QUESTIONS

We now discuss the potential benefit and impact of our proposed observatory for Linked Data based on (1) some envisaged use-cases, and (2) some open research questions that our data could help to empirically investigate.

In previous work [31], we gave an overview of Web and Linked Data dynamics, presenting four community use cases that require technical solutions to deal with dynamic data. We first extend these four example scenarios to motivate our ongoing work on the Dynamic Linked Data Observatory.

UC-1: Synchronisation.
Synchronisation addresses the problem of keeping an offline sample/replica of the Web of Data up-to-date.
The most common scenario is the maintenance of locally cached LOD indexes. To the best of our knowledge, none of the popular Semantic Web caches (such as those hosted by Sindice and OpenLink) have investigated index-update strategies based on the current knowledge of the dynamicity of Linked Data, but rather rely on publisher-reported information about update frequencies where available (e.g., sitemaps: http://sitemaps.org/).

UC-2: Smart Caching.
This use-case tries to find efficient methods to optimise systems operating live over Web data by minimising network traffic wasted on unnecessary HTTP lookups.
Versus synchronisation, this use-case targets systems that operate live and directly over the Web of Data. An exemplary use-case would be the integration of smart caching for live querying (e.g., [18]) or live browsing (e.g., [1]) over Linked Data, avoiding re-accessing a document or dereferencing a URI if it is unlikely to have changed according to knowledge of dynamicity patterns.

UC-3: Hybrid Architectures.
A large index of Linked Data can implement a hybrid architecture based on dynamicity statistics, where one processing pipeline is optimised for static knowledge, and another for dynamic knowledge.
In various Linked Data search and query engines, there is an inherent trade-off between running computation live during query-time or pre-computing answers offline. Abstractly, pre-computation suits static data and runtime computation suits dynamic data. Example trade-offs include dynamic insertions vs. batch loading, lightweight indexing vs. heavyweight indexing, runtime joins vs. live joins, backward-chaining reasoning vs. forward-chaining reasoning, window-based stream operators vs. global operators, etc. Knowledge of dynamicity can help decide which methods are appropriate for which data. Furthermore, smart hybrid queries may become possible: consider the query "Give me (current) temperatures of European capitals", where knowledge of dynamicity would reveal that temperatures are dynamic and should be fetched live, whereas European capitals are static and can be run (efficiently) over the local index.
UC-4: External-Link Maintenance.
The link maintenance use-case addresses the challenge of preserving referential integrity and the correct type of links in the presence of dynamic external data.
Popitsch and Haslhofer [26] investigate this use case, which involves monitoring external Web datasets for changes to help ensure the integrity of links targeting them from local data. They propose DSNotify: a solution and an evaluation framework which is able to replay changes in a dataset; however, the authors only have knowledge of DBpedia dynamics to leverage. Such works help ensure the quality of links between datasets, and we hope that our corpora will help extend their application to the broader Web of Data.

UC-5: Vocabulary Evolution and Versioning.
Knowledge of the dynamics of Linked Data vocabularies could lead to better versioning methods.
Changes in the semantics of terms in Linked Data vocabularies can have a dramatic influence on the interpretation of remote datasets using them. For example, the FOAF vocabulary is extremely widely used on the Web of Data [12, 20], but often resorts to informal versioning mechanisms: aside from the monotonic addition of terms, for example, FOAF recently removed disjointness constraints between popular classes like foaf:Person and foaf:Document, made foaf:logo inverse-functional, etc., changes that may alter inferencing over (or even break) external data. Studying and understanding the dynamicity of vocabularies may motivate and suggest better methodologies for versioning.

We foresee that research and tools relating to these use-cases will benefit directly from having our data collection available for analysis. However, aside from these concrete use-cases and more towards a Web science viewpoint, we also see some rather more fundamental – and possibly related – empirical questions that our collection should help answer:

Change frequency: Can we model the change frequency of documents with mathematical models and thus predict future changes?
Change patterns: Can we mine patterns that help to categorise change behaviour?
Degree of change: If a document changes, how much of its content is updated?
Lifespan: What is the lifespan of Linked Data documents?
Stability: How stable are Linked Data documents in terms of HTTP accessibility?
Growth rate: How fast is the Web of Data evolving?
Structural changes: Do we observe any changes in the structure of the network formed by links?
Change triggers: Can we find graph patterns that trigger or propagate changes through the network?
Domain-dependent changes: Do we observe a variation or clustering in dynamicity across different domains?
Vocabulary-dependent changes: Do we observe different change patterns for data using certain vocabularies, classes or properties?
Vocabulary changes: How do the semantics of vocabulary terms evolve over time?
3. BACKGROUND

Various papers have addressed similar research questions for the Web. Most works thus far have focused on the dynamicity of the traditional HTML Web, which (mostly) centres around dynamicity on the level of document changes. For the purposes of our use-cases, our notion of Linked Data dynamics goes deeper and looks to analyse dynamic patterns within the structured data itself: i.e., dynamicity should also be considered on the level of resources and of statements (as noted previously [29, 26]). Still, studies of the dynamicity of the HTML Web can yield interesting insights for our purposes. In fact, in previous works, we initially established a link between the frequency of change of Linked Data documents and HTML documents [28].

The study of the evolution of the Web (of Documents) and its implicit dynamics reaches back to the proposal of the first generation of autonomous World Wide Web spiders (aka. crawlers) around 1995. Bray [4] published one of the first studies about the characteristics of the Web and estimated its size in 1996. Around the same time, Web indexes such as AltaVista or Yahoo! began to offer one of the first concrete use-cases for understanding the change frequency of Web pages: the efficient maintenance of search engine indexes. In 1997, Coffman et al. [9] proposed a revisiting strategy for Web crawlers to improve the "freshness" of an index. This work was continuously improved over the subsequent years with additional experimental and theoretical results provided by Brewington [5, 6], Lim et al. [21], Cho et al. [8, 7], Fetterly et al. [14] and Ntoulas et al. [22], amongst others.

Based on large data collections, these papers presented theory and/or empirical analyses of the HTML Web that relate closely to the dynamicity questions we highlight. For example, various authors discovered that the change behaviour of Web pages corresponds closely with – and can be predicted using – a Poisson distribution [5, 6, 8, 7]; in previous work [28], we presented initial evidence that changes in Linked Data documents also follow a Poisson distribution, though our data was insufficient to be conclusive. Relating to high-level temporal change patterns, Ntoulas et al. [22] analysed the different frequency of updates for individual weekdays and working hours. The same paper also empirically estimated the growth rate of the Web to be ∼8 % new content every week, and regarding structural changes, found that the link structure of the Web changes faster than the textual content by a factor of 3. Various authors found that with respect to the degree of change, the majority of changes in HTML documents are minor [21, 14, 22]. Loosely related to change triggers, Fetterly et al. [14] found that certain parties simulate content changes to draw the attention of search engines. Regarding domain-dependent changes, various authors also showed that change frequencies vary widely across top-level domains [14, 8, 5, 6].
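To make the Poisson model concrete, the following sketch estimates a per-document change rate from equally spaced snapshots and predicts the chance of seeing at least one change before the next visit. It is an illustrative sketch under our own assumptions (hash-based change detection, a naive rate estimator, invented names); Cho and Garcia-Molina [8] derive less biased estimators for exactly this setting.

```python
import math

def estimate_change_rate(digests, interval_hours):
    """Naive Poisson rate estimate (changes per hour) for one document,
    given content digests from equally spaced snapshots."""
    changes = sum(1 for a, b in zip(digests, digests[1:]) if a != b)
    observed_hours = (len(digests) - 1) * interval_hours
    return changes / observed_hours if observed_hours else 0.0

def change_probability(rate, horizon_hours):
    """P(at least one change within the horizon) under a Poisson process."""
    return 1.0 - math.exp(-rate * horizon_hours)

# Hypothetical weekly snapshots of a single document (content hashes).
digests = ["a1", "a1", "b7", "b7", "c9", "c9", "c9"]
lam = estimate_change_rate(digests, interval_hours=168)   # 168 h = one week
print(change_probability(lam, horizon_hours=168))          # ~0.28 for this toy series
```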
Relating to use-cases for studying the dynamics of Web documents, a variety have been introduced down through the years, including (i) the improvement of Web proxies or caches looked at by, e.g., Douglis et al. [13], and (ii) efficient handling of continuous queries over documents [24] or, returning to RDF, over SPARQL endpoints [25]. We refer interested readers to the excellent survey by Oita and Senellart [23], which provides a comprehensive overview of existing methodologies to detect Web page changes, and also surveys general studies about Web dynamics.

4. WHAT IS THE WEB OF DATA?

Our data collection should allow researchers to study various aspects and characteristics of data dynamics across a broad selection of Linked Data domains. However, it is not clear which data providers should be considered as "in scope", which are of interest for the Linked Data community who we target, and how we should define our target population. Linked Data itself is a set of principles and an associated methodology for publishing structured data on the Web in accordance with Semantic Web standards and Web Architecture tenets [19]. Various data providers are compliant with Linked Data principles to varying degrees: there is no one yardstick by which a dataset can be unambiguously labelled as "Linked Data".

For the purposes of the seminal Linking Open Data (LOD) project, Cyganiak and Jentzsch use a variety of minimal requirements a dataset should meet to be included in the LOD Cloud diagram [10], which serves as an overview of connections between Linked Data corpora. However, the LOD Cloud is biased towards large monolithic datasets published on one domain, and does not cover low-volume cross-domain publishing as is common for vocabularies such as FOAF, SIOC, etc. For example, platforms like identi.ca/status.net, Drupal, Wordpress, etc., can export compliant, decentralised Linked Data – using vocabularies such as FOAF and SIOC – from the various domains where they are deployed, but their exports are not in the LOD Cloud. Furthermore, it gives no explicit mention to the vocabularies themselves, which are of high relevance to our requirements.

A broader notion to consider is the Web of Data, which would cover these latter exporters and vocabularies, but which is somewhat ambiguous and with ill-defined borders. For our purposes, we define the Web of Data as being comprised of interlinked RDF data published on the Web. (This definition is perhaps more restrictive than some interpretations where, e.g., Sindice incorporates Microformats into their Web of Data index [11].) No clear road-map is available for the Web of Data per this definition; the LOD cloud only covers prominent subsets thereof. Perhaps the clearest picture of the Web of Data comes from crawlers that harvest RDF from the Web. A prominent example is the Billion Triple Challenge Dataset, which is made available every year and comprises data collected during a deep crawl of RDF/XML documents on the Web of Data. However, the precise composition of such datasets is unclear, and requires further study.

As such, the core question of what it is we want to monitor – i.e., what population of domains the Linked Data community is most interested in, and thus what population we should sample from – is non-trivial, and probably has no definitive answer. To get a better insight, in this section we contrast two such perspectives of the Web of Data, the:

Billion Triple Challenge 2011 dataset [27], which is collected from a Web crawl of over seven million RDF/XML documents; and the
Comprehensive Knowledge Archive Network (CKAN) repository – specifically the lodcloud group therein – containing high-level metrics reported by Linked Data publishers, used in the creation of the LOD Cloud.

We want to see how these two road-maps of the Web of Data can be used as the basis of defining a population to sample.

4.1 The BTC 2011 dataset

The BTC dataset is crawled from the Web of Data using the MultiCrawler framework [17] for the annual Billion Triple Challenge [3] at the International Semantic Web Conference (ISWC). The dataset empirically captures a deep, broad sample of the Web of Data in situ.

However, the details of how the Billion Triple Challenge dataset is collected are somewhat opaque. The seed list is sampled from the previous year's dataset [27], where one of the initial seed-lists in past years was gathered from various semantic search engines. The crawl is for RDF/XML content, and follows URIs extracted from all triple positions. Scheduling (i.e., prioritising URIs to crawl) is random, where URIs are shuffled at the end of each round. As such, any RDF/XML document reachable through other RDF/XML documents from the seed list is within scope; otherwise, what content is (or is not) in the BTC – and how "representative" the dataset is of the Web of Data – is difficult to ascertain purely from the collection mechanisms.

As such, it is more pertinent to look at what the dataset actually contains. The most recent BTC dataset – BTC 2011 – was crawled in May/June 2011. The final dataset contains 2.145 billion quadruples, extracted from 7.411 million RDF/XML documents. The dataset contains RDF documents sourced from 791 pay-level domains (PLDs): a pay-level domain is a direct sub-domain of a top-level domain (TLD) or a second-level country domain (ccSLD), e.g., dbpedia.org, bbc.co.uk. We prefer the notion of a pay-level domain since fully qualified domain names (FQDNs) exaggerate the diversity of the data: for example, sites such as livejournal.com assign different subdomains to individual users (e.g., danbri.livejournal.com), leading to millions of FQDNs on one site, all under the control of one publisher. The BTC 2011 dataset contained documents from 240,845 FQDNs, 233,553 of which were from the livejournal.com PLD. Henceforth, when we mention domain, we thus refer to a PLD (unless otherwise stated).
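For illustration, the following sketch groups URIs by pay-level domain. The tiny suffix set is a stand-in assumption; a real implementation would consult a complete public-suffix list to recognise all TLDs and ccSLDs.

```python
from urllib.parse import urlparse

# Stand-in suffix set; a production system would load the full Public Suffix List.
SUFFIXES = {"org", "com", "net", "ca", "fr", "co.uk", "gov.uk"}

def pay_level_domain(uri):
    """Direct sub-domain of a TLD or ccSLD, e.g.
    danbri.livejournal.com -> livejournal.com, www.bbc.co.uk -> bbc.co.uk."""
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    for size in (2, 1):  # try the longest matching suffix first
        if len(labels) > size and ".".join(labels[-size:]) in SUFFIXES:
            return ".".join(labels[-(size + 1):])
    return host

print(pay_level_domain("http://danbri.livejournal.com/foaf.rdf"))  # livejournal.com
print(pay_level_domain("http://www.bbc.co.uk/music"))              # bbc.co.uk
```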
On the left-hand side of Table 1 we enumerate the top-25 PLDs in terms of quadruples contributed to the BTC 2011 dataset. Notably, a large chunk of the dataset (∼64 %) is provided by the hi5.com domain: a social gaming site that exports a FOAF file for each user. As observed for similar corpora (cf. [20, Table A.1]), hi5.com has many documents, each with an average of over two thousand statements – an order of magnitude higher than most other domains – leading to it dominating the overall volume of BTC statements.

Figure 1: Distribution of the number of statements in documents for the BTC 2011 dataset (a) overall and (b) for hi5.com; as well as (c) the periodicity of the distribution of statements-per-document for hi5.com that causes the split tail in (a) & (b). [Axes: number of statements (x) vs. number of documents (y).]

The dominance of hi5.com – and to a lesser extent similar sites like livejournal.com – shapes the overall characteristics of the BTC 2011 dataset. To illustrate one prominent such example, Figure 1a gives the distribution of statements per document in the BTC dataset on log/log scale, where one can observe a rough power-law(-esque) characteristic. However, there is an evident three-way split in the tail emerging at about 120 statements, and ending in an outlier spike at around 4,000 statements. By isolating the distribution of statements-per-document for hi5.com in Figure 1b, we see that it contributes to the large discrepancies in that interval. The stripes are caused by periodic patterns in the data, due to its uniform creation: on the hi5.com domain, RDF documents with a statement count of 10 + 4f are heavily favoured, where ten triples form the base of a user's description and four triples are assigned to each of f friends.
Other lines are formed due to two optional fields (foaf:surname/foaf:birthday) in the user profile, giving a 9 + 4f and an 8 + 4f periodicity line. An enforced ceiling of f ≤ 1,000 friends explains the spike at (and around) 4,010.

The core message here is that although the BTC offers a broad view of the Web of Data, covering 791 domains, in absolute statement-count terms the dataset is skewed by a few high-volume exporters of FOAF, and in particular hi5.com. When deriving global statistics and views from the BTC, the results say more about the code used to generate hi5.com profiles than the efforts of thousands of publishers. (Furthermore, hi5.com is not even a prominent domain on the Web of Data in terms of being linked, and was ranked 179/778 domains in a PageRank analysis of a similar corpus; http://aidanhogan.com/ldstudy/table21.html) This is also a naturally-occurring phenomenon in other corpora (e.g., [12, 20]) crawled from the Web of Data – not just isolated to the BTC dataset(s) – and is not easily fixed. One option to derive meaningful statistics about the Web of Data from such datasets is to apply (aggregated) statistics over individual domains, and never over the corpus as a whole.
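As a minimal illustration of such per-domain aggregation (the numbers below are made up), a statistic is first computed within each PLD and then averaged across PLDs, so that hi5.com counts once rather than in proportion to its document volume:

```python
from statistics import mean

# statements-per-document samples, keyed by PLD (hypothetical values)
doc_sizes = {
    "hi5.com":     [4010, 3986, 2054, 2298],
    "dbpedia.org": [312, 75, 140],
    "bbc.co.uk":   [48, 52],
}

corpus_mean = mean(n for sizes in doc_sizes.values() for n in sizes)  # skewed by hi5.com
macro_mean = mean(mean(sizes) for sizes in doc_sizes.values())        # each PLD counts once

print(round(corpus_mean, 1), round(macro_mean, 1))
```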
4.2 CKAN/LOD cloud metadata

In contrast to the crawled view of the Web of Data, the CKAN repository indexes publisher-reported statistics about their datasets. These CKAN metadata are then used to decide eligibility for entry into the LOD cloud [10]: a highly prominent depiction of Linked Open Datasets and their interlinkage. A CKAN-reported dataset is listed in the LOD cloud iff it fulfils the following requirements: the dataset has to (1) be published according to core Linked Data principles, (2) contain at least one thousand statements, and (3) provide at least 50 links to other LOD cloud datasets (see http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation).

Given the shortcomings of the crawled perspective on the Web of Data, we explore these self-reported metadata to get an alternative view on what we should be sampling. On September 29, 2011, we downloaded the meta-information for the datasets listed in the lodcloud group on CKAN (http://thedatahub.org/group/lodcloud). The data contain example URIs for the dataset and statistics such as the number of statements. We discovered meta-data for 297 datasets, spanning 206 FQDNs and 133 PLDs. On the right-hand side of Table 1, we enumerate the top-25 largest reported datasets in the lodcloud group on CKAN. Note that where multiple datasets are defined on the same domain, the triple count is presented as the summation of said datasets. In this table, we see a variety of domains claiming to host between 9.8 billion and 94 million triples.

Regarding the data formats present in the LOD cloud, most of the datasets claim to serve RDF/XML data (85 %), and 4 % claim to serve RDFa (of which 50 % did not also offer RDF/XML). This shows the popularity of RDF/XML, but only supporting RDF/XML will still miss out on 15 % of datasets. However, the syntax metadata are somewhat unreliable, where improper mime-types are often reported.
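Read as a predicate over the self-reported metadata, the three LOD cloud admission criteria listed above amount to the following check; the record layout here is an assumption for illustration and not the actual CKAN schema.

```python
def eligible_for_lod_cloud(dataset):
    """Apply the three LOD cloud admission criteria to a
    (hypothetical) publisher-reported metadata record."""
    return (dataset.get("follows_linked_data_principles", False)
            and dataset.get("statements", 0) >= 1000
            and dataset.get("links_to_lod_datasets", 0) >= 50)

example = {"name": "example-dataset",
           "follows_linked_data_principles": True,
           "statements": 120000,
           "links_to_lod_datasets": 75}
print(eligible_for_lod_cloud(example))  # True
```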
4.3 BTC vs. CKAN/LOD

Finally, we contrast the two different perspectives of the Web of Data. Between both, there are 854 PLDs mentioned, with BTC covering 791 domains (∼92.6 %), CKAN/LOD covering 133 domains (∼15.6 %), and the intersection of both covering 70 domains (∼8.2 % overall; ∼8.8 % of BTC; ∼52.6 % of CKAN/LOD). CKAN/LOD reports a total of 28.4 billion triples, whereas the BTC (an incomplete crawl) accounts for 2.1 billion quadruples (∼7.4 %). However, only 384.3 million quadruples in the BTC dataset (∼17.9 %) come from PLDs mentioned in the extracted CKAN/LOD metadata.

In Table 1, we present the BTC and CKAN/LOD statement counts side-by-side. We can observe that a large number of high-volume BTC domains are not mentioned on CKAN/LOD, where the datasets in question may not publish enough RDF data to be eligible for CKAN/LOD, may not follow Linked Data principles or have enough external links, or may not have self-reported.

Perhaps more surprisingly however, we note major discrepancies in terms of the catchment of BTC statements versus CKAN/LOD metadata. Given that BTC can only sample larger domains, a lower statement count is to be expected in many cases: however, some of the largest CKAN/LOD domains do not appear at all. Reasons can be found through analysis of the BTC 2011's publicly available access log [27]. In Table 2, we present reasons for the top-10 highest-volume CKAN/LOD data providers not appearing in the BTC 2011 dataset (i.e., providers appearing with "—" on the right-hand side of Table 1). Robots indicates that crawling was prohibited by robots.txt exclusions; Http-401 and Http-502 indicate the result of lookups for URIs on that domain; Mime indicates that the content on the domain did not return application/rdf+xml, used as a heuristic in the BTC crawl to filter non-RDF/XML content; Unreachable indicates that no lookups were attempted on URIs from that domain; Other refers solely to europeana.eu, which redirected all requests to their home page.

Table 1: Statement counts for top-25 PLDs in the BTC with corresponding reported triple count in CKAN (left), and top-25 PLDs in CKAN with BTC quad count (right).

№  | BTC PLD             | BTC           | LOD           || CKAN/LOD PLD        | LOD           | BTC
1  | hi5.com             | 1,371,854,358 | —             || rpi.edu             | 9,803,140,000 | 900,464
2  | livejournal.com     | 169,863,721   | —             || linkedgeodata.org   | 3,000,000,000 | —
3  | tfri.gov.tw         | 153,300,321   | 23,015,257    || legislation.gov.uk  | 1,900,000,000 | 31,990,934
4  | scinets.org         | 56,075,080    | —             || wright.edu          | 1,730,284,735 | 5
5  | ontologycentral.com | 55,124,003    | 122,000,000   || concordia.ca        | 1,500,000,000 | —
6  | rdfize.com          | 36,154,381    | —             || data.gov.uk         | 1,336,594,576 | 13,302,277
7  | legislation.gov.uk  | 31,990,934    | 1,900,000,000 || dbpedia.org         | 1,204,000,000 | 25,776,027
8  | identi.ca           | 30,429,795    | —             || rdfabout.com        | 1,017,648,918 | —
9  | bibsonomy.org       | 28,670,581    | —             || dbtune.org          | 888,089,845   | 1,634,891
10 | dbpedia.org         | 25,776,027    | 1,204,000,000 || uniprot.org         | 786,342,579   | 4,004,440
11 | freebase.com        | 25,488,720    | 337,203,427   || unime.it            | 586,000,000   | —
12 | opera.com           | 23,994,423    | —             || uriburner.com       | 486,089,121   | —
13 | bio2rdf.org         | 20,168,230    | 72,585,132    || openlibrary.org     | 400,000,000   | 25,396
14 | archiplanet.org     | 13,394,199    | —             || sudoc.fr            | 350,000,000   | —
15 | data.gov.uk         | 13,302,277    | 1,336,594,576 || freebase.com        | 337,203,427   | 25,488,720
16 | loc.gov             | 7,176,812     | 24,151,586    || fu-berlin.de        | 247,527,498   | 5,658,444
17 | vu.nl               | 6,106,366     | 14,948,788    || dataincubator.org   | 205,880,247   | 3,695,950
18 | bbc.co.uk           | 5,984,102     | 80,023,861    || viaf.org            | 200,000,000   | —
19 | rambler.ru          | 5,773,293     | —             || europeana.eu        | 185,000,000   | —
20 | fu-berlin.de        | 5,658,444     | 247,527,498   || moreways.net        | 160,000,000   | —
21 | uniprot.org         | 4,004,440     | 786,342,579   || rkbexplorer.com     | 134,543,526   | 220
22 | dataincubator.org   | 3,695,950     | 205,880,247   || ontologycentral.com | 122,000,000   | 55,124,003
23 | zitgist.com         | 3,446,077     | 60,000,000    || opencorporates.com  | 100,000,000   | —
24 | daml.org            | 3,135,225     | —             || uberblic.org        | 100,000,000   | —
25 | mybloglog.com       | 2,952,925     | —             || geonames.org        | 93,896,732    | 458,490

Table 2: Reasons for the largest ten PLDs in CKAN/LOD not appearing in BTC 2011 (reason categories: Robots, Mime, Http-401, Http-502, Unreachable, Other; PLDs: linkedgeodata.org, concordia.ca, rdfabout.com, unime.it, uriburner.com, sudoc.fr, viaf.org, europeana.eu, moreways.net, uberblic.org).

Table 3: Advantages and disadvantages for both perspectives of the Web of Data.
BTC – Pros: covers more domains (791); empirically validated; includes vocabularies; includes decentralised datasets. Cons: influence of high-volume domains; misses 47.4 % of LOD/CKAN domains.
LOD/CKAN – Pros: domains pass "quality control"; community validated. Cons: covers fewer domains (133); self-reported statistics; misses vocabularies; misses decentralised datasets.

In summary, we see two quite divergent perspectives on the Web of Data, given by the BTC 2011 dataset and the CKAN/LOD metadata. Towards getting a better picture of what population we wish to sample for the monitoring experiments, and towards deciding on a sampling methodology, the pertinent question is then: Which perspective best suits the needs of our monitoring experiment?

As enumerated in Table 3, both perspectives have inherent strengths and weaknesses. As such, we believe that our sampling method should try to take the best of both perspectives, towards a hybrid view of the Web of Data.

5. SAMPLING METHODOLOGY

Due to the size of the Web and the need for frequent snapshots, sampling is necessary to create an appropriate collection of URIs that can be processed and monitored under the given time, hardware and bandwidth constraints. The goal of our sampling method is thus two-fold: to select a set of URIs that (i) capture a wide cross-section of domains and (ii) can be monitored in a reasonable time given our resources and in a polite fashion. Given the previous discussion, we wish to combine the BTC/crawled and CKAN/metadata perspectives when defining our seed-list.

Before we continue to describe the sampling methodology we choose, it is worthwhile to first remark upon sampling methods used in other dynamicity studies of the Web.
Published Sampling Techniques. There are several published works that present sampling techniques in order to create a corpus of Web documents that can be monitored over time. Having studied a variety of major works, we could not find common agreement on a suitable sampling technique for such purposes. The targeted use-cases and research questions directly affect the domain and number of URIs, as well as the monitoring frequency and time frame. Grimes and O'Brien [16] studied the dynamics of very frequently changing pages and prepared their seed list accordingly: initially starting from a list of URIs provided by a Google crawl, they performed two crawls a day and selected the most dynamic URIs (in terms of content changes) that could also be successfully accessed 500 times in a row. As such, the authors focus on monitoring stable and dynamic Web documents. Fetterly et al. [14] randomly sampled URIs from a crawl seeded from the Yahoo! homepage to cover many different topics and providers, giving broad coverage for a general-purpose study; however, the surveyed documents would be sensitive to the underlying distribution of documents in the original Yahoo! corpus. Cho and Garcia-Molina [8] studied the change frequency of pages using a dataset which was sampled by selecting the root URIs of the 270 most popular domains from a 25 million web page crawl and then crawling three thousand pages for each of the domains; this method provides a good, broad balance of documents across different domains.

Our sampling method. We can conclude that existing sampling methods select URIs from crawled documents, either randomly, because of specific characteristics (e.g., dynamic or highly ranked), or to ensure an even spread across different hosts. Thus, we decide to use a combination of these three methods to generate our list of URIs.

Given the previous discussion of Section 4, we start with an initial seed list of URIs taken from: (1) the registered example URIs for the datasets in the LOD cloud; and (2) the most popular URIs in the BTC dataset of 2011. The most popular URIs are selected based on a PageRank analysis of the documents in the BTC 2011 dataset, where we select the top-k ranked documents from this analysis (please see [15] for details); note that many of the top-ranked documents refer to commonly instantiated vocabularies on the Web of Data, which are amongst the most linked/central Linked Data documents. At the time of access, we found 220 example URIs in the CKAN/LOD registry, and we complement them with the top-220 document URIs from the BTC 2011 to generate a list of 440 core URIs for monitoring. The core URIs contain 137 PLDs, 120 from the CKAN/LOD examples and 37 from the most popular BTC URIs. This selection guarantees us to cover all relevant domains (similar to [14]) and to also consider the most popular and interlinked URIs on the Web of Data (similar to [8]).
Obviously, 440 seed URIs are insufficient to resolve a meaningful corpus for observation over time. Thus, we decide to use a crawl and expand outwards from these core URIs to find other documents to monitor in their vicinity. Importantly, we wish to stay in the close locale of the 440 core URIs; if we go further, we will encounter the same problems as observed for the BTC 2011 dataset, where the data are skewed by a few high-volume exporters. To avoid being diluted by, e.g., hi5.com data and the like, we thus stay within a 2-hop crawl radius from the core URIs. From the data thus extracted, we then sample a final set of extended seed URIs to monitor. The result is then our best-effort compromise to achieve representative snapshots of Linked Data that (i) take into account both views on Linked Data by including CKAN and BTC URIs in the core seed list, (ii) extend beyond the core seed list in a defined manner (2 hops), and (iii) do not exceed our crawling resources.

Crawling setup. The crawling setup will have a significant effect on the selection of URIs to monitor, and so we provide some detail thereupon. Our implementation is based on two open source Java projects: (1) LDSpider (http://ldspider.googlecode.com/), a multi-threaded crawling framework for Linked Data; and (2) any23 (http://any23.org/), a parsing framework for various RDF syntaxes. The experiments are intended to run on a dedicated, single-core 2.2 GHz Opteron x86-64 machine with 4 GB RAM on a university network. Thereafter, we use the following configuration:

All RDF syntaxes: unlike BTC crawls, which only consider RDF/XML, we wish to consider all standard serialisation formats for RDF (including Turtle, which is soon to be standardised), as supported by any23. Further, we do not pre-filter content based on Content-type reporting or file extensions. RDFa is becoming a preferred format for many publishers: when parsing RDFa, we monitor the output statements and exclude the content of HTML documents for which we find only "accidental" triples as extracted from titles, stylesheets, icons, etc., and instead only consider documents that intend to publish non-trivial RDFa.
Threads: multithreading keeps the hardware busy while slow HTTP requests are being processed; in previous work [20], we found 64 threads to offer the best tradeoff between parallelism and CPU/disk contention.
Timeouts: we terminate unresponsive connections and sockets after 120 seconds. Timeouts are kept deliberately high to help ensure stable crawls.
Links: we consider all URIs contained in the RDF data as potential links to follow (and, e.g., not only the value of rdfs:seeAlso or such).
Breadth-first: we crawl documents in a round-based, URI-scheduling approach, which should result in a broader set of diverse data (assuming a diverse seed-list) [20].
Redirects: 301, 302, 303 and 307 HTTP response codes are not treated as links, but are instead followed directly in the same round until we reach a non-redirect response (or hit a cycle/path-limit).
Per-domain Queue: our crawler queue is divided into individual per-domain queues, which are polled round-robin: this helps ensure a maximal delay between accessing the same domain again.
Priority Queue: within each individual per-domain queue, we prioritise URIs for which we have found the most links thus far. (This only affects the crawl if an incomplete round is performed.)
Politeness Policy: we implement a politeness delay of two seconds, meaning that we do not access the same PLD twice within that interval; further, for each domain, we retrieve and respect standard robots.txt exclusions.

The minimum amount of time taken to complete a round is thus the maximum number of active URIs for a single domain, multiplied by the politeness delay. The combination of this per-PLD delay and the distribution of documents per domain on the Web [20] introduces the problem of PLD starvation [20]: the nature of the Web of Data means that after numerous low-volume domains have been crawled, the few remaining domains are not enough to keep the crawler resources busy between the politeness delay. In the worst case, when one PLD is left in the queue, only one URI can be crawled every two seconds. However, the frontier – the list of URIs for the next round – may contain a diverse set of domains that can overcome this problem, and allow for higher crawling throughput. Hence, we also add a termination condition for each round: once (1) the seed URIs have been processed, (2) all active redirects have been processed, and (3) the number of active PLDs remaining in the per-domain queue drops under the number of threads (in our setup 64), we end the current round and move to the next round (shifting remaining URIs to the frontier).
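The sketch below distils the queueing behaviour described above: per-PLD queues, a two-second politeness window, and the round-termination test on the number of active PLDs. It is a simplified, single-threaded illustration under our own naming, not the actual LDSpider implementation. It also makes the round-time bound tangible: with a two-second delay, a PLD holding, say, the 7,850 kernel URIs later reported for gesis.org (Table 4) would by itself bound a round at roughly 7,850 × 2 s ≈ 4.4 hours.

```python
import time
from collections import deque, defaultdict

POLITENESS_DELAY = 2.0  # seconds between two lookups on the same PLD
THREADS = 64            # active-PLD threshold used by termination condition (3)

class PerDomainFrontier:
    """Per-PLD queues polled in turn, skipping PLDs still inside their politeness window."""

    def __init__(self):
        self.queues = defaultdict(deque)  # PLD -> URIs still to fetch in this round
        self.last_access = {}             # PLD -> time of the last lookup

    def add(self, pld, uri):
        self.queues[pld].append(uri)

    def next_uri(self):
        """Return a polite (pld, uri) pair, or None if every PLD must still wait."""
        now = time.monotonic()
        for pld in list(self.queues):
            if now - self.last_access.get(pld, 0.0) >= POLITENESS_DELAY:
                uri = self.queues[pld].popleft()
                if not self.queues[pld]:
                    del self.queues[pld]
                self.last_access[pld] = now
                return pld, uri
        return None

    def starving(self):
        """True once fewer active PLDs remain than crawler threads (end the round)."""
        return len(self.queues) < THREADS
```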
Seed list. Starting from our list of 440 core URIs, we wish to expand a 2-hop crawl using the outlined framework, from which we will extract the final seed list of URIs to monitor in our observation framework. However, due to the unpredictability/non-determinism of remote connections, we want to ensure a maximal coverage of the documents in this neighbourhood. Along these lines, we repeated ten complete 2-hop crawls from our core URI list.

With respect to the non-determinism, Figure 2 shows for each crawl the number of documents (left bars) and the number of statements (right bars). We can observe that two crawls (crawl number 1 and 10) have a significantly higher number of statements compared to the other crawls. One reason for this large discrepancy relates to the identi.ca domain, where a URI (referring to Tim Berners-Lee's account; a highly ranked document in the BTC dataset) in the seed round of crawls 1 and 10 offered a rich set of links within that domain, whereas the lookup failed in the other crawls, cutting off the crawler's path in that domain: for example, in the first crawl, identi.ca accounted for 1.5 million statements, whereas in crawl 2, the domain accounted for 17 thousand statements. Such examples illustrate the inherent non-determinism of crawling.

Figure 2: Number of statements and documents per crawl experiment. [x-axis: Crawl №; documents in the left (red) bars, statements in the right (blue) bars.]

In Figure 3, we show for each crawl the number of visited PLDs per round together with the number of new PLDs per round with respect to the previous round. The left bar for each crawl represents Round 0, the middle bar Round 1, and the right bar Round 2. We can observe that the relative level of domains across the crawls is much more stable when compared with the number of documents (cf. Figure 2). Across rounds, the graph shows an average ∼1.3× increase of active PLDs between Rounds 0–1, and an increase of ∼3.4× between Rounds 1–2. Further, we observe that ∼30 % of the PLDs in Round 1 are new compared to the previous round, and roughly 70 % of the PLDs in Round 2 were not visited by the crawler in the rounds before.

Figure 3: Number of PLDs per round per crawl experiment. [x-axis: Crawl №; three bars per crawl for Rounds 0, 1 and 2; new PLDs highlighted.]

Given the non-deterministic coverage of documents, to ensure comprehensive coverage of URIs in the 2-hop neighbourhood, we take the union of URIs that dereferenced to RDF content, resulting in a total set of 95,737 URIs spanning 652 domains, giving an average of 146.8 dereferenceable URIs per domain. Figure 4 shows in log/log scale the distribution of the number of PLDs (y-axis) against the number of URIs in the union list (x-axis); we see that 379 PLDs (∼58.1 %) have one URI in the list, 78 PLDs (∼12.0 %) have two URIs, and so forth. In addition, Table 4 lists the number of URIs for the top-10 PLDs in the set (represented by the ten rightmost dots in Figure 4).

Figure 4: Distribution of the number of documents per PLD. [log/log scale; x-axis: number of documents; y-axis: number of PLDs.]

Table 4: Top 10 PLDs based on the number of URIs.
№  | PLD               | URIs
1  | gesis.org         | 7,850
2  | chem2bio2rdf.org  | 5,180
3  | dbpedia.org       | 3,643
4  | freebase.com      | 3,026
5  | fer.hr            | 2,902
6  | loc.gov           | 2,784
7  | concordia.ca      | 2,784
8  | dbtune.org        | 2,767
9  | fu-berlin.de      | 2,689
10 | semantictweet.com | 2,681
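A minimal sketch of this union-and-count step (names and data handling are assumptions; pay_level_domain is the helper sketched in Section 4.1): the kernel is the union of URIs that dereferenced to RDF in any of the ten crawls, and the per-PLD counts behind Figure 4 and Table 4 follow directly.

```python
from collections import Counter

def build_kernel(crawl_results):
    """crawl_results: one set of successfully dereferenced RDF URIs per crawl run."""
    kernel = set().union(*crawl_results)                    # union over the ten runs
    per_pld = Counter(pay_level_domain(u) for u in kernel)  # distribution behind Fig. 4
    return kernel, per_pld

# kernel, per_pld = build_kernel(ten_crawl_results)
# per_pld.most_common(10)  # cf. Table 4: gesis.org, chem2bio2rdf.org, dbpedia.org, ...
```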
6. MONITORING CONFIGURATION

The next step in the setup of our observatory is to select the monitoring techniques and intervals we apply. Note that we have yet to start the monitoring experiments; we now instead present some initial results and outline our proposed methodology for feedback from the community.

In general, there exist two fundamental monitoring techniques. The first technique is to periodically download the content of a fixed list of URIs, as widely used in the literature [5, 6, 7, 8]; this technique allows one to study the evolution of sources over time in a contiguous fashion. The second technique is to periodically perform a crawl from a defined set of URIs [22]; this technique is more suitable if one wants to study the evolution of the neighbourhood network of the seed URIs in an adaptive fashion, but it can also introduce a factor of randomness based on the crawling methods.

We decided to again apply a hybrid approach: primarily, we do not want to limit our observations to URIs online at the start of the experiment, although they will still be our focus. We thus take the set of 95,737 sampling URIs extracted in the previous section as a kernel of contiguous URIs accessed consistently in each snapshot. From the kernel, we propose to crawl as many URIs again using the crawler configuration outlined in the previous section, forming the adaptive segment of the snapshot. Roughly half of our snapshot would comprise the contiguous kernel, reliably providing data about said URIs; and the other half of our snapshot would comprise the adaptive crawl, reflecting changes in the neighbourhood of the kernel. We do not limit PLDs in the adaptive crawl so as to not exclude data providers that come online during the course of the experiment. This setup allows for studying (i) dynamics within datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).

Next, we must decide on the monitoring intervals for our platform: how frequently we wish to perform our crawl. In the literature, it is common to take the data snapshots in either a daily [22, 14] or weekly [5, 6, 8, 7] fashion. Again, in a practical sense, the intervals are highly dependent on the available resources and the size of the seed list. Given our resources and the monitoring requirements, we decided to perform a full snapshot every week.

In addition, to get more granular data in a temporal sense, we propose to apply an adaptive scheduling policy that takes more frequent snapshots for more dynamic data. As such, we propose to set up different monitoring intervals within the full weekly snapshots, where we increase the monitoring frequency for a kernel document if it changes within two consecutive snapshots at the previous interval. Figure 5 depicts the core idea. Using this adaptive approach, we can avoid the local and remote computational overhead involved in regularly polling documents that are observed to be very static. At the moment, we consider fixing the maximum number of intervals per week to 16, which resolves to a time interval width of ∼10 hours. However, the intervals will take a minimum of a week to "warm up", and will probably take longer to stabilise; thus, we can manually add further intervals on-the-fly at a later stage once the experiments are underway (if deemed useful from the results).

Figure 5: Adaptive scheduling based on change events. [Crawl frequency steps from 1× to 2×, 4×, 8× and 16× per week as successive document content changes are observed.]
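The doubling scheme of Figure 5 can be written down in a few lines. This is an illustrative sketch of our reading of the policy (how a "change" between snapshots is detected is left abstract), not a final implementation.

```python
FREQUENCIES = (1, 2, 4, 8, 16)  # snapshots per week; 16/week is one every ~10.5 hours

def next_frequency(current, changed_in_two_consecutive_snapshots):
    """Double a kernel document's crawl frequency after it changed in two
    consecutive snapshots at its current rate, capped at 16x per week."""
    if changed_in_two_consecutive_snapshots and current < FREQUENCIES[-1]:
        return FREQUENCIES[FREQUENCIES.index(current) + 1]
    return current

print(next_frequency(1, True))    # 2: weekly document observed to change twice in a row
print(next_frequency(8, False))   # 8: no recent change, keep the current rate
print(next_frequency(16, True))   # 16: already at the cap
```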
Initial experiment. We conducted an initial experiment, performing a crawl for the 95,737 URI kernel. Our framework downloaded 16 million statements from 80,000 documents, taking a total of 6 hours and 40 minutes. Figure 6 shows the number of documents processed per crawl hour. The average download rate increases from 20k to 70k documents per hour after 60 minutes, but tails off severely after hour 3 due to PLD starvation. We see that processing the kernel is thus feasible within a 10-hour interval.

Figure 6: Number of downloaded documents over time. [x-axis: hours since the start of the crawl; y-axis: number of documents.]

7. CONCLUSION & OUTLOOK

In this paper, we have presented ongoing work towards building DyLDO: the Dynamic Linked Data Observatory. This observatory aims to continuously gather a high-quality collection of Linked Data snapshots for the community to gain a better insight into the underlying principles of dynamicity on the Web of Data. We motivated our proposals based on several concrete use-cases and research questions that could be tackled using such a collection, and presented related works that treat various aspects of dynamicity for HTML documents on the traditional Web. Next we looked at the non-trivial question of what view we should adopt for the Web of Data, introducing and comparing the BTC and CKAN/LOD perspectives, showing how and where they diverge, and weighing up their respective pros and cons. We proposed selecting a kernel of sources to monitor around a core set of URIs taken from the BTC and CKAN/LOD datasets, which are then extended by a 2-hop crawl. We also presented the detailed crawl configuration we plan to use for our monitoring experiments. Finally, we proposed our methodology for performing the continuous observation of sources in and around the kernel, as well as using adaptive intervals to monitor more dynamic sources more frequently.

We plan to begin the monitoring experiments in the next few weeks, and to run them continuously and indefinitely. We still have some practical issues to tackle in terms of creating a backup and archiving system, a site offering access to the community (planned for http://swse.deri.org/DyLDO/), as well as remote fallback procedures in the event of network or hardware failure. Thus, we are at a crucial stage, and are keen to gather final feedback and requirements from the community: we are particularly eager to hear from people who would have a specific interest or use-case for such data – be it in terms of research analysis, evaluation frameworks, etc. – what requirements they have, and whether or not the current proposals would be sufficient. Significant changes will invalidate the snapshots collected up to that point, so we want to gather comments and finalise and activate the framework as soon as possible. In the near future, the community can then begin to chart a new dimension for the Web of Data: time.
8. References

[1] T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, R. Dhanaraj, J. Hollenbach, A. Lerer, and D. Sheets. "Tabulator: Exploring and Analyzing linked data on the Semantic Web". In: SWUI. 2006.
[2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. "DBpedia - A crystallization point for the Web of Data". In: J. Web Sem. 7.3 (2009), pp. 154–165.
[3] C. Bizer and D. Maynard. "The Semantic Web Challenge, 2010". In: J. Web Sem. 9.3 (2011), p. 315.
[4] T. Bray. "Measuring the Web". In: Comput. Netw. ISDN Syst. 28.7–11 (1996), pp. 993–1005.
[5] B. Brewington and G. Cybenko. "How dynamic is the Web?" In: Computer Networks (2000).
[6] B. Brewington and G. Cybenko. "Keeping up with the changing web". In: Computer 33.5 (2000), pp. 52–58.
[7] J. Cho and H. Garcia-Molina. "Effective page refresh policies for Web crawlers". In: ACM Transactions on Database Systems 28.4 (Dec. 2003), pp. 390–426.
[8] J. Cho and H. Garcia-Molina. "Estimating frequency of change". In: ACM Transactions on Internet Technology 3.3 (Aug. 2003), pp. 256–290.
[9] E. G. Coffman Jr., Z. Liu, and R. R. Weber. "Optimal robot scheduling for web search engines". In: Journal of Scheduling 1 (1997), pp. 0–21.
[10] R. Cyganiak and A. Jentzsch. The Linking Open Data cloud diagram. URL: http://richard.cyganiak.de/2007/10/lod/ (visited on 02/06/2012).
[11] R. Delbru, N. Toupikov, M. Catasta, and G. Tummarello. "A Node Indexing Scheme for Web Entity Retrieval". In: ESWC (2). 2010, pp. 240–256.
[12] L. Ding and T. Finin. "Characterizing the Semantic Web on the Web". In: ISWC. 2006, pp. 242–257.
[13] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. Mogul. "Rate of Change and other Metrics: a Live Study of the World Wide Web". In: USENIX Symposium on Internetworking Technologies and Systems. Dec. 1997.
[14] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. "A large-scale study of the evolution of web pages". In: WWW. 2003, pp. 669–678.
[15] B. Glimm, A. Hogan, M. Krötzsch, and A. Polleres. OWL: Yet to arrive on the Web of Data? CoRR. URL: http://arxiv.org/pdf/1202.0984.pdf (visited on 02/06/2012).
[16] C. Grimes and S. O'Brien. "Microscale evolution of web pages". In: WWW. 2008, pp. 1149–1150.
[17] A. Harth, J. Umbrich, and S. Decker. "MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data". In: International Semantic Web Conference. 2006, pp. 258–271.
[18] O. Hartig, C. Bizer, and J. C. Freytag. "Executing SPARQL Queries over the Web of Linked Data". In: ISWC. 2009, pp. 293–309.
[19] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Vol. 1. Morgan & Claypool, 2011, pp. 1–136.
[20] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. "Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine". In: J. Web Sem. 9.4 (2011), pp. 365–401.
[21] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. "Characterizing web document change". In: Advances in Web-Age Information Management (2001), pp. 133–144.
[22] A. Ntoulas, J. Cho, and C. Olston. "What's new on the web? The evolution of the web from a search engine perspective". In: WWW. 2004, pp. 1–12.
[23] M. Oita and P. Senellart. Deriving Dynamics of Web Pages: A Survey. INRIA TR inria-00588715. 2011.
[24] S. Pandey, K. Ramamritham, and S. Chakrabarti. "Monitoring the dynamic web to respond to continuous queries". In: WWW. 2003, pp. 659–668.
[25] A. Passant and P. Mendes. "sparqlPuSH: Proactive notification of data updates in RDF stores using PubSubHubbub". In: Scripting and Development Workshop at ESWC. 2010.
[26] N. Popitsch and B. Haslhofer. "DSNotify - A solution for event detection and link maintenance in dynamic datasets". In: J. Web Sem. 9.3 (2011), pp. 266–283.
[27] The Billion Triple Challenge. URL: http://challenge.semanticweb.org/ (visited on 02/06/2012).
[28] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker. "Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources". In: Proc. of LDOW at WWW. 2010.
[29] J. Umbrich, M. Karnstedt, and S. Land. "Towards Understanding the Changing Web: Mining the Dynamics of Linked-Data Sources and Entities". In: Proc. of KDML at LWA. 2010.
[30] J. Umbrich, M. Karnstedt, J. X. Parreira, A. Polleres, and M. Hauswirth. "Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces". In: DESWEB at ICDE. 2012.
[31] J. Umbrich, B. Villazon-Terrazas, and M. Hausenblas. "Dataset Dynamics Compendium: Where we are so far!" In: Proc. of COLD at ISWC. Shanghai, 2010.