A Query-Driven Characterization of Linked Data

Harry Halpin
Institute for Communicating and Collaborative Systems
University of Edinburgh
2 Buccleuch Place
Edinburgh, United Kingdom
H.Halpin@ed.ac.uk

ABSTRACT
Due to the Linked Data initiative, the once unpopulated Semantic Web is now rapidly being populated with millions of facts stored in RDF. Could any of this data possibly be interesting to ordinary users? In this study, we run queries extracted from the query log of a major hypertext search engine against a Semantic Web search engine to determine if the Semantic Web has anything of interest to the average Web user. There is indeed much Semantic Web information that could be relevant for many queries for entities (like people and places) and abstract concepts, although these possibly relevant results are overwhelmingly clustered around DBPedia. We present an empirical analysis of the results, focusing on their major sources, the structure of the triples, the use of various RDF and OWL constructs, and the power-law distributions produced by both the URIs that serve Linked Data and the URIs in the triples themselves. The issue of 303 redirection and URI identity is given in-depth treatment.

Categories and Subject Descriptors
H.3.d [Information Technology and Systems]: Metadata

General Terms
Experimentation

Keywords
Linked Data statistics, query logs, information retrieval, power law

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
LDOW 2009, April 20, 2009, Madrid, Spain.
ACM 978-1-60558-487-4/09/04.

1. INTRODUCTION
What are the characteristics of Linked Data in the wild? There are two primary questions we are hoping to answer. First, has Linked Data changed from earlier ‘first generation’ Semantic Web efforts? Second, is there anything worth finding for ordinary users in Linked Data? Only a moderately large-scale sampling and analysis of Linked Data can answer this central question. Our method of investigation is to inspect what information needs actual users are expressing via a hypertext search engine, and then use a sample of these queries to determine if Linked Data can satisfy these information needs. We present an analysis of a search-engine query log from a major hypertext search engine, Microsoft’s Live.com, and use this query log to sample Linked Data. As an added benefit, such an empirical analysis can prove or disprove some widely held assumptions, such as whether or not there is an endemic over-use of owl:sameAs and whether or not the Linked Data best practice recommendation of 303 redirection is being followed.

2. PREVIOUS WORK
For the first generation of the Semantic Web, there was very little data-driven analysis of the ontologies, primarily because so few were actually in existence. The first large-scale analysis of the Semantic Web was done via an inspection of the index of Swoogle by Ding and Finin [16]. Ding and Finin first estimated the size of the Semantic Web in 2006 to be 4.91 million Semantic Web documents by searching Google for the media type application/rdf+xml [16]. As this might not include data that is hosted using the wrong media type, they estimated, using Google to include all FOAF files served as HTML and RSS 1.0 files, that the size of the Semantic Web would optimistically be increased by two orders of magnitude. Although the study of Ding and Finin was of great importance as the first empirical study of the Semantic Web, this work has a number of limitations [16]. Its primary limitation was that it was unknown whether any of the Semantic Web documents indexed contained information that anyone would actually want to re-use. Intuitively, most of the data on this first-generation Semantic Web was likely to be of limited value. For example, the vast majority of data on the Semantic Web in 2006 was due to Livejournal exporting every user’s profile as FOAF, usually without the user’s knowledge, and without linking to other URIs, serving with the correct MIME type, or deploying 303 redirection. The second main source of data in Ding and Finin’s study, RSS 1.0, is also of limited value. RSS, originally an XML-based protocol generally used for newsfeeds, was given an RDF-compatible syntax, creating RSS 1.0 [6]. The very application of RDF in RSS 1.0 is questionable, as the data is primarily information about site updates, and so RSS 1.0 data is rarely merged, re-used, or even linked to in a manner that takes advantage of RDF. Due to the idiosyncratic nature of the data sources of the first-generation Semantic Web, it is not surprising that the majority of the data likely contained little information that could satisfy the information need of the average user of the Web.

Due to the Linked Data initiative, the Semantic Web has recently increased in size by several orders of magnitude through the conversion of a large number of high-quality databases into RDF [12]. Since the study by Ding and Finin missed the rise of Linked Data, the time is ripe for more empirical studies of the Semantic Web. It is unclear how the dynamics of the Semantic Web are changing. While the number of URIs indexed by Linked Data search engines like Sindice shows that the general trend of the number of URIs on the Semantic Web visually follows a ‘power-law,’ the correct mathematical analysis has not been done to show this to be the case [26]. The only large-scale study of Linked Data at this time has been by Hausenblas et al., and it estimated the size of Linked Data at approximately 2 billion triples [19]. The focus of that study was only on interlinking between data-sets, and it estimated that there were approximately 3 million interlinks between the various data-sets. The most popular interlinking property by far was dbpedia:hasPhotoCollection, with approximately 2 million occurrences, most likely due to the term being used by a Linked Data exporter around the popular photo-hosting service Flickr [2]. In summary, the Linked Data phenomenon is huge, much larger than the first-generation Semantic Web, and its properties have not been fully studied. In particular, there has been little work on determining how the issues of the reference of URIs play out in the wild given by Linked Data.

3. SAMPLING LINKED DATA VIA QUERY LOGS
The main problem facing any empirical analysis of the Semantic Web is one of sampling. As almost any database can easily be exported to RDF, any sample of the Semantic Web can be biased by the automated release of large, if ultimately useless, data-sets. This was demonstrated in an exemplary fashion by the release of RSS 1.0 data. RDF vocabulary terms that have little content, such as rss:item, quickly bias the statistical analysis. With the advent of Linked Data, this has to some extent already happened, with large numbers of databases being released as Linked Data, ranging from the BBC’s John Peel recordings to the MusicBrainz audio CD collection [19]. How much of Linked Data is aimed at general use? Obviously, components like DBPedia, the export of Wikipedia to Linked Data, could be very useful [2]. The vast majority of data released into the Semantic Web is of appeal only to a niche audience, such as the large appeal of Bio2RDF to health care and the life sciences. Just as RSS 1.0 and the Livejournal export of FOAF biased sampling of the first-generation Semantic Web, the release of a large Linked Data set such as Bio2RDF, containing approximately 65 million triples and so rivaling the size of DBPedia, can bias any sampling of Linked Data [7]. For example, if one just counted the number of URIs used on the Semantic Web, one would quickly find that bio2rdf:xProteinLinks would prove to be, in sheer number, a very popular term despite its relative lack of use outside the biomedical community. It is a small step then to imagine ‘semantic spamming’ that releases large amounts of bogus URIs into the Semantic Web. Furthermore, due to the open nature of the Web, it is difficult, if not impossible, to determine how many actual separate providers of Semantic Web data there are, so choosing seed samples a priori or ‘weighting’ any sample is difficult.

noticeable ‘fits and starts’ as large data-sets are released, so each data-set can vastly alter any empirical analysis. The question is not how to avoid bias in sampling, but to choose the kind of bias one wants. We are aiming for a bias towards the ordinary user of the Web.

What information is available on the Semantic Web that ordinary users are actually interested in, and how do we sample this data? The obvious candidate for exploring this would be to look at a major search engine query log, as it gives a sample of the interests of many users in aggregate. Since Semantic Web search engines are currently used mostly by Semantic Web developers and not by ordinary users, the query log of a popular hypertext search engine should be sampled as opposed to a more specialized search engine. The entire bet of the Semantic Web is that it will contain information that many ordinary users will want to re-use and merge via Semantic-Web enabled applications, and that this information will primarily be about non-information resources such as entities like people and places and abstract concepts. Thus, the ideal sampling of the Semantic Web would be to extract query terms referring to physical entities and abstract concepts from a hypertext search engine query log, and then by virtue of a Semantic Web search engine we can determine precisely how much information Linked Data contains on these subjects.

3.1 The Live.com Query Log
There has been much work in query log analysis in order to discover how to best satisfy the information needs of users on the Web. Since most search query logs of any size belong to search engine companies, it is often difficult for researchers outside those companies to analyze these query logs, and therefore most research in search query logs deals with small or special-purpose query logs, such as the Web track in the TREC competition [20]. A few employees of large search corporations have released detailed studies of their search engine query logs. In particular, Silverstein et al.’s analysis of a billion queries in the Altavista query log is considered to be a large ‘gold-standard’ study of query logs [29]. In order to extract concepts and entities, we analyze the query log of approximately 15 million distinct queries from Microsoft Live Search, and all references to the ‘query log’ are to this Microsoft query log, which was provided by Microsoft due to a 2007 ‘Beyond Search’ award. This query log contains 14,921,285 queries. Of these queries, 7,095,302 (48%) were unique. Corrected for capitalization, 4,465,912 (30%) were unique. Of all queries, only 228,593 (2%) queries used some form of advanced keywords, while 709,102 (5%) used boolean operators and 266,308 (2%) used quotation, leading to a total of 1,204,003 (17%) queries using some advanced techniques provided by the search engine. The average number of terms per query was 1.76. Note that these extremely brief queries are normal for hypertext Web search engines, with an average query length of 2.35 being reported by Silverstein et al. for the Altavista query log [29]. Since we did not want to deal with queries that were only typed once or a few times, as these may not be representative of most users’ interests, we did not select for further use any queries with a frequency less than 10, resulting in a reduction of 37% from the total query log of 7,095,302.
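The summary statistics reported in Section 3.1 are straightforward to reproduce. The following is a minimal Python sketch, not the tooling used in the study: it assumes the log is available as a plain list of raw query strings, and the function name `query_log_stats` and its `min_freq` parameter are hypothetical. It counts unique queries before and after correcting for capitalization, computes the mean number of whitespace-separated terms per query, and keeps only the case-normalized queries seen at least `min_freq` times. Whether the original pruning was applied before or after case normalization is an assumption here.

```python
from collections import Counter


def query_log_stats(queries, min_freq=10):
    """Summarise a query log given as an iterable of raw query strings.

    Hypothetical helper for illustration; `min_freq` mirrors the frequency
    threshold of 10 mentioned in the text.
    """
    queries = list(queries)
    if not queries:
        raise ValueError("empty query log")

    raw_counts = Counter(queries)                        # exact-string uniqueness
    folded_counts = Counter(q.lower() for q in queries)  # corrected for capitalization
    total_terms = sum(len(q.split()) for q in queries)   # whitespace-separated terms

    # Keep only queries that were issued at least `min_freq` times.
    frequent = {q: n for q, n in folded_counts.items() if n >= min_freq}

    return {
        "total": len(queries),
        "unique": len(raw_counts),
        "unique_case_folded": len(folded_counts),
        "mean_terms": total_terms / len(queries),
        "frequent": frequent,
    }
```

For example, calling `query_log_stats(log, min_freq=10)` on a list `log` of query strings returns a dictionary whose `"frequent"` entry maps each retained query to its frequency.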
Unlike the original Web, which grew at least in an organic fashion for its first few years, the Web of Linked Data grows in very

3.2 Extracting Queries for Entities and Concepts
Automatically classifying informational queries is difficult. Rule-based approaches that claim to work over entire query logs, like those of Jansen et al. [21], are dubious at best, since they work by applying very loose specifications such as “query length greater than 2” and “any query using natural language terms.” More promising work has applied both supervised and unsupervised machine-learning to discover informational queries, but only achieved an accuracy of 50% [3]. A number of machine-learning algorithms could be employed to learn named entities, but the sparse amount of linguistic context in query logs makes identifying named entities difficult in an unsupervised manner, and there is virtually no labeled data for supervised learning [33]. Even most rule-based approaches for named entity recognition rely heavily upon capitalization and punctuation, such as ‘I.B.M.’ and ‘Gustave Eiffel,’ features that are lacking from query logs [23].

We call queries that are automatically identified to be about physical entities in the query log entity queries. For the discovery of entity queries, people and places are obvious places to begin. An updated version of the system that was the highest performer at MUC-7 [23], a straightforward gazetteer-based and rule-based named entity recognizer, was employed to discover the names of people and places. The gazetteer for names was based on a list of names maintained by the Social Security Administration and the gazetteer for place names was based on the gazetteer provided by the Alexandria Digital Library Project. Although it could be possible to separate out people and places, this was not done. First, both of these are types of entities. Second, the names of many locations, such as ‘Paris’ or ‘Georgia,’ can also be used as personal names. This gazetteer-based approach was chosen to provide high precision, even at the cost of a dramatically reduced recall. This is an acceptable trade-off as we are attempting only to sample the number of queries that would be likely to have URIs on the Semantic Web. A high-quality sample of the query log is more important than a large one for this purpose. Of a random sample of 100 entity queries, a judge considered 94% to be correctly categorized as entities such as people or places.

From the pruned unique queries in the query log, totaling 4,465,912 queries, a total of 509,659 queries (11%) were identified as either people or places by the named-entity recognizer. The top 10 entity queries are given in Table 1. Some transactional and navigational queries, despite their relatively lower frequency overall in the query log, are highly clustered towards the top of the query distribution. These navigational queries, such as ‘chase’ and ‘office max,’ have clearly snuck into the top ten due to their use of common names in their website names. A legitimate number of real names, such as ‘jessica alba’ and ‘marcus vick,’ were discovered.

Frequency  Query
7311       david blaine
4039       kelly blue book
3053       chase
2997       jessica alba
2100       nick
1415       office max
1280       michael hayden
1139       harley davidson
1098       marcus vick
1092       keith urban

Table 1: Top 10 Entity Queries in Query Log

A method for discovering abstract concepts in the query log is more challenging. These queries are called concept queries, queries that are automatically identified to be about abstract concepts in the query log. Previous attempts at discovering abstract concepts have employed machine-learning over truly massive query logs and document collections from Google [27]. Since this massive amount of data was not available, we employed WordNet instead. WordNet consists of approximately 207,000 words with unique synsets. Our algorithm for discovering abstract concepts in query logs using WordNet was straightforward: we only chose queries of length one where the query had both a hyponym and a hypernym, due to the difficulty of WordNet dealing with some multi-word queries. This assured that the query was for a class that was suitably abstract (having a hyponym) but not so abstract as to be virtually meaningless (having a hypernym). This resulted in a more restricted 16,698 concept queries (0.4% of the total query log). The top 10 concept queries are given in Table 2. Again, a number of clearly transactional queries have managed to find themselves into the concept queries, such as ‘chase’ and ‘drudge,’ as well as a number of queries where the sense of a word has been taken over by a proper name, such as ‘sprint’ and ‘aim.’ Again, this is due to the preponderance of navigational names towards the top of the query distribution. Of a random sample of 100 concept queries, a judge considered 98% to be correct. While some of the queries could be considered somewhat navigational (such as those for maps and dictionaries), they could all be considered informational queries about some abstract concept.

Frequency  Query
11383      weather
10321      dictionary
3675       people
3217       music
2192       autism
1468       map
1198       travel
1191       pregnancy
1104       news
1052       charter

Table 2: Top 10 Concept Queries in Query Log

3.3 Power-Law Detection
The frequency of queries, when rank-ordered, follows what is known as a ‘power-law’ distribution, with a relatively small number of very popular queries and a long tail of queries only occurring once or twice, where most of the mass of the distribution is in the long tail and the ‘top’ of the distribution exponentially decreases. Since this distribution is common in search on the Web, we will define it precisely. A power-law is a relationship between two scalar quantities x and y of the form:

$$y = c x^{\alpha} + b \qquad (1)$$

where α and c are constants characterizing the given power-law, and b is some constant or variable dependent on x that becomes constant asymptotically. Typically it is applied to rank-ordered frequency diagrams, where the frequency of some measurement is given on the horizontal axis while the rank order of the measurements in terms of their frequency is given on the vertical axis. The α exponent is the scaling exponent that determines the slope of the top of the distribution and provides the remarkable property of scale-invariance, such that if a true power-law is observed, as more samples are added to the distribution, the α remains constant, i.e. the distribution is ‘scale-free’ [32]. It is crucial to note that a power-law distribution violates assumptions of the normal Gaussian distribution, such that routine statistics such as averages and standard deviations can be and usually are misleading. In fact, one of the surest signs of a non-normal distribution like a power-law distribution is a very large standard deviation. Is such a distribution evident from Linked Data?

One important question is how to detect power-law distributions in actual data. Neglecting b, Equation 1 can also be written as:

$$\log y = \alpha \log x + \log c \qquad (2)$$

When written in this form, a fundamental property of power-laws becomes apparent: when plotted in log-log space, power-laws are ‘straight’ lines. Thus, the most widely used method to check whether a distribution follows a power-law is to apply a logarithmic transformation, and then perform linear regression, estimating the slope of the function in logarithmic space to be α, as done by Ding and Finin [16]. However, standard least-squares regression has been shown to produce systematic bias, in particular due to fluctuations of the long tail [14]. To determine a power-law accurately requires minimizing the bias in the value of the scaling exponent and the beginning of the long tail via maximum likelihood estimation. See Newman [25] and Clauset et al. [14] for the technical details.

Determining whether a particular distribution is a ‘good fit’ for a power-law is difficult, as most ‘goodness-of-fit’ tests employ normal Gaussian assumptions violated by potential power-law distributions. Luckily, the non-parametric Kolmogorov-Smirnov test can be employed for any distribution and so is ideal for measuring the ‘goodness-of-fit’ of a given finite distribution to a power-law function. While the details are given at length in Clauset et al. [14], intuitively the Kolmogorov-Smirnov test can be thought of as follows. Given a reference distribution P, such as an ideal power-law distribution generating function, and a sample distribution Q of size n suspected of being a power-law, where one is testing the null hypothesis that Q is drawn from P, the Kolmogorov-Smirnov test compares the cumulative frequency of both P and Q to discover the greatest discrepancy (the D-statistic) between the two distributions. This D-statistic is then tested against the critical value p of the D-statistic at n, which varies per function. The null hypothesis is rejected if the D-statistic is less than the critical p-value for n, p being the probability that the distribution was drawn from a power-law generating function given the estimated parameters. In order to determine how well the power-law method fits, whenever a power-law is reported, the D-statistic is also reported, and we will determine whether or not the fit was significant according to the conservative p < .1. The Kolmogorov-Smirnov test is valid even for power-law distributions since Q’s cumulative density function is asymptotically normally distributed and this can be compared to the cumulative density function of P.

The query frequencies for entity and concept queries are plotted in logarithmic space in Figure 1. Both entity and concept queries appear to be linear in log-space, and so can be considered candidates for power-laws. Using the method described above, the α of the queries for entities was calculated to be 2.31, with long tail behavior starting around a frequency of 17 and a Kolmogorov-Smirnov D-statistic of .0241, indicating a significantly good fit. The α of the concept queries was calculated to be 2.12, with long tail behavior starting around a frequency of 36 and a Kolmogorov-Smirnov D-statistic of .0170, also indicating a significantly good fit for the power law. Given their two remarkably similar α statistics and high goodness of fit, one can safely conclude that these query logs do indeed follow power-law distributions. This indicates our sample of entities and concepts is representative of larger query logs, which are well known to follow power-law distributions [4].

Figure 1: The rank-ordered frequency distribution of extracted entity and concept queries, with the entity queries given by green and the concept queries by blue. (Log-log axes: popularity-ordered queries on the horizontal axis, query popularity on the vertical axis.)

3.4 Querying Linked Data with FALCON-S
Both the concept queries and the entity queries are used to query the Semantic Web. Since our goal was to discover how much of interest for ordinary users was present on the Semantic Web, one problem with using the entire query log was that it would contain a vast amount of unique queries that would likely never be repeated. So, we excluded a portion of the long tail from the study by removing all queries with a frequency of less than 10. The parameter 10 was chosen as it was the number that could reduce both entity and concept queries to the same order of magnitude. Due to the power-law behavior of both entity and concept queries, this truncation consists of ‘removing’ a large amount of the long tail, while maintaining the entire ‘top’ of the power-law distribution, as well as some significant component of the long tail. This procedure is justified insofar as the ‘long tail’ likely consists of queries that are never or very rarely repeated, while the remaining queries represent queries that are likely to be repeated. This pruning of low-frequency queries from our sampling does exclude many ‘difficult’ or ‘specialist’ queries, but we are aiming for queries that are general-purpose and popular. We call these queries with more than 10 URIs returned from the Semantic Web the crawled queries to distinguish them from the greater query log. Likewise, crawled entity queries are entity queries with more than 10 URIs returned from the Semantic Web, and similarly for crawled concept queries.

This truncation reduced the amount of queries significantly, from 587,283 to 7,848 queries, removing 99% of the queries. It reduced the number of entity queries from 570,585 to 5,308 (a 99% reduction) and the number of concept queries from 16,698 to 2,540 (an 85% reduction). This gap in the result of pruning off the ‘long tail’ is interesting, as it shows that while there is a lower amount of concept queries than entity queries overall, concept queries are repeated an order of magnitude or so more often than entity queries. The only caveat is that our identification of concept queries via WordNet is likely more stringent than our identification of entity queries, and thus leads to fewer concept queries overall.

Figure 2: The rank-ordered frequency distribution of the number of URIs returned from entity and concept queries, with the entity queries given by green and the concept queries by blue. (Log-log axes: frequency-ordered returned Semantic Web URIs on the horizontal axis, frequency of Semantic Web URIs returned on the vertical axis.)
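The fitting procedure of Section 3.3 can be illustrated with a short Python sketch. This is not the code used in the study. It assumes the standard continuous maximum-likelihood estimator of the scaling exponent, α = 1 + n / Σ ln(x_i / x_min), of the kind discussed in [14, 25]. It takes the start of the tail `x_min` as given rather than estimating it, and it reports only the Kolmogorov-Smirnov D-statistic, not a significance level. The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(samples, x_min):
    """Fit a continuous power law to the samples >= x_min.

    Returns (alpha, d_statistic, n), where alpha is the maximum-likelihood
    scaling exponent, d_statistic is the Kolmogorov-Smirnov distance between
    the empirical and fitted cumulative distributions, and n is the number
    of samples in the tail. Illustrative sketch only.
    """
    tail = sorted(x for x in samples if x >= x_min)
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need samples strictly above x_min to fit a tail")

    # Maximum-likelihood estimate of the scaling exponent.
    alpha = 1.0 + n / log_sum

    # Fitted cumulative distribution: F(x) = 1 - (x / x_min)^(1 - alpha).
    # The D-statistic is the largest gap between F and the empirical step
    # function, checked just before and just after each sample.
    d_statistic = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        d_statistic = max(d_statistic,
                          abs(model - i / n),
                          abs(model - (i + 1) / n))
    return alpha, d_statistic, n
```

Because query frequencies are integers, a discrete estimator would be more faithful to the data; the continuous form is used here only for brevity. Repeating the fit over candidate values of `x_min` and keeping the one with the smallest D-statistic is one way to choose where the long tail begins.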
Furthermore, the vast majority of entity queries, as opposed to concept queries, appear to be queries that are issued only once or a very few times. This would make a certain amount of sense, as many queries for people and places are not for famous people and places, but for infrequently-mentioned people and places, such as wayne way san mateo and sara matthews. Concept queries were as diverse as gastropod and accolade. Still, the crawled queries are biased significantly in favor of entity queries, being composed of 68% entity queries and only 32% concept queries.

The FALCON-S Object Semantic Web search engine [13] was used to query the Semantic Web for selected entity and concept queries between August 3rd and 4th 2008. We recognize that this is a major weakness of the study, as its index may not be a representative sample of the entire Linked Data Web, but it is a significant sample regardless. At the time, FALCON-S seemed to have the best rankings, and a comparable index to other engines. The results of running the crawled queries against a Semantic Web search engine were surprisingly fruitful, although varying immensely. For entity queries, there was an average of 1,339 URIs (S.D. 8,000) returned per query. On the other hand, for concept queries, there was an average of 26,294 URIs (S.D. 14,1580) returned per query, with no queries returning zero documents. Given the high standard deviation of these results, it is likely that there is either a power-law in the resulting URIs for the queries, or some other non-normal distribution. As shown in Figure 2, when plotted in logarithmic space, both entity queries and concept queries show a distribution that is heavily skewed towards a very large number of high-frequency results, with a steep drop-off to almost zero results instead of the characteristic long tail of a power law. Far from having no information that might be relevant to ordinary user queries, the Semantic Web search engine returned either too many URIs possibly relevant to the query or none at all.

Another question is whether or not there is any correlation between the amount of URIs returned from the Semantic Web and the popularity of the query. As shown by Figure 3, there is no correlation between the amount of URIs returned from the Semantic Web and the popularity of the query. For entity queries, Spearman’s rank correlation statistic was an insignificant .0077 (p > .05), while for concept queries the correlation was a still insignificant .0125 (p > .05). Just because a query is popular or unpopular does not mean the Semantic Web will be more or less likely to satisfy the information need of the query. This makes sense, as the vast majority of queries are heavily dependent on current events and fashion, and the Linked Data sources are not updated often enough to deal with this kind of information, so there is an inevitable temporal lag between the time information appears in the world outside the Semantic Web and its digitization on the Semantic Web. Yet as shown by Figure 2, the amount of possibly useful information for the vast majority of queries is still surprisingly large, although how many of the returned URIs are actually relevant to human users is not yet known.

Figure 3: The rank-ordered popularity of entity and concept queries is on the x-axis, with the y-axis displaying the number of Semantic Web URIs returned, with the entity queries given by green and the concept queries by blue.

4. EMPIRICAL ANALYSIS OF THE SEMANTIC WEB
Surprisingly, there is a deluge of possible Semantic Web URIs for any given query. Due to the high number of results for each query, we restricted our analysis to the top 10 Semantic Web URI results for each query as given by FALCON-S’s ranking algorithm, and distinguish this subset from all the URIs returned by the Semantic Web by calling this subset the crawled URIs. Concept URIs are crawled URIs from the crawled concept queries, while entity URIs are crawled URIs from the crawled entity queries. Although crawled URIs are a small subset of the total URIs retrieved, given that users in general inspect only the first ten URIs returned by a search [18], it makes more sense to sample these ten URIs per query than to sample every URI retrieved. The crawled URIs totaled 70,128 URIs, composed of 25,400 (36%) concept URIs and 44,728 (64%) entity URIs. These URIs were crawled using HTTP GET with a preference for the media type application/rdf+xml in order to prefer RDF files served by content negotiation, and any 303 redirection was followed.

Of all crawled queries, a total of 6,673 (85%) had at least 10 crawled URIs. All concept queries had at least 10 crawled URIs and only 4,133 of the entity queries (12% of all entity queries) did not have 10 crawled URIs. Inspecting just the set of queries that did not have 10 crawled URIs, the average number of URIs returned was 2.89 (S.D. 2.88). So, the trend observed earlier was repeated in this smaller data-set, namely that while most of the time too many URIs are retrieved from the Semantic Web, sometimes no URIs are retrieved from the Semantic Web for certain entity queries. Looking at the data more closely, 357 (30%) of the crawled queries with fewer than 10 results returned no URIs, while 138 (12%) returned a single URI and 113 (10%) returned two URIs. These queries with zero results seem to be mostly for not well-known places such as playa linda (a hotel in Majorca), fairly unknown people such as william ravies, or misspellings or popular truncations of names for people such as steven colbertbush. This observation helps explain the sudden drop in Semantic Web URIs returned for queries in Figure 3. There was little overlap between the crawled URIs retrieved by different queries, with an overlap in entity queries of 546 URIs (.01%) and an overlap in concept queries of 1031 URIs (.04%). In other words, the various queries weren’t just retrieving the same small group of URIs over and over again.

4.1 URI-based Statistics
In this section, we inspect the various kinds of statistics we can detect on the ‘macro-level’ of the crawled URIs without actually accessing any Semantic Web documents from the URIs.

The HTTP status codes returned by attempting to access the various crawled URIs are given in Table 3. In particular, the most revealing statistic is that the majority of the Semantic Web sampled by the crawled URIs is served using the 303 convention, not the hash convention. In fact, a total of 51,762 (73%) of crawled URIs use the 303 convention, while only 1,662 (2%) of the crawled URIs use the hash convention. Of the URIs using the hash convention, manual inspection showed many to be FOAF files. This shows the vast majority of Linked Data is following the 303 convention and so obeying the W3C and the guide to publishing Linked Data [11]. This statistic as regards usage of the 303 convention is misleading in the broad sense, as most of the URIs are from a single source, DBPedia, as shown later in Table 4.

The majority of URIs, 51,873 (74%), served a Semantic Web document via 303 redirection, and so returned the 200 status code when the Semantic Web document was accessed after the redirection. 200 status codes without 303 redirection still form a substantial fraction of Semantic Web URIs. There are several reasons for this. All hash convention URIs would by default still technically commit a redirect to be served by a 200 status code. However, this is only a minority (27%) of those URIs returning a 200 status code. The rest are likely caused by people serving RDF who do not have the access to the Web server configuration needed to serve RDF using 303 redirection, while many others may have started serving RDF before the W3C TAG decision [28] was made or are not aware of Linked Data best practices. For example, some earlier RDF-enabled repositories like W3C WordNet used 300 redirection. A small percentage may be ordinary web-pages, perhaps containing some meta-data as enabled by GRDDL, that just happened to be indexed by the Semantic Web search engine [15]. Furthermore, of these crawled URIs, 9,156 (13%) URIs had no Semantic Web document that was accessible via HTTP, shown by the use of a 4xx or a 5xx-level status code.

URIs     Share    HTTP status
51,873   73.97%   303
6,061    8.65%    200
4,517    6.44%    404
4,257    6.07%    500
3,147    4.49%    300
246      0.35%    406
20       0.03%    403
4        0.00%    302
3        0.00%    502

Table 3: Top 10 HTTP Status Codes for crawled URIs

The top 10 hosts of Semantic Web data in the crawled URIs are given in Table 4. DBPedia, the export of Wikipedia to RDF, dominates the results with 83% of all URIs coming from either Wikipedia or DBPedia [2]. The W3C itself is the third largest exporter of RDF with a share of 5%. Upon closer inspection, most of the URIs crawled from the W3C derive from the W3C-hosted export of the linguistic database WordNet. The domain of the Freie Universität Berlin has a significant 2% of all RDF data, which is due primarily to its Flickr photo export to RDF. An RDF version of Cyc and the biomedical data hosting site Bio2RDF also host small but significant amounts of Semantic Web data [22]. The Russian blog-hosting site Liveinternet.ru carries on the tradition of FOAF exporting of Livejournal. Truesense is another export of WordNet to RDF, although not as frequently used as W3C WordNet. Towards the end of the ranking there is the RDF version of Universität Trier’s widely used DBLP academic citation database and Ontoworld.org, an RDF-enabled wiki for the Semantic Web research community [31].

The average number of URIs hosted by a domain name was 1,268 (S.D. 16,060), with the average number of entity

Figure 4: The rank-ordered distribution of the domain names hosting Semantic Web data from the crawled URIs ordered by number of URIs hosted. (Log-log axes: URI frequency-ordered domain names on the horizontal axis, number of URIs on the vertical axis; series: entity URIs, concept URIs, total Semantic Web URIs.)

accessible crawled URIs contained 24,074 accessible crawled concept URIs (95% of all crawled concept URIs) and 36,898 accessible crawled entity URIs (82% of all crawled entity URIs). Thus, the accessible crawled URIs maintained a bias towards entity URIs (61% of all accessible crawled URIs) as compared to concept URIs (39% of all accessible crawled URIs). Each of the crawled accessible URIs was accessed, and this resulted in a total of 59,228 Web representations, with only 48 URIs not allowing access to a Semantic Web document. These non-Semantic Web documents were usually ordinary web-pages from which RDF triples could not be extracted via GRDDL [15] or RDFa [1]. We will call these the crawled Semantic Web documents, and the total sum of triples in these documents the crawled triples.

There were a total of 411,574 RDF triples in the crawled triples, with 242,829 (59%) triples for concepts and 168,745 (41%) triples for entity URIs. Concepts, despite being fewer in number, seem to require more triples to describe than entities. The internal structure of these triples is of surprising interest.
Of these triples, there were a total of 1,051 URIs hosted by any domain being 1,236 (S.D. 15,458) and triples containing blank nodes, a measly .25% of all triples the average number of concept URIs hosted by a domain in the corpus, of which 772 (73%) were subjects and only being 1,0327 (S.D. 6,650). The very high standard devia- 279 (27%) were in the object position. This means that tions are usually a sign of power-law distribution, as shown the use of blank nodes, whose purpose is as syntactic place- in in Figure 4. Attempting to fit a power-law distribution, holders in URIs for objects like lists and in representing n- the α of the rank-ordered domain list frequency distribu- ary arguments in RDF, is almost non-existent in our sample. tion is 1.53, with long tail behavior starting around 175 and Removing blank nodes, the composition was split between a Kolmogorov-Smirnov D-statistic of .1414, indicating in- URI nodes (66%) and a surprisingly large minority of RDF significant fit for the power-law distribution. In other words, literals nodes (34%). These literals contain some form of in- while a few sources like DBPedia dominates the crawled formation in either ‘unstructured’ natural language or some URIs, with an rapidly decreasing number of smaller sites form of structured information in a formal language, such such as Cyc and the W3C, the long-tail individuals URIs as integer values. hosting their FOAF files on their personal websites is still Of the literals, a total of 403,119 were RDF string lit- rather insignificant compared to the ‘top’ major sites host- erals, while only 2% were of some other data type, with ing Linked Data. This is because the Linked Data is being top 10 frequent data-types given in Table 5. The most fre- artificially generated in large ‘chunks’ by projects like W3C quent data-types are from XML Schema [10], while others Wordnet and DBpedia, and so do not organically form the are customized for DBPedia. 
It appears that the vast ma- power-law distribution characteristic of naturally-evolving jority of RDF in the Semantic Web of interest to average complex systems. users are simple URI-based triples with rich information in natural language. This also goes against the intuition that the vast majority of Semantic Web data that is of interest 54,698 78.00% dbpedia.org to ordinary users would be highly structured data of ex- 3,584 5.11% wikipedia.org ported databases [8]. Instead, what is of interest in Linked 3,448 4.92% w3.org Data is stored mainly in natural language, with RDF adding 1,704 2.43% fuberlin.de only a minimal structure to essentially fragments of natu- 811 1.16% cyc.com ral language. While it could be argued that this particular 701 1.00% bio2rdf.org finding is merely an artifact of DBpedia, however, it should 599 0.85% liveinternet.ru be acknowledged that DBpedia is, given that our querying 417 0.59% truesense.net includes other data-sets, this finding may well be generaliz- 322 0.46% dblp.unitrier.de able. We are not studying the Semantic Web as some of its 314 0.47% ontoworld.org designers would like to have it, but as it actually exists, and part of its existence is that DBpedia forms a huge central Table 4: Top 10 Domain Names for URIs for cluster that for ordinary users is the most interesting and Crawled URIs useful part of Linked Data. One interesting question is the predominance of the vari- ous kinds of Semantic Web knowledge representations terms 4.2 Triple-based Statistics on the Semantic Web, since this would show what kinds of inference could actually be deployed on the Semantic In this section, we move our analysis down from the level Web. First, of the total 1,093,212 URIs in triples harvested of URIs to the level of the triples accessible from the URIs. 
from the crawled accessible URIs, only 243,776 (22%) were Since a number of crawled URIs were inaccessible, this re- from one of the primary W3C Semantic Web knowledge duced the total number of accessible crawled URIs to representation languages, either RDF, RDF(S), or OWL. 60,972, a reduction of (13%) from the crawled URIs. The 403,119 97.95% RDF plain literal controversial owl:sameAs term, which is used to declare some 3,103 0.75% w3c:/XMLSchema#integer sort of global equivalence between two URIs. While a tiny 2,789 0.68% w3c:/XMLSchema#string portion (.47%) of overall Semantic Web modelling term us- 1,185 0.29% w3c:/XMLSchema#double age, it is far from insignificant, with 1,157 occurrences. The 522 0.13% w3c:/XMLSchema#date use of owl:sameAs in the wild is far different than the role it 248 0.06% w3c:/XMLSchema#float plays in popular debate within the Semantic Web commu- 136 0.03% w3c:/XMLSchema#gYear nity would suppose. Logicians hold that owl:sameAs is only 65 0.02% w3c:/XMLSchema#gYearMonth for what is properly considered individuals in description 59 0.01% dbpedia:Rank logic, so that classes and properties should use the more re- 46 0.01% dbpedia:Dollar stricted and semantically correct owl:equivalentClass and 14 0.00% w3c:/XMLSchema#int owl:equivalentProperty. Yet this best practice in logic 9 0.00% dbpedia:Percent hasn’t the Linked Data community, as owl:equivalentClass has only 2 occurrences and there are none of Table 5: Common Data Types in Crawled Triples owl:equivalentClass. Instead, the Linked Data movement uses owl:sameAs to simply “state that another data source also provides information about a specific non-information resource,” so leading owl:sameAs to tend to mean ‘more-or- less the same thing as’ [11]. 
This practice leads to the fear Of these, the RDF vocabulary itself was the most popu- that the use of owl:sameAs would propagate too far, such lar, with 109,300 URIs (45%), followed fairly closely by the that many URIs for the perhaps differing referents would be RDF(S) vocabulary with 100,340 URIs (41%), and OWL declared identical [17]. being dwarfed by RDF and RDF(S) with only 34,136 URIs Both critiques of owl:sameAs appear to be wrong. Given (14%). This does not mean that OWL is irrelevant to the the amount of Semantic Web URIs returned by the queries, other corpus, as ontologies constructed with OWL could be while there is considerable use of owl:sameAs, it appears deployed to model the concepts and entities employed in that the manual discovery and publication of co-referential ‘instance’ data. Yet while OWL has been an academic suc- URIs using owl:sameAs falls far behind the actual growth of cess story, insofar as practical deployment, RDF terms and Linked Data. One could say that owl:sameAs is not being RDF(S)-based inference seems to be the foundation of the used enough. The real problem is not that distinct things Semantic Web in practice. are being given the same URI, but the reverse; namely that What precise URI-based terms are used in these knowl- it appears endemic that the same thing has multiple URIs. edge representation languages? The top constructs in ei- So Berners-Lee’s hypothesis appears to be wrong: A single ther RDF, RDF(S), or OWL in crawled triples are given in thing is likely identified by more than a single URI on the Table 6. To summarize, RDF(S) class and sub-class rea- Semantic Web. soning is very popular, with this construction consisting of nearly half (48%) of knowledge representation use of the Se- mantic Web. 
The second most popular use of knowledge 73,451 30.31% rdfs:Class representation (22%) is for natural language annotation, de- 47,044 19.30% rdfs:comment scribing a particular Semantic Web resource using natural 44,113 18.10% rdfs:subClassOf language and connecting this natural language description to 8,630 3.54% owl:Ontology the URI via the use of rdfs:comment or rdfs:label. There 7,256 2.97% rdfs:label are surprisingly few (4%) actual ontologies in the crawled 6,618 2.14% rdf:Subject Semantic Web resources. Furthermore, non-traditional fea- 5,107 2.09% owl:ObjectProperty tures of RDF(S), such as the use of rdfs:property, are fre- 3,642 1.49% rdfs:subPropertyOf quent occurrences. Even reification of RDF triples, officially 1,157 0.47% owl:sameAs discouraged by the Semantic Web community, accounts for 535 0.29% rdfs:range only 95 triples, and there is also fairly heavy use of discour- aged RDF constructs to represent different kinds of lists, Table 6: RDF and OWL Constructs in Crawled such as rdf:Alt (349 occurrences) and rdf:Bag (344 oc- Triples currences). Lastly, while many Semantic Web researchers originally hoped that the use of inverse functional proper- ties would allow the merger of Semantic Web data, there were zero explicitly declared usages of The top 10 Semantic Web vocabularies used in the crawled owl:inverseFunctionalProperty. Overall, the usage of OWL, triples, including those outside of the W3C-approved Seman- RDF(S), and RDF terms in the corpus also follows to some tic Web knowledge representation languages, are shown in degree a power-law like distribution, where α equal to 1.5, Table 7. The results should not be that surprising, in par- with long tail behavior starting around 90, although the ticular the vast dominance of DBPedia. Perhaps surprising Kolmogorov-Smirnov D-statistic of .1911 reveals this to in- is the surprising amount of usage of Cyc terms, as well as significant. 
This is because while a few terms vastly dom- terms from SKOS, the Simple Knowledge Organization Sys- inate, the vast majority of other terms are not used at all. tem of the W3C, whose primary source of deployment is the This has reprecussions for both Semantic Web implementers W3C’s export of WordNet to RDF [24]. FOAF is also signif- and vocabulary specification within the W3C, since obvi- icant, although not nearly as dominant as was found earlier ously some level of concentration of effort upon the most by Ding and Finin [16]. Also popular is YAGO (Yet Another frequently-deployed terms would be reasonable. Global Ontology), a merger of WordNet and Wikipedia cat- One of the most popular OWL constructs is indeed the egory hierarchies employed by DBPedia [30]. 366,849 33.55% DBpedia URIs beside FALCON-S, which we recognize is a major limiting 109,300 9.99% RDF URIs factor. Second, there is likely too many URIs in Linked Data 100,340 9.17% RDF(S) URIs for a given query, although to truly substantiate this claim 94,520 8.65% Cyc URIs ideally the URIs returned by the search engines should each 34,136 3.12% OWL URIs be individually inspected, although this is difficult in prac- 6,563 0.60% SKOS URIs tice. Yet even at this point it seems is likely that there are 4,728 0.43% dblp.l3s.de many co-referential URIs for the ‘same thing’ that are not 3,263 0.29% FOAF URIS explicitly modelled with owl:sameAs, and unless action is 2,170 0.20% YAGO URIs taken this growth of URIs will contine of the future. Unless 1,836 0.16% WordNet URI there is URI re-usage many of the data-sources for Linked Data are more like semantic islands rather than parts of Table 7: Top Vocabulary URIs in Crawled Triples interconnected semantic continents. 6. ACKNOWLEDGEMENTS Harry Halpin was supported in part by a Microsoft “Be- 5. CONCLUSION yond Search” grant. 
The empirical analysis of Linked Data presented in this study is by no means complete, for it is only a moderately small sample by one Semantic Web search engine (and so 7. REFERENCES hurt or benefit by the idiosyncratic behavior of the search- [1] B. Adida, M. Birbeck, S. McCarron, and ing of FALCON-S), although it is an important one as this S. Pemberton. RDFa in XHTML: Syntax and sample is driven by Web search queries by actual users. The Processing. W3C Recommendation, W3C, 2008. results of this empirical analysis show a transformation from http://www.w3.org/TR/rdfa-syntax/. the first-generation Semantic Web to the next generation [2] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, Web of Linked Data. The Semantic Web as it existed in R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a the first-generation was a motley collection of RDF triples, web of open data. In Proceedings of the International heavily dominated by a few exports of social networking and Asian Semantic Web Conference data into FOAF and a long-tail of complex academically- (ISWC/ASWC2007), pages 718–728, Busan, Korea, produced ontologies. Linked Data - at least the section of it 2007. that is of interest to users querying the Web for information [3] R. Baeza-Yates, L. Calderon-Benavides, and - is dominated heavily by DBPedia and consists primarily C. Gonzalez. Understanding user goals in web search. of collections of triples that provide a minimal structure to In Proceedings of String Processing and Information natural language [16]. Retrieval (SPIRE), pages 98–109, 2006. On the level of triples, there are some surprising conclu- [4] R. Baeza-Yates and B. Ribeiro-Neto. Modern sions. The triples on the Semantic Web contain a vast range Information Retrieval. Addison Wesley-Longman, New of data, and the exact kinds of URIs used in the triples are York City, New York, USA, 1999. somewhat unpredictable. However, the kinds of vocabular- [5] A.-L. Barabasi, R. Albert, H. Jeong, and G. Bianconi. 
ies actually deployed are almost entirely from a few large Power-law distribution of the World Wide Web. vocabularies, such as DBPedia, DBLP, WordNet, YAGO, Science, 287:2115, 2000. and FOAF. This again points to a victory of Berner-Lee’s [6] G. Beged-Dov, D. Brickley, R. Dornfest, I. Davis, idea that a few large vocabularies with well-defined terms L. Dodds, J. Eisenzopf, D. Galbraith, R. Guha, could dominate the Semantic Web [9]. The kinds of triples K. MacLeod, E. Miller, A. Swartz, and E. van der that structured this data do not contain many OWL terms Vlist. RDF Site Summary (RSS) 1.0. Technical report, optimized for inference, but consist almost entirely relatively http://web.resource.org/rss/1.0/spec, 2001. straight-forward RDF(S) expressions for sub-class relation- ships and for annotations in natural language. Overall, [7] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and Linked Data is primarily being used to provide structured J. Morissette. Bio2rdf: Towards a mashup to build relationships between fragments of natural language, and bioinformatics knowledge systems. Journal of not for inference. Biomedical Informatics, 41(5):706–716, 2008. One could argue that that these results are more charac- [8] T. Berners-Lee. What the Semantic Web can teristic of FALCON-S and DBpedia than the second-generation represent, 1998. Informal Draft. Linked Data as a whole. However, we would respond that http://www.w3.org/DesignIssues/rdfnot.html (Last it is natural in decentralized information systems for power accessed on Sept. 12th 2008). law distributions, where one source of data massively out- [9] T. Berners-Lee and L. Kagal. The fractal nature of the weighs others in weight to evolve, and the ‘giant component’ Semantic Web. AI Magazine, 29(3), 2004. of Linked Data is DBpedia [5]. In fact, if such a ‘giant com- [10] P. Biron and A. Malhotra. XML Schema Part 2: ponent’ and long tail were not observed, it would be cause Datatypes. Recommendation, W3C, 2004. 
for suspicion. In conclusion, there is potentially lots of rich http://www.w3.org/TR/xmlschema-2/. information that ordinary Web search users in Linked Data [11] C. Bizer, R. Cygniak, and T. Heath. How to publish form, and so one outcome of this analysis should be a greater Linked Data on the Web, 2007. interest in Linked Data from even mainstream information http://www4.wiwiss.fu- retrieval systems. However, for future work we wish to re- berlin.de/bizer/pub/LinkedDataTutorial/ (Last peat this study over different Semantic Web search engines accessed on May 28th 2008). [12] C. Bizer and A. Seaborne. D2RQ: Treating non-RDF pages 683–690, New York, NY, USA, 2007. ACM. databases as virtual RDF graphs. In Proceedings of [28] L. Sauermann and R. Cygniak. Cool URIs for the International Semantic Web Conference, 2004. Semantic Web. Technical report, W3C Semantic Web [13] G. Cheng, W. Ge, and Y. Qu. FALCONS: Searching Interest Group Note, 2008. and browsing entities on the semantic web. In http://www.w3.org/TR/cooluris/. Proceedings of the the World Wide Web Conference, [29] C. Silverstein, H. Marais, M. Henzinger, and 2008. M. Moricz. Analysis of a very large web search engine [14] A. Clauset, C. Shalizi, and M. Newman. Power-law query log. SIGIR Forum, 33(1):6–12, 1999. distributions in empirical data, 2007. [30] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: http://arxiv.org/abs/0706.1062v1 (Last accessed a core of semantic knowledge. In In Proceedings of the October 13th 2008). 16th International Conference on World Wide Web, [15] D. Connolly. Gleaning Resource Descriptions from pages 697–706, New York, NY, USA, 2007. ACM. Dialects of Languages (GRDDL). Technical report, [31] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, and W3C, 2007. Recommendation. R. Studer. Semantic wikipedia. In Proceedings of the [16] L. Ding and T. Finin. Characterizing the Semantic International Conference on World Wide Web Web on the Web. 
In Proceedings of the International (WWW), pages 585–594, New York, NY, USA, 2006. Semantic Web Conference (ISWC), pages 242–257, ACM. 2006. [32] D. Watts and S. Strogatz. A review of ontology based [17] A. Ginsberg. The big schema of things. In Proceedings query expansion. Nature, 6684(393):409–410, 1998. of Identity, Reference, [33] C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. H. and the Web Workshop at the WWW Conference, 2006. Ungar. Web-scale named entity recognition. In http://www.ibiblio.org/hhalpin/irw2006/aginsberg2006.pdf. Proceedings of Conference on Information and [18] L. Granka, T. Joachims, and G. Gay. Eye-tracking Knowledge Management, pages 123–132. ACM, 2008. analysis of user behavior in www search. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 478–479, New York, NY, USA, 2004. ACM. [19] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the size of the Semantic Web? In Proceedings of Conference on Semantic Systems (iSemantics), Graz, Austria, 2008. http://tomheath.com/papers/hausenblas- isemantics2008-size-of-semantic-web.pdf. [20] D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the trec-8 web track. In Proceedings of the Text REtrieval Conference (TREC), pages 131–150. ACM, 2000. [21] B. J. Jansen, D. L. Booth, and A. Spink. Determining the informational, navigational, and transactional intent of web queries. Information Process and Management, 44(3):1251–1266, 2008. [22] D. Lenat. Cyc: Towards programs with common sense. Communications of the ACM, 8(33):30–49, 1990. [23] A. Mikheev, C. Grover, and M. Moens. Description of the LTG system used for MUC. In Seventh Message Understanding Conference: Proceedings of a Conference, 1998. [24] A. Miles and S. Bechhofer. SKOS Simple Knowledge Organization System reference. Working draft, W3C, 2008. http://www.w3.org/TR/skos-reference/. [25] M. Newman. 
Power laws, pareto distributions and zipf’s law. Contemporary Physics, 46:323–351, 2005. [26] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. International Journal of Metadata, Semantics, and Ontologies 2008, 3(1):37–52, 2008. [27] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM),
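As an illustration of the URI-based statistics in Section 4.1, the following is a minimal sketch of how a crawled URI might be assigned to the 303 convention, the hash convention, or neither, given the chain of HTTP status codes seen while dereferencing it. The paper does not give its crawler code, so the function names, the category labels, and the exact Accept header are assumptions made for illustration only.

```python
from collections import Counter
from urllib.parse import urldefrag

# Assumed Accept header expressing a preference for RDF/XML; the header
# actually sent by the original crawler is not stated in the paper.
ACCEPT_HEADER = "application/rdf+xml, */*;q=0.1"


def classify(uri, status_chain):
    """Label a URI from the ordered HTTP status codes seen while fetching it.

    status_chain lists every status code in order, ending with the final one,
    e.g. [303, 200] for a 303 redirect followed by a successful response.
    """
    if not status_chain or status_chain[-1] >= 400:
        return "inaccessible"  # 4xx or 5xx: no document could be retrieved
    final = status_chain[-1]
    if final == 200 and 303 in status_chain[:-1]:
        return "303"  # document reached through a 303 redirect
    if final == 200 and urldefrag(uri).fragment:
        return "hash"  # URI carries a fragment identifier, served directly
    if final == 200:
        return "plain-200"  # served directly, no fragment and no redirect
    return "other"  # e.g. a chain that ends on a 300 or 302


def tally(observations):
    """Return {label: (count, percent)} over (uri, status_chain) pairs."""
    counts = Counter(classify(uri, chain) for uri, chain in observations)
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 2)) for label, n in counts.items()}
```

The same percentage arithmetic reproduces the shares in Table 3, for example 51,873 of 70,128 crawled URIs for the 303 status code.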
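The power-law fits reported in Sections 4.1 and 4.2 each give an exponent α, a point where long-tail behavior starts, and a Kolmogorov-Smirnov D-statistic. The paper does not say which estimator or software produced them; the sketch below assumes the continuous maximum-likelihood estimator described by Clauset et al. [14], with the tail cutoff `x_min` supplied by the caller. All function and variable names are hypothetical.

```python
import math


def fit_alpha(values, x_min):
    """Continuous maximum-likelihood estimate of alpha for values >= x_min.

    Returns (alpha, tail), where tail is the sorted list of values used.
    """
    tail = sorted(v for v in values if v >= x_min)
    log_sum = sum(math.log(v / x_min) for v in tail)
    if len(tail) < 2 or log_sum == 0.0:
        raise ValueError("need at least two tail values that exceed x_min")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, tail


def ks_statistic(tail, x_min, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    n = len(tail)
    d = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / x_min) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at v; check both sides.
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return d


def synthetic_power_law(n, x_min, alpha):
    """Deterministic sample: inverse-CDF values at evenly spaced quantiles."""
    return [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0)) for i in range(n)]
```

A small D indicates that the fitted curve tracks the data closely. Judging whether a given D is significant additionally needs a reference distribution, for instance the bootstrap procedure in [14], which this sketch does not attempt.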
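Section 4.2 breaks the crawled triples down by node type: blank nodes in subject or object position, URI nodes, and literals. A minimal sketch of that kind of count is given below. It assumes triples are already parsed into (subject, predicate, object) string tuples in an N-Triples-like spelling, where blank nodes start with `_:` and literals start with a double quote; this representation, the function names, and the example data are illustrative assumptions, not the paper's tooling.

```python
from collections import Counter


def node_kind(term):
    """Classify one term of a triple by its N-Triples-like spelling."""
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        return "literal"
    return "uri"


def triple_composition(triples):
    """Count node kinds per position and triples that contain a blank node."""
    counts = Counter()
    for subject, _predicate, obj in triples:
        s_kind, o_kind = node_kind(subject), node_kind(obj)
        counts["subject:" + s_kind] += 1
        counts["object:" + o_kind] += 1
        if "blank" in (s_kind, o_kind):
            counts["triples_with_blank"] += 1
    return counts


# Hypothetical example data, not taken from the crawled corpus.
EXAMPLE_TRIPLES = [
    ("http://example.org/Edinburgh", "rdfs:label", '"Edinburgh"'),
    ("http://example.org/Edinburgh", "rdf:type", "http://example.org/City"),
    ("_:b0", "rdf:type", "rdf:Bag"),
    ("http://example.org/Edinburgh", "ex:sights", "_:b0"),
]
```

Dividing `triples_with_blank` by the number of triples gives the kind of share reported in the text (1,051 of 411,574 triples in the paper's sample).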