=Paper=
{{Paper
|id=Vol-538/paper-16
|storemode=property
|title=A Query-Driven Characterization of Linked Data
|pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper16.pdf
|volume=Vol-538
|dblpUrl=https://dblp.org/rec/conf/www/Halpin09a
}}
==A Query-Driven Characterization of Linked Data==
Harry Halpin
Institute for Communicating and Collaborative Systems
University of Edinburgh
2 Buccleuch Place
Edinburgh, United Kingdom
H.Halpin@ed.ac.uk
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
LDOW 2009, April 20, 2009, Madrid, Spain.
ACM 978-1-60558-487-4/09/04.

ABSTRACT
Due to the Linked Data initiative, the once unpopulated Semantic Web is now rapidly being populated with millions of facts stored in RDF. Could any of this data possibly be interesting to ordinary users? In this study, we run queries extracted from a query log from a major hypertext search engine against a Semantic Web search engine to determine if the Semantic Web has anything of interest to the average Web user. There is indeed much Semantic Web information that could be relevant for many queries for entities (like people and places) and abstract concepts, although these possibly relevant results are overwhelmingly clustered around DBPedia. We present an empirical analysis of the results, focusing on their major sources, the structure of the triples, the use of various RDF and OWL constructs, and the power-law distributions produced by both the URIs that serve Linked Data and the URIs in the triples themselves. The issue of 303 redirection and URI identity is given in-depth treatment.

Categories and Subject Descriptors
H.3.d [Information Technology and Systems]: Metadata

General Terms
Experimentation

Keywords
Linked Data statistics, query logs, information retrieval, power law

1. INTRODUCTION
What are the characteristics of Linked Data in the wild? There are two primary questions we are hoping to answer. First, has Linked Data changed from earlier ‘first generation’ Semantic Web efforts? Second, is there anything worth finding for ordinary users in Linked Data? Only a moderately large-scale sampling and analysis of Linked Data can answer this central question. Our method of investigation is to inspect what information needs actual users are expressing when using a hypertext search engine, and then use a sample of these queries to determine if Linked Data can satisfy these information needs. We present an analysis of a search-engine query log from a major hypertext search engine, Microsoft’s Live.com, and use this query log to sample Linked Data. As an added benefit, such an empirical analysis can prove or disprove some widely held assumptions, such as whether or not there is an endemic over-use of owl:sameAs and whether or not the Linked Data best practice recommendation of 303 redirection is being followed.

2. PREVIOUS WORK
For the first generation of the Semantic Web, there was very little data-driven analysis of the ontologies, primarily because so few were actually in existence. The first large-scale analysis of the Semantic Web was done via an inspection of the index of Swoogle by Ding and Finin [16]. Ding and Finin first estimated the size of the Semantic Web in 2006 to be 4.91 million Semantic Web documents by searching Google for the media type application/rdf+xml [16]. As this might not include data that is hosted using the wrong media type, they estimated that, using Google to include all FOAF files served as HTML and RSS 1.0 files, the size of the Semantic Web would optimistically be increased by two orders of magnitude. Although the study of Ding and Finin was of great importance as it was the first empirical study of the Semantic Web, this work has a number of limitations [16]. Its primary limitation was that it was unknown if any of the Semantic Web documents indexed contained information that anyone would actually want to re-use. Intuitively, most of the data on this first-generation Semantic Web was likely to be of limited value. For example, the vast majority of data on the Semantic Web in 2006 was produced by Livejournal exporting every user’s profile as FOAF, usually without the user’s knowledge, and without linking to other URIs, serving with the correct MIME type, or deploying 303 redirection. The second main source of data in Ding and Finin’s study, RSS 1.0, is also of limited value. RSS, originally an XML-based protocol generally used for news feeds, was given an RDF-compatible syntax, creating RSS 1.0 [6]. The very application of RDF in RSS 1.0 is questionable, as the data is primarily information about site updates, and so RSS 1.0 data is rarely merged, re-used, or even linked to in a manner that takes advantage of RDF. Due to the idiosyncratic nature of the data sources of the first-generation Semantic Web, it is not surprising that the majority of the data likely contained little information that could satisfy the information need of the average user of the Web.

Due to the Linked Data initiative, the Semantic Web has recently increased in size by several orders of
magnitude due to the conversion of a large number of high-quality databases into RDF [12]. Since the study by Ding and Finin missed the rise of Linked Data, the time is ripe for more empirical studies of the Semantic Web. It is unclear how the dynamics of the Semantic Web are changing. While the number of URIs indexed by Linked Data search engines like Sindice shows that the general trend of the number of URIs on the Semantic Web visually follows a ‘power-law,’ the correct mathematical analysis has not been done to show this to be the case [26]. The only large-scale study of Linked Data at this time has been by Hausenblas et al., and it estimated the size of Linked Data at approximately 2 billion triples [19]. The focus of that study was only on interlinking between data-sets, and it estimated that there were approximately 3 million interlinks between the various data-sets. The most popular interlinking property by far was dbpedia:hasPhotoCollection, with approximately 2 million occurrences, most likely due to the term being used by a Linked Data exporter around the popular photo-hosting service Flickr [2]. In summary, the Linked Data phenomenon is huge, much larger than the first-generation Semantic Web, and its properties have not been fully studied. In particular, there has been little work on determining how the issues of the reference of URIs play out in the wild given by Linked Data.

3. SAMPLING LINKED DATA VIA QUERY LOGS
The main problem facing any empirical analysis of the Semantic Web is one of sampling. As almost any database can easily be exported to RDF, any sample of the Semantic Web can be biased by the automated release of large, if ultimately useless, data-sets. This was demonstrated in an exemplary fashion by the release of RSS 1.0 data. RDF vocabulary terms that have little content, such as rss:item, quickly bias the statistical analysis. With the advent of Linked Data, this has to some extent already happened, with large numbers of databases being released as Linked Data, ranging from the BBC’s John Peel recordings to the MusicBrainz audio CD collection [19]. How much of Linked Data is aimed at general use? Obviously, components like DBPedia, the export of Wikipedia to Linked Data, could be very useful [2]. The vast majority of data released into the Semantic Web is of appeal only to a niche audience, such as the large appeal of Bio2RDF to health care and life-sciences. Just as RSS 1.0 and the Livejournal export of FOAF biased sampling of the first-generation Semantic Web, the release of a large Linked Data set such as Bio2RDF, containing approximately 65 million triples and so rivaling the size of DBPedia, can bias any sampling of Linked Data [7]. For example, if one just counted the number of URIs used on the Semantic Web, one would quickly find that bio2rdf:xProteinLinks would prove to be, in sheer number, a very popular term despite its relative lack of use outside the biomedical community. It is a small step then to imagine ‘semantic spamming’ that releases large amounts of bogus URIs into the Semantic Web. Furthermore, due to the open nature of the Web, it is difficult, if not impossible, to determine how many actual separate providers of Semantic Web data there are, so choosing seed samples a priori or ‘weighting’ any sample is difficult. Unlike the original Web, which grew at least in an organic fashion for its first few years, the Web of Linked Data grows in very noticeable ‘fits and starts’ as large data-sets are released, so each data-set can vastly alter any empirical analysis. The question is not how to avoid bias in sampling, but to choose the kind of bias one wants. We are aiming for a bias towards the ordinary user of the Web.

What information is available on the Semantic Web that ordinary users are actually interested in, and how do we sample this data? The obvious candidate for exploring this would be to look at a major search engine query log, as it gives a sample of the interests of many users in aggregate. Since Semantic Web search engines are currently used mostly by Semantic Web developers and not by ordinary users, the query log of a popular hypertext search engine should be sampled as opposed to a more specialized search engine. The entire bet of the Semantic Web is that it will contain information that many ordinary users will want to re-use and merge via Semantic-Web enabled applications, and that this information will primarily be about non-information resources such as entities like people and places and abstract concepts. Thus, the ideal sampling of the Semantic Web would be to extract query terms referring to physical entities and abstract concepts from a hypertext search engine query log, and then by virtue of a Semantic Web search engine we can determine precisely how much information Linked Data contains on these subjects.

3.1 The Live.com Query Log
There has been much work in query log analysis in order to discover how to best satisfy the information needs of users on the Web. Since most search query logs of any size belong to search engine companies, it is often difficult for researchers outside those companies to analyze these query logs, and therefore most research in search query logs deals with small or special-purpose query logs, such as the Web track in the TREC competition [20]. A few employees of large search corporations have released detailed studies of their search engine query logs. In particular, Silverstein et al.’s analysis of a billion queries in the Altavista query log is considered to be a large ‘gold-standard’ study of query logs [29]. In order to extract concepts and entities, we analyze the query log of approximately 15 million distinct queries from Microsoft Live Search, and all references to the ‘query log’ are to this Microsoft query log, which is provided by Microsoft due to a 2007 ‘Beyond Search’ award. This query log contains 14,921,285 queries. Of these queries, 7,095,302 (48%) were unique. Corrected for capitalization, 4,465,912 (30%) were unique. Of all queries, only 228,593 (2%) queries used some form of advanced keywords, while 709,102 (5%) used boolean operators and 266,308 (2%) used quotation, leading to a total of 1,204,003 (17%) queries using some advanced techniques provided by the search engines. The average number of terms per query was 1.76. Note that these extremely brief queries are normal for hypertext Web search engines, with an average query length of 2.35 being reported by Silverstein et al. for the Altavista query log [29]. Since we did not want to deal with queries that were only typed once or a few times, as these may not be representative of most users’ interests, we did not select for further use any queries with a frequency less than 10, resulting in a reduction of 37% from the total query log of 7,095,302.

3.2 Extracting Queries for Entities and Concepts
Automatically classifying informational queries is difficult. Rule-based approaches that claim to work over entire query logs like those of Jansen et al. [21] are dubious at best, since they work by applying very loose specifications such as “query length greater than 2” and “any query using natural language terms.” More promising work has applied both supervised and unsupervised machine-learning to discover informational queries, but only achieved an accuracy of 50% [3]. A number of machine-learning algorithms could be employed to learn named entities, but the sparse amount of linguistic context in query logs makes identifying named entities difficult in an unsupervised manner, and there is virtually no labeled data for supervised learning [33]. Even most rule-based approaches for named entity recognition rely heavily upon capitalization and punctuation, such as ‘I.B.M.’ and ‘Gustave Eiffel,’ features that are lacking from query logs [23].

We call queries that are automatically identified to be about physical entities in the query log entity queries. For the discovery of entity queries, people and places are obvious places to begin. An updated version of the system that was the highest performer at MUC-7 [23], a straightforward gazetteer-based and rule-based named entity recognizer, was employed to discover the names of people and places. The gazetteer for names was based on a list of names maintained by the Social Security Administration and the gazetteer for place names was based on the gazetteer provided by the Alexandria Digital Library Project. Although it could be possible to separate out people and places, this was not done. First, both of these are types of entities. Second, the names of many locations such as ‘Paris’ or ‘Georgia’ can also be used as a person’s name. This gazetteer-based approach was chosen to provide high precision, even at the cost of a dramatically reduced recall. This is an acceptable trade-off as we are attempting only to sample the number of queries that would be likely to have URIs on the Semantic Web. A high-quality sample of the query log is more important than a large one for this purpose. Of a random sample of 100 entity queries, a judge considered 94% to be correctly categorized as entities such as people or places.

From the pruned unique queries in the query log, totaling 4,465,912 queries, a total of 509,659 queries (11%) were identified as either people or places by the named-entity recognizer. The top 10 entity queries are given in Table 1. Some transactional and navigational queries, despite their relatively lower frequency overall in the query log, are highly clustered towards the top of the query distribution. These navigational queries such as ‘chase’ and ‘office max’ have clearly snuck into the top ten due to their use of common names in their website names. A legitimate number of real names, such as ‘jessica alba’ and ‘marcus vick,’ were discovered.

Table 1: Top 10 Entity Queries in Query Log
Frequency Query
7311 david blaine
4039 kelly blue book
3053 chase
2997 jessica alba
2100 nick
1415 office max
1280 michael hayden
1139 harley davidson
1098 marcus vick
1092 keith urban

A method for discovering abstract concepts in the query log is more challenging. These queries are called concept queries, queries that are automatically identified to be about abstract concepts in the query log. Previous attempts at discovering abstract concepts have employed machine-learning over truly massive query logs and document collections from Google [27]. Since this massive amount of data was not available, we employed WordNet instead. WordNet consists of approximately 207,000 words with unique synsets. Our algorithm for discovering abstract concepts in query logs using WordNet was straightforward: we only chose queries of length one where the query had a hyponym and hypernym, due to the difficulty of WordNet dealing with some multi-word queries. This assured that the query was for a class that was suitably abstract (having a hyponym) but not so abstract as to be virtually meaningless (having a hypernym). This resulted in a more restricted 16,698 concept queries (.4% of total query log). The top 10 concept queries are given in Table 2. Again, a number of clearly transactional queries have managed to find themselves into the concept queries, such as ‘chase’ and ‘drudge,’ as well as a number of queries where the sense of a word has been taken over by a proper name, such as ‘sprint’ and ‘aim.’ Again, this is due to the preponderance of navigational names towards the top of the query distribution. Of a random sample of 100 concept queries, a judge considered 98% to be correct. While some of the queries could be considered somewhat navigational (such as those for maps and dictionaries), they could all be considered informational queries about some abstract concept.

Table 2: Top 10 Concept Queries in Query Log
Frequency Query
11383 weather
10321 dictionary
3675 people
3217 music
2192 autism
1468 map
1198 travel
1191 pregnancy
1104 news
1052 charter

3.3 Power-Law Detection
The frequency of queries, when rank-ordered, follows what is known as a ‘power-law’ distribution, with a relatively small number of very popular queries and a long tail of queries only occurring once or twice, where most of the mass of the distribution is in the long tail and the ‘top’ of the distribution exponentially decreases. Since this distribution is common in search on the Web, we will define it precisely: A power-law is a relationship between two scalar quantities x and y of the form:

y = cx^α + b (1)
where α and c are constants characterizing the given power-law, and b is some constant or variable dependent on x that becomes constant asymptotically. Typically it is applied to rank-ordered frequency diagrams, where the frequency of some measurement is given on the horizontal axis while the rank order of the measurements in terms of their frequency is given on the vertical axis. The α exponent is the scaling exponent that determines the slope of the top of the distribution and provides the remarkable property of scale-invariance, such that if a true power-law is observed, as more samples are added to the distribution, the α remains constant, i.e. the distribution is ‘scale-free’ [32]. It is crucial to note that a power-law distribution violates assumptions of the normal Gaussian distribution, such that routine statistics such as averages and standard deviations can be and usually are misleading. In fact, one of the surest signs of a non-normal distribution like a power-law distribution is a very large standard deviation. Is such a distribution evident from Linked Data? One important question is how to detect power-law distributions in actual data. When b is negligible, Equation 1 can also be written as:

log y = α log x + log c (2)

When written in this form, a fundamental property of power-laws becomes apparent: When plotted in log-log space, power-laws are ‘straight’ lines. Thus, the most widely used method to check whether a distribution follows a power-law is to apply a logarithmic transformation, and then perform linear regression, estimating the slope of the function in logarithmic space to be α, as done by Ding and Finin [16]. However, standard least-squares regression has been shown to produce systematic bias, in particular due to fluctuations of the long tail [14]. To determine a power-law accurately requires minimizing the bias in the value of the scaling exponent and the beginning of the long tail via maximum likelihood estimation. See Newman [25] and Clauset et al. [14] for the technical details.

Determining whether a particular distribution is a ‘good fit’ for a power-law is difficult, as most ‘goodness-of-fit’ tests employ normal Gaussian assumptions violated by potential power-law distributions. Luckily, the non-parametric Kolmogorov-Smirnov test can be employed for any distribution and so is ideal for measuring ‘goodness-of-fit’ of a given finite distribution to a power-law function. While the details are given at length in Clauset et al. [14], intuitively the Kolmogorov-Smirnov test can be thought of as follows: Given a reference distribution P, such as an ideal power-law distribution generating function, and a sample distribution Q of size n suspected of being a power-law, where one is testing the null hypothesis that Q is drawn from P, then the Kolmogorov-Smirnov test compares the cumulative frequency of both P and Q to discover the greatest discrepancy (the D-statistic) between the two distributions. This D-statistic is then tested against the critical value p of the D-statistic at n, which varies per function. The null hypothesis is rejected if the D-statistic is less than the critical p-value for n, p being the probability that the distribution was drawn from a power-law generating function given the estimated parameters. In order to determine how well the power-law method fits, whenever a power-law is reported, the D-statistic is also reported, and we will determine whether or not the fit was significant according to the conservative p < .1. The Kolmogorov-Smirnov test is valid even for power-law distributions since Q’s cumulative density function is asymptotically normally distributed and this can be compared to the cumulative density function of P.

The query frequencies for entity and concept queries are plotted in logarithmic space in Figure 1. Both entity and concept queries appear to be linear in log-space, and so can be considered candidates for power-laws. Using the method described above, the α of the queries for entities was calculated to be 2.31, with long tail behavior starting around a frequency of 17 and a Kolmogorov-Smirnov D-statistic of .0241, indicating a significant good fit. The α of the concept queries was calculated to be 2.12, with long tail behavior starting around a frequency of 36 with a Kolmogorov-Smirnov D-statistic of .0170, also indicating a significant good fit for the power law. Given their two remarkably similar α statistics and high goodness of fit, one can safely conclude that these query logs do indeed follow power-law distributions. This indicates our sample of entities and concepts is representative of the larger query log, as query logs are well known to follow power-law distributions [4].

[Figure 1: The rank-ordered frequency distribution of extracted entity and concept queries, with the entity queries given by green and the concept queries by blue. Log-log plot; x-axis: Popularity-ordered queries; y-axis: Query Popularity.]

3.4 Querying Linked Data with FALCON-S
Both the concept queries and the entity queries are used to query the Semantic Web. Since our goal was to discover how much of interest for ordinary users was present on the Semantic Web, one problem with using the entire query log was that it would contain a vast amount of unique queries that would likely never be repeated. So, we excluded a portion of the long tail from the study by removing all queries with a frequency of less than 10. The parameter 10 was chosen as it was the number that could reduce both entity and concept queries to the same order of magnitude. Due to the power-law behavior of both entity and concept queries, this truncation consists of ‘removing’ a large amount of the long tail, while maintaining the entire ‘top’ of the power-law distribution, as well as some significant component of the long tail. This procedure is justified insofar as the ‘long tail’ likely consists of queries that are never or very rarely repeated, while the remaining queries represent queries that
are likely to be repeated. This pruning of low-frequency queries from our sampling does exclude many ‘difficult’ or ‘specialist’ queries, but we are aiming for queries that are general-purpose and popular. We call these queries with more than 10 URIs returned from the Semantic Web the crawled queries to distinguish them from the greater query log. Likewise, crawled entity queries are entity queries with more than 10 URIs returned from the Semantic Web, and similarly for crawled concept queries.

This truncation reduced the amount of queries significantly, from 587,283 to 7,848 queries, removing 99% of the queries. It reduced the number of entity queries from 570,585 to 5,308 (a 91% reduction) and the number of concept queries from 16,698 to 2,540 (an 85% reduction). This gap in the result of pruning off the ‘long tail’ is interesting, as it shows that while there is a lower amount of concept queries than entity queries overall, concept queries are repeated an order of magnitude or so more often than entity queries. The only caveat is that our identification of concept queries via WordNet is likely more stringent than our identification of entity queries, and thus leads to fewer concept queries overall. Furthermore, the vast majority of entity queries, as opposed to concept queries, appear to be queries that are issued only once or a very few times. This would make a certain amount of sense, as many queries for people and places are not for famous people and places, but for infrequently-mentioned people and places, such as wayne way san mateo and sara matthews. The concept queries were as diverse as gastropod and accolade. Still, the crawled queries are biased significantly in favor of entity queries, being composed of 68% entity queries and only 32% concept queries.

The FALCON-S Object Semantic Web search engine [13] was used to query the Semantic Web for selected entity and concept queries between August 3rd and 4th 2008. We recognize that this is a major weakness of the study, as its index may not be a representative sample of the entire Linked Data Web, but it is a significant sample regardless. At the time, FALCON-S seemed to have the best rankings, and a comparable index to other engines. The results of running the crawled queries against a Semantic Web search engine were surprisingly fruitful, although varying immensely. For entity queries, there was an average of 1,339 URIs (S.D. 8,000) returned per query. On the other hand, for concept queries, there was an average of 26,294 URIs (S.D. 14,1580) returned per query, with no queries returning zero documents. Given the high standard deviation of these results, it is likely that there is either a power-law in the resulting URIs for the queries, or some other non-normal distribution. As shown in Figure 2, when plotted in logarithmic space, both entity queries and concept queries show a distribution that is heavily skewed towards a very large number of high-frequency results, with a steep drop-off to almost zero results instead of the characteristic long tail of a power law. Far from having no information that might be relevant to ordinary user queries, the Semantic Web search engines returned either too many URIs possibly relevant to the query or none at all.

[Figure 2: The rank-ordered frequency distribution of the number of URIs returned from entity and concept queries, with the entity queries given by green and the concept queries by blue. Log-log plot; x-axis: Frequency-ordered Returned Semantic Web URIs; y-axis: Frequency of Semantic Web URIs returned.]

[Figure 3: The rank-ordered popularity of entity and concept queries is on the x-axis, with the y-axis displaying the number of Semantic Web URIs returned, with the entity queries given by green and the concept queries by blue. x-axis: Popularity-ordered Queries; y-axis: URIs (scale x 10^5).]

Another question is whether or not there is any correlation between the amount of URIs returned from the Semantic Web and the popularity of the query. As shown by Figure 3, there is no correlation between the amount of URIs returned from the Semantic Web and the popularity of the query. For entity queries, the Spearman’s rank correlation statistic was the insignificant .0077 (p > .05), while for concept queries, the correlation was still insignificant at .0125 (p > .05). Just because a query is popular or unpopular does not mean the Semantic Web will be more or less likely to satisfy the information need of the query. This makes sense, as the vast majority of queries are heavily dependent on current events and fashion, and the Linked Data data sources are not updated often enough to deal with this kind of information, so there is an inevitable temporal lag between the time information appears in the world outside the Semantic Web and its digitization on the Semantic Web. Yet as shown by Figure 2, the amount of possibly useful information for the vast majority of queries is still surprisingly large, although how many of the returned URIs are actually relevant to human users is not yet known.
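
The rank correlation between query popularity and the number of returned URIs can be illustrated with a minimal sketch. It assumes two parallel lists, one of query frequencies and one of URI counts, and computes Spearman's rho as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the URI counts in the example are hypothetical and are not taken from the study; only the five frequencies come from Table 1.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Frequencies of the first five entity queries in Table 1.
    query_frequency = [7311, 4039, 3053, 2997, 2100]
    # Hypothetical URI counts, invented purely for illustration.
    uris_returned = [120, 15000, 40, 9000, 300]
    print(spearman_rho(query_frequency, uris_returned))
```

A value near zero, as reported above for both entity and concept queries, means that the popularity ranking of a query says little about how many URIs are returned for it. The sketch does not compute a significance level; a statistics library would normally be used for that.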
4. EMPIRICAL ANALYSIS OF THE SEMAN- Linked Data [11]. This statistic as regards usage of the 303
TIC WEB convention is misleading in the broad sense, as most of the
URIs are from a single source, DBPedia, as shown later in
Surprisingly, there is a deluge of possible Semantic Web Table 4.
URIs for any given query. Due to the high number of re- The majority of URIs, 51,873 (74%), served a Semantic
sults for each query, we restricted our analysis to the top Web document via 303 redirection, and so returned the 200
10 Semantic Web URI results for each query as given by status code when the Semantic Web document was accessed
FALCON-S’s ranking algorithm, and distinguish this subset
after the redirection. 200 status codes without 303 redi-
from all the URIs returned by the Semantic Web, by calling
rection still form a substantial fraction of Semantic Web
this subset the crawled URIs. Concept URIs are URIs. There are several reasons for this; all hash convention
crawled URIs from the crawled concept queries while entity URIs would by default still technically commit a redirect
URIs are crawled URIs from the crawled entity queries. Al- to be served by a 200 status code. However, this is only a
though crawled URIs are a small subset of the total URIs re- minority (27%) of those URIs returning a 200 status code.
trieved, given that user behavior in general inspects the first
The rest are likely caused by people serving RDF that does
ten URIs returned by this search [18], it makes more sense to
not have the access to the Web server configuration needed
sample these ten URIs per query than to sample every URI to serve RDF using the 303 redirection, while many others
retrieved. The crawled URIs totaled 70,128 URIs, composed may have started serving RDF before the W3C TAG deci-
of 25,400 (36%) concept URIs and 44,728 (63.78%) entity sion [28] was made or are not aware of Linked Data best
URIs. These URIs were crawled using HTTP GET with a practices. For example, some earlier RDF-enabled reposito-
preference for the media type application/rdf+xml in
ries like W3C WordNet did redirection by 300 redirection. A
order to prefer RDF files served by content negotiation, and
small percentage may be ordinary web-pages, perhaps con-
any 303 redirection was followed. taining some meta-data as enabled by GRDDL, that just
Of all crawled queries, a total of 6,673 (85%) had at least happened to be indexed by the Semantic Web search en-
10 crawled URIs. All concept queries had at least 10 crawled gine [15]. Furthermore, of these crawled URIs, 9,156 (13%)
URIs and only 4,133 of the entity queries (12% of all entity URIs had no Semantic Web document that was accessible
queries) did not have 10 crawled URIs. Inspecting just the set
via HTTP, shown by the use of a 4xx or a 5xx-level status
of queries that did not have 10 crawled URIs, the average
code.
number of URIs when 10 URIs were not returned were 2.89
(S.D. 2.88). So, the trend observed earlier was repeated in
this smaller data-set, namely that while most of the time too 51,873 73.97% 303
many URIs are retrieved from the Semantic Web, sometimes 6,061 8.65% 200
there are no URIs are retrieved from the Semantic Web for 4,517 6.44% 404
certain entity queries. Looking at the data more closely, 357 4,257 6.07% 500
(30%) of the crawled URIs with less than 10 results returned 3,147 4.49% 300
no URIs, while 138 (12%) returned a single URI and 113 re- 246 0.35% 406
turned two URIs (10%). These queries with zero results 20 0.03% 403
seem to be mostly for not well-known places such as playa 4 0.00% 302
linda (a hotel in Majorica) or fairly unknown people such 3 0.00% 502
as william ravies or misspellings or popular truncations of
names for people such as steven colbertbush. This obser- Table 3: Top 10 HTTP Status Codes for crawled
vation helps explains the sudden drop in Semantic Web URIs URIs
returned for queries in Figure 3. There was little overlap be-
tween the the crawled URIs retrieved by different queries,
with an overlap in entity queries of 546 URIs (.01%) and an The top 10 hosts of Semantic Web data in the crawled
overlap in concept queries of 1031 URIs (.04%). In other URIs is given by Table 4. DBPedia, the export of Wikipedia
words, the various queries weren’t just retrieving the same to RDF, dominates the results with 83% of all URIs com-
small group of URIs over and over again. ing from either Wikipedia or DBPedia [2]. The W3C it-
self is the third largest exporter of RDF with a share of
4.1 URI-based Statistics 5%. Upon closer inspection, most of the URIs crawled from
In this section, we inspect the various kinds of statistics we the W3C derive from the W3C-hosted export of the linguis-
can detect on the ‘macro-level’ of the crawled URIs without tic database Wordnet. The domain of the Freie Universität
actually accessing any Semantic Web documents from the Berlin has a significant 2% of all RDF data, which is due pri-
URIs. marily for its Flickr photo export to RDF. An RDF-version
The HTTP status returned by attempting to access the of Cyc and the biomedical data hosting site Bio2RDF also
various crawled URIs are given in Table 3. In particular, host small but significant amounts of Semantic Web data
the most revealing statistic is the majority of the Seman- [22]. The Russian-blog hosting site Liveinternet.ru carries
tic Web sampled by the crawled URIs is served using the on the tradition of FOAF exporting of Livejournal. True-
303 convention, not the hash convention. In fact, a total of sense is another export of WordNet to RDF, although not
51,762 (73%) of crawled URIs use the 303 convention, while as frequently used as W3C Wordnet. Towards the end of
only 1,662 (2%) of the crawled URIs use the hash conven- the ranking there is the RDF version of Univeristät Trier’s
tion. Of these URIs returning the hash convention, manual widely used DBLP academic citation database and
inspection showed many to be FOAF files. This shows the Ontoworld.org, a RDF-enabled wiki for the Semantic Web
vast majority of Linked Data is following the 303 conven- research community [31].
tion and so obeying the W3C and the guide to publishing The average number of URIs hosted by a domain name
was 1,268 (S.D. 16,060), with the average number of entity URIs hosted by any domain being 1,236 (S.D. 15,458) and the average number of concept URIs hosted by a domain being 1,0327 (S.D. 6,650). The very high standard deviations are usually a sign of a power-law distribution, as shown in Figure 4. Attempting to fit a power-law distribution, the α of the rank-ordered domain list frequency distribution is 1.53, with long tail behavior starting around 175 and a Kolmogorov-Smirnov D-statistic of .1414, indicating an insignificant fit for the power-law distribution. In other words, while a few sources like DBpedia dominate the crawled URIs, with a rapidly decreasing number of smaller sites such as Cyc and the W3C, the long tail of individuals hosting their FOAF files on their personal websites is still rather insignificant compared to the ‘top’ major sites hosting Linked Data. This is because Linked Data is being artificially generated in large ‘chunks’ by projects like W3C WordNet and DBpedia, and so does not organically form the power-law distribution characteristic of naturally-evolving complex systems.
[Figure 4 plot: log-log axes; x-axis: URI frequency-ordered domain names; y-axis: number of crawled URIs; series: entity URIs, concept URIs, total Semantic Web URIs]
Figure 4: The rank-ordered distribution of the domain names hosting Semantic Web data from the crawled URIs ordered by number of URIs hosted.
URIs     Share     Domain name
54,698   78.00%    dbpedia.org
3,584    5.11%     wikipedia.org
3,448    4.92%     w3.org
1,704    2.43%     fuberlin.de
811      1.16%     cyc.com
701      1.00%     bio2rdf.org
599      0.85%     liveinternet.ru
417      0.59%     truesense.net
322      0.46%     dblp.unitrier.de
314      0.47%     ontoworld.org
Table 4: Top 10 Domain Names for URIs for Crawled URIs
4.2 Triple-based Statistics
In this section, we move our analysis down from the level of URIs to the level of the triples accessible from the URIs. Since a number of crawled URIs were inaccessible, this reduced the total number of accessible crawled URIs to 60,972, a reduction of 13% from the crawled URIs. The accessible crawled URIs contained 24,074 accessible crawled concept URIs (95% of all crawled concept URIs) and 36,898 accessible crawled entity URIs (82% of all crawled entity URIs). Thus, the accessible crawled URIs maintained a bias towards entity URIs (61% of all accessible crawled URIs) as compared to concept URIs (39% of all accessible crawled URIs). Each of the crawled accessible URIs was accessed, and this resulted in a total of 59,228 Web representations, with only 48 URIs not allowing access to a Semantic Web document. These non-Semantic Web documents were usually ordinary web-pages from which RDF triples could not be extracted via GRDDL [15] or RDFa [1]. We will call these documents the crawled Semantic Web documents, and the triples in these documents the crawled triples.
There were a total of 411,574 RDF triples in the crawled triples, with 242,829 (59%) triples for concepts and 168,745 (41%) triples for entity URIs. Concepts, despite being fewer in number, seem to require more triples to describe than entities. The internal structure of these triples is of surprising interest. Of these triples, there were a total of 1,051 triples containing blank nodes, a measly .25% of all triples in the corpus, of which 772 (73%) were in the subject position and only 279 (27%) were in the object position. This means that the use of blank nodes, whose purpose is to serve as syntactic placeholders for objects like lists and to represent n-ary arguments in RDF, is almost non-existent in our sample. Removing blank nodes, the composition was split between URI nodes (66%) and a surprisingly large minority of RDF literal nodes (34%). These literals contain some form of information in either ‘unstructured’ natural language or some form of structured information in a formal language, such as integer values.
Of the literals, a total of 403,119 were RDF string literals, while only 2% were of some other data type, with the top 10 frequent data-types given in Table 5. The most frequent data-types are from XML Schema [10], while others are customized for DBpedia. It appears that the vast majority of RDF in the Semantic Web of interest to average users is made up of simple URI-based triples with rich information in natural language. This also goes against the intuition that the vast majority of Semantic Web data that is of interest to ordinary users would be highly structured data of exported databases [8]. Instead, what is of interest in Linked Data is stored mainly in natural language, with RDF adding only a minimal structure to what are essentially fragments of natural language. While it could be argued that this particular finding is merely an artifact of DBpedia, given that our querying includes other data-sets, this finding may well be generalizable. We are not studying the Semantic Web as some of its designers would like to have it, but as it actually exists, and part of its existence is that DBpedia forms a huge central cluster that for ordinary users is the most interesting and useful part of Linked Data.
One interesting question is the predominance of the various kinds of Semantic Web knowledge representation terms on the Semantic Web, since this would show what kinds of inference could actually be deployed on the Semantic Web. First, of the total 1,093,212 URIs in triples harvested from the crawled accessible URIs, only 243,776 (22%) were from one of the primary W3C Semantic Web knowledge representation languages, either RDF, RDF(S), or OWL.
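As a rough illustration of how URIs might be attributed to the W3C knowledge representation languages, the following minimal Python sketch classifies each URI by namespace prefix. The three namespace URIs are the standard W3C ones; the helper names `classify_uri` and `vocabulary_shares` and the sample URIs are assumptions made for this example and are not the procedure actually used in this study.

```python
from collections import Counter

# Standard W3C namespace URIs for the three vocabularies.
W3C_NAMESPACES = {
    "RDF": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "RDF(S)": "http://www.w3.org/2000/01/rdf-schema#",
    "OWL": "http://www.w3.org/2002/07/owl#",
}


def classify_uri(uri):
    """Return the vocabulary a URI belongs to, or 'other' (hypothetical helper)."""
    for name, namespace in W3C_NAMESPACES.items():
        if uri.startswith(namespace):
            return name
    return "other"


def vocabulary_shares(uris):
    """Return per-vocabulary counts and the share of W3C terms among all URIs."""
    counts = Counter(classify_uri(u) for u in uris)
    w3c_total = sum(counts[name] for name in W3C_NAMESPACES)
    share = w3c_total / len(uris) if uris else 0.0
    return counts, share


# Illustrative sample only; these are not URIs taken from the crawled triples.
sample = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://dbpedia.org/resource/Edinburgh",
]
counts, share = vocabulary_shares(sample)
print(dict(counts), share)
```

Applied to the counts reported in this section, the same share computation gives 243,776 / 1,093,212, or about 22%.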
Literals   Share     Data type
403,119    97.95%    RDF plain literal
3,103      0.75%     w3c:/XMLSchema#integer
2,789      0.68%     w3c:/XMLSchema#string
1,185      0.29%     w3c:/XMLSchema#double
522        0.13%     w3c:/XMLSchema#date
248        0.06%     w3c:/XMLSchema#float
136        0.03%     w3c:/XMLSchema#gYear
65         0.02%     w3c:/XMLSchema#gYearMonth
59         0.01%     dbpedia:Rank
46         0.01%     dbpedia:Dollar
14         0.00%     w3c:/XMLSchema#int
9          0.00%     dbpedia:Percent
Table 5: Common Data Types in Crawled Triples
Of the URIs from RDF, RDF(S), or OWL, the RDF vocabulary itself was the most popular, with 109,300 URIs (45%), followed fairly closely by the RDF(S) vocabulary with 100,340 URIs (41%), and OWL being dwarfed by RDF and RDF(S) with only 34,136 URIs (14%). This does not mean that OWL is irrelevant to the rest of the corpus, as ontologies constructed with OWL could be deployed to model the concepts and entities employed in ‘instance’ data. Yet while OWL has been an academic success story, as far as practical deployment is concerned, RDF terms and RDF(S)-based inference seem to be the foundation of the Semantic Web in practice.
What precise URI-based terms are used in these knowledge representation languages? The top constructs in either RDF, RDF(S), or OWL in crawled triples are given in Table 6. To summarize, RDF(S) class and sub-class reasoning is very popular, with this construction accounting for nearly half (48%) of knowledge representation use on the Semantic Web. The second most popular use of knowledge representation (22%) is for natural language annotation, describing a particular Semantic Web resource using natural language and connecting this natural language description to the URI via the use of rdfs:comment or rdfs:label. There are surprisingly few (4%) actual ontologies in the crawled Semantic Web resources. Furthermore, non-traditional features of RDF(S), such as the use of rdfs:property, are frequent occurrences. Even reification of RDF triples, officially discouraged by the Semantic Web community, accounts for only 95 triples, and there is also fairly heavy use of discouraged RDF constructs to represent different kinds of lists, such as rdf:Alt (349 occurrences) and rdf:Bag (344 occurrences). Lastly, while many Semantic Web researchers originally hoped that the use of inverse functional properties would allow the merger of Semantic Web data, there were zero explicitly declared usages of owl:inverseFunctionalProperty. Overall, the usage of OWL, RDF(S), and RDF terms in the corpus also follows to some degree a power-law-like distribution, with α equal to 1.5 and long tail behavior starting around 90, although the Kolmogorov-Smirnov D-statistic of .1911 reveals this fit to be insignificant. This is because while a few terms vastly dominate, the vast majority of other terms are not used at all. This has repercussions for both Semantic Web implementers and vocabulary specification within the W3C, since obviously some level of concentration of effort upon the most frequently-deployed terms would be reasonable.
Uses     Share     Construct
73,451   30.31%    rdfs:Class
47,044   19.30%    rdfs:comment
44,113   18.10%    rdfs:subClassOf
8,630    3.54%     owl:Ontology
7,256    2.97%     rdfs:label
6,618    2.14%     rdf:Subject
5,107    2.09%     owl:ObjectProperty
3,642    1.49%     rdfs:subPropertyOf
1,157    0.47%     owl:sameAs
535      0.29%     rdfs:range
Table 6: RDF and OWL Constructs in Crawled Triples
One of the most popular OWL constructs is indeed the controversial owl:sameAs term, which is used to declare some sort of global equivalence between two URIs. While it is a tiny portion (.47%) of overall Semantic Web modelling term usage, it is far from insignificant, with 1,157 occurrences. The use of owl:sameAs in the wild is far different from what the role it plays in popular debate within the Semantic Web community would suggest. Logicians hold that owl:sameAs is only for what are properly considered individuals in description logic, so that classes and properties should use the more restricted and semantically correct owl:equivalentClass and owl:equivalentProperty. Yet this best practice in logic has not been adopted by the Linked Data community, as owl:equivalentClass has only 2 occurrences and there are none of owl:equivalentProperty. Instead, the Linked Data movement uses owl:sameAs to simply “state that another data source also provides information about a specific non-information resource,” so leading owl:sameAs to tend to mean ‘more-or-less the same thing as’ [11]. This practice leads to the fear that the use of owl:sameAs would propagate too far, such that many URIs for perhaps differing referents would be declared identical [17].
Both critiques of owl:sameAs appear to be wrong. Given the number of Semantic Web URIs returned by the queries, while there is considerable use of owl:sameAs, it appears that the manual discovery and publication of co-referential URIs using owl:sameAs falls far behind the actual growth of Linked Data. One could say that owl:sameAs is not being used enough. The real problem is not that distinct things are being given the same URI, but the reverse: it appears endemic that the same thing has multiple URIs. So Berners-Lee’s hypothesis appears to be wrong: a single thing is likely identified by more than a single URI on the Semantic Web.
The top 10 Semantic Web vocabularies used in the crawled triples, including those outside of the W3C-approved Semantic Web knowledge representation languages, are shown in Table 7. The results should not be that surprising, in particular the vast dominance of DBpedia. Perhaps surprising is the amount of usage of Cyc terms, as well as terms from SKOS, the Simple Knowledge Organization System of the W3C, whose primary source of deployment is the W3C’s export of WordNet to RDF [24]. FOAF is also significant, although not nearly as dominant as was found earlier by Ding and Finin [16]. Also popular is YAGO (Yet Another Global Ontology), a merger of WordNet and Wikipedia category hierarchies employed by DBpedia [30].
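The two concerns about owl:sameAs discussed in this section, statements that propagate too far and co-referential URIs that are never linked, can both be made concrete by treating owl:sameAs as an equivalence relation. The following Python sketch is offered only as an illustration: it groups URIs into co-reference clusters with a union-find structure. The class name, the function name, and the example.org URIs are hypothetical and are not part of this study's tooling.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def coreference_clusters(same_as_pairs):
    """Group URIs linked directly or transitively by owl:sameAs statements."""
    uf = UnionFind()
    for left, right in same_as_pairs:
        uf.union(left, right)
    clusters = defaultdict(set)
    for uri in list(uf.parent):
        clusters[uf.find(uri)].add(uri)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical owl:sameAs statements, written as (subject, object) pairs.
pairs = [
    ("http://example.org/a/Edinburgh", "http://example.org/b/Edinburgh"),
    ("http://example.org/b/Edinburgh", "http://example.org/c/Edinburgh"),
    ("http://example.org/a/Paris", "http://example.org/b/Paris"),
]
for cluster in coreference_clusters(pairs):
    print(cluster)
```

In this example the a and c URIs for Edinburgh end up in one cluster even though no statement links them directly, which is the transitive propagation critics worry about. Conversely, any co-referential URI that no publisher has linked with owl:sameAs stays outside every cluster.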
URIs      Share     Vocabulary
366,849   33.55%    DBpedia URIs
109,300   9.99%     RDF URIs
100,340   9.17%     RDF(S) URIs
94,520    8.65%     Cyc URIs
34,136    3.12%     OWL URIs
6,563     0.60%     SKOS URIs
4,728     0.43%     dblp.l3s.de
3,263     0.29%     FOAF URIs
2,170     0.20%     YAGO URIs
1,836     0.16%     WordNet URIs
Table 7: Top Vocabulary URIs in Crawled Triples
5. CONCLUSION
The empirical analysis of Linked Data presented in this study is by no means complete, for it is only a moderately small sample by one Semantic Web search engine (and so is hurt or benefited by the idiosyncratic search behavior of FALCON-S), although it is an important one, as this sample is driven by Web search queries by actual users. The results of this empirical analysis show a transformation from the first-generation Semantic Web to the next-generation Web of Linked Data. The Semantic Web as it existed in its first generation was a motley collection of RDF triples, heavily dominated by a few exports of social networking data into FOAF and a long tail of complex academically-produced ontologies. Linked Data, at least the section of it that is of interest to users querying the Web for information, is dominated heavily by DBpedia and consists primarily of collections of triples that provide a minimal structure to natural language [16].
On the level of triples, there are some surprising conclusions. The triples on the Semantic Web contain a vast range of data, and the exact kinds of URIs used in the triples are somewhat unpredictable. However, the kinds of vocabularies actually deployed are almost entirely from a few large vocabularies, such as DBpedia, DBLP, WordNet, YAGO, and FOAF. This again points to a victory of Berners-Lee’s idea that a few large vocabularies with well-defined terms could dominate the Semantic Web [9]. The kinds of triples that structured this data do not contain many OWL terms optimized for inference, but consist almost entirely of relatively straightforward RDF(S) expressions for sub-class relationships and for annotations in natural language. Overall, Linked Data is primarily being used to provide structured relationships between fragments of natural language, and not for inference.
One could argue that these results are more characteristic of FALCON-S and DBpedia than of the second-generation Linked Data as a whole. However, we would respond that it is natural in decentralized information systems for power-law distributions to evolve, in which one source of data massively outweighs the others, and the ‘giant component’ of Linked Data is DBpedia [5]. In fact, if such a ‘giant component’ and long tail were not observed, it would be cause for suspicion. In conclusion, there is potentially lots of rich information in Linked Data form that could interest ordinary Web search users, and so one outcome of this analysis should be a greater interest in Linked Data from even mainstream information retrieval systems. However, for future work we wish to repeat this study over different Semantic Web search engines beside FALCON-S, which we recognize is a major limiting factor. Second, there are likely too many URIs in Linked Data for a given query, although to truly substantiate this claim the URIs returned by the search engines should ideally each be individually inspected, which is difficult in practice. Yet even at this point it seems likely that there are many co-referential URIs for the ‘same thing’ that are not explicitly modelled with owl:sameAs, and unless action is taken this growth of URIs will continue in the future. Unless there is URI re-use, many of the data-sources for Linked Data are more like semantic islands than parts of interconnected semantic continents.
6. ACKNOWLEDGEMENTS
Harry Halpin was supported in part by a Microsoft “Beyond Search” grant.
7. REFERENCES
[1] B. Adida, M. Birbeck, S. McCarron, and S. Pemberton. RDFa in XHTML: Syntax and Processing. W3C Recommendation, W3C, 2008. http://www.w3.org/TR/rdfa-syntax/.
[2] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the International and Asian Semantic Web Conference (ISWC/ASWC2007), pages 718–728, Busan, Korea, 2007.
[3] R. Baeza-Yates, L. Calderon-Benavides, and C. Gonzalez. Understanding user goals in web search. In Proceedings of String Processing and Information Retrieval (SPIRE), pages 98–109, 2006.
[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley-Longman, New York City, New York, USA, 1999.
[5] A.-L. Barabasi, R. Albert, H. Jeong, and G. Bianconi. Power-law distribution of the World Wide Web. Science, 287:2115, 2000.
[6] G. Beged-Dov, D. Brickley, R. Dornfest, I. Davis, L. Dodds, J. Eisenzopf, D. Galbraith, R. Guha, K. MacLeod, E. Miller, A. Swartz, and E. van der Vlist. RDF Site Summary (RSS) 1.0. Technical report, http://web.resource.org/rss/1.0/spec, 2001.
[7] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and J. Morissette. Bio2rdf: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008.
[8] T. Berners-Lee. What the Semantic Web can represent, 1998. Informal Draft. http://www.w3.org/DesignIssues/rdfnot.html (Last accessed on Sept. 12th 2008).
[9] T. Berners-Lee and L. Kagal. The fractal nature of the Semantic Web. AI Magazine, 29(3), 2004.
[10] P. Biron and A. Malhotra. XML Schema Part 2: Datatypes. Recommendation, W3C, 2004. http://www.w3.org/TR/xmlschema-2/.
[11] C. Bizer, R. Cyganiak, and T. Heath. How to publish Linked Data on the Web, 2007. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ (Last accessed on May 28th 2008).
[12] C. Bizer and A. Seaborne. D2RQ: Treating non-RDF databases as virtual RDF graphs. In Proceedings of International Semantic Web Conference, 2004.
[13] G. Cheng, W. Ge, and Y. Qu. FALCONS: Searching and browsing entities on the semantic web. In Proceedings of the World Wide Web Conference, 2008.
[14] A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in empirical data, 2007. http://arxiv.org/abs/0706.1062v1 (Last accessed October 13th 2008).
[15] D. Connolly. Gleaning Resource Descriptions from Dialects of Languages (GRDDL). Technical report, W3C, 2007. Recommendation.
[16] L. Ding and T. Finin. Characterizing the Semantic Web on the Web. In Proceedings of the International Semantic Web Conference (ISWC), pages 242–257, 2006.
[17] A. Ginsberg. The big schema of things. In Proceedings of Identity, Reference, and the Web Workshop at the WWW Conference, 2006. http://www.ibiblio.org/hhalpin/irw2006/aginsberg2006.pdf.
[18] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in www search. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 478–479, New York, NY, USA, 2004. ACM.
[19] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the size of the Semantic Web? In Proceedings of Conference on Semantic Systems (iSemantics), Graz, Austria, 2008. http://tomheath.com/papers/hausenblas-isemantics2008-size-of-semantic-web.pdf.
[20] D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the trec-8 web track. In Proceedings of the Text REtrieval Conference (TREC), pages 131–150. ACM, 2000.
[21] B. J. Jansen, D. L. Booth, and A. Spink. Determining the informational, navigational, and transactional intent of web queries. Information Process and Management, 44(3):1251–1266, 2008.
[22] D. Lenat. Cyc: Towards programs with common sense. Communications of the ACM, 8(33):30–49, 1990.
[23] A. Mikheev, C. Grover, and M. Moens. Description of the LTG system used for MUC. In Seventh Message Understanding Conference: Proceedings of a Conference, 1998.
[24] A. Miles and S. Bechhofer. SKOS Simple Knowledge Organization System reference. Working draft, W3C, 2008. http://www.w3.org/TR/skos-reference/.
[25] M. Newman. Power laws, pareto distributions and zipf’s law. Contemporary Physics, 46:323–351, 2005.
[26] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. International Journal of Metadata, Semantics, and Ontologies 2008, 3(1):37–52, 2008.
[27] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM), pages 683–690, New York, NY, USA, 2007. ACM.
[28] L. Sauermann and R. Cyganiak. Cool URIs for the Semantic Web. Technical report, W3C Semantic Web Interest Group Note, 2008. http://www.w3.org/TR/cooluris/.
[29] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
[30] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, New York, NY, USA, 2007. ACM.
[31] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, and R. Studer. Semantic wikipedia. In Proceedings of the International Conference on World Wide Web (WWW), pages 585–594, New York, NY, USA, 2006. ACM.
[32] D. Watts and S. Strogatz. A review of ontology based query expansion. Nature, 6684(393):409–410, 1998.
[33] C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. H. Ungar. Web-scale named entity recognition. In Proceedings of Conference on Information and Knowledge Management, pages 123–132. ACM, 2008.