=Paper= {{Paper |id=Vol-538/paper-16 |storemode=property |title=A Query-Driven Characterization of Linked Data |pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper16.pdf |volume=Vol-538 |dblpUrl=https://dblp.org/rec/conf/www/Halpin09a }} ==A Query-Driven Characterization of Linked Data== https://ceur-ws.org/Vol-538/ldow2009_paper16.pdf
              A Query-Driven Characterization of Linked Data

                                                                  Harry Halpin
                                       Institute for Communicating and Collaborative Systems
                                                        University of Edinburgh
                                                          2 Buccleuch Place
                                                      Edinburgh, United Kingdom
                                                            H.Halpin@ed.ac.uk


ABSTRACT                                                                    can satisfy these information needs. We present an analysis
Due to the Linked Data initiative, the once unpopulated Se-                 of a search-engine query log from a major hypertext search
mantic Web is now rapidly being populated with millions                     engine, Microsoft’s Live.com, and use this query log to sam-
of facts stored in RDF. Could any of this data possibly be                  ple Linked Data. As an added benefit, such an empirical
interesting to ordinary users? In this study, we run queries                analysis can prove or disprove some widely held assump-
extracted from a query log from a major hypertext search                    tions, such as whether or not there is an endemic over-use of
engine against a Semantic Web search engine to determine                    owl:sameAs and whether or not the Linked Data best practice
if the Semantic Web has anything of interest to the aver-                   recommendation of 303 redirection is being followed.
age Web user. There is indeed much Semantic Web infor-
mation that could be relevant for many queries for enti-                    2. PREVIOUS WORK
ties (like people and places) and abstract concepts, although
                                                                               For the first-generation of the Semantic Web, there was
these possibly relevant results are overwhelmingly clustered
                                                                            very little data-driven analysis of the ontologies, primarily
around DBPedia. We present an empirical analysis of the
                                                                            because so few were actually in existence. The first large-
results, focusing on their major sources, the structure of the
                                                                            scale analysis of the Semantic Web was done via an inspec-
triples, the use of various RDF and OWL constructs, and
                                                                            tion of the index of Swoogle by Ding and Finin [16]. Ding
the power-law distributions produced by both the URIs that
                                                                            and Finin first estimated the size of the Semantic Web to be
serve Linked Data and the URIs in the triples themselves.
                                                                            in 2006 4.91 million Semantic Web documents via search-
The issue of 303 redirection and URI identity is given in-
                                                                            ing Google for the media type application/rdf+xml [16].
depth treatment.
                                                                            As this might not include data that is hosted using the
                                                                            wrong media type, they estimated, using Google to include
Categories and Subject Descriptors                                          all FOAF files served as HTML and RSS 1.0 files, the size
H.3.d [Information Technology and Systems]: Metadata                        of the Semantic Web would optimistically be increased by
                                                                            two orders of magnitude. Although the study of Ding and
                                                                            Finin was of great importance as it was the first empirical
                                                                            study of the Semantic Web, this work has a number of
General Terms                                                               limitations [16]. Its primary limitation was that it was unknown
Experimentation                                                             if any of the Semantic Web documents indexed contained
                                                                            information that anyone would want to actually re-use. In-
Keywords                                                                    tuitively, most of the data on this first-generation Semantic
                                                                            Web was likely to be of limited value. For example, the vast
Linked Data statistics, query logs, information retrieval, power
                                                                            majority of data on the Semantic Web in 2006 was caused
law
                                                                            by Livejournal exporting every user’s profile as FOAF – usu-
                                                                            ally without the user’s knowledge – without linking to other
1.    INTRODUCTION                                                          URIs, serving with the correct MIME type, and deploying
  What are the characteristics of the Linked Data in the                    303 re-direction. The second main source of data in Ding
wild? There are two primary questions we are hoping to                      and Finin’s study, RSS 1.0, is also of limited value. RSS,
answer. First, has Linked Data changed from earlier ‘first                  originally an XML-based protocol generally used for news-
generation’ Semantic Web efforts? Second, is there any-                     feeds, was given a RDF-compatible syntax, creating RSS 1.0
thing worth finding for ordinary users in Linked Data? Only                 [6]. The very application of RDF in RSS 1.0 is questionable,
a moderately large-scale sampling and analysis of Linked                    as the data is primarily information about site updates, and
Data can answer this central question. Our method of in-                    so RSS 1.0 data is rarely merged, re-used, or even linked to
vestigation is to inspect what information needs actual users               in a manner that takes advantage of RDF. Due to the id-
are expressing via using a hypertext search engine, and then                iosyncratic nature of the data sources of the first generation
use a sample of these queries to determine if Linked Data                   Semantic Web, it is not surprising that the majority of the
Copyright is held by the International World Wide Web Conference Com-
                                                                            data likely contained little information that could satisfy the
mittee (IW3C2). Distribution of these papers is limited to classroom use,   information need of the average user of the Web.
and personal use by others.                                                    Due to the Linked Data initiative, the size of the Seman-
LDOW 2009, April 20, 2009, Madrid, Spain.                                   tic Web has recently increased in size by several orders of
ACM 978-1-60558-487-4/09/04.
magnitude due to the conversion of a large number of              noticeable ‘fits and starts’ as large data-sets are released, so
high-quality databases into RDF [12]. Since the study by Ding     each data-set can vastly alter any empirical analysis. The
and Finin missed the rise of Linked Data, the time is ripe        question is not how to avoid bias in sampling, but to choose
for more empirical studies of the Semantic Web. It is un-         the kind of bias one wants. We are aiming for a bias towards
clear how the dynamics of the Semantic Web are changing.          the ordinary user of the Web.
While the number of URIs indexed by Linked Data search              What information is available on the Semantic Web that
engines like Sindice shows that the general trend of the          ordinary users are actually interested in, and how do we
number of URIs on the Semantic Web visually follows a             sample this data? The obvious candidate for exploring this
‘power-law,’ the correct mathematical analysis has not been done to   would be to look at a major search engine query log, as it gives
show this to be the case [26]. The only large-scale study of      a sample of the interests of many users in aggregate. Since
Linked Data at this time has been by Hausenblas et al., and       Semantic Web search engines are currently used mostly by
it estimated the size of the Linked Data at approximately         Semantic Web developers and not by ordinary users, the
2 billion triples [19]. The focus of that study was only on       query log of a popular hypertext search engine should be
interlinking between data-sets, and it estimated that there       sampled as opposed to a more specialized search engine.
were approximately 3 million interlinks between the various       The entire bet of the Semantic Web is that it will contain
data-sets. The most popular interlinking property by far          information that many ordinary users will want to re-use
was dbpedia:hasPhotoCollection, with approximately 2 mil-         and merge via Semantic-Web enabled applications, and that
lion occurrences, most likely to be due to the term being         this information will primarily be about non-information re-
used by a Linked Data exporter around the popular photo-          sources such as entities like people and places and abstract
hosting service Flickr [2]. In summary, the Linked Data           concepts. Thus, the ideal sampling of the Semantic Web
phenomenon is huge, much larger than the first-generation         would be to extract query terms referring to physical entities
Semantic Web, and its properties have not been fully stud-        and abstract concepts from a hypertext search engine query
ied. In particular, there has been little work on determining     log, and then by virtue of a Semantic Web search engine we
how the issues of the reference of URIs play out in the wild      can determine precisely how much information Linked Data
given by Linked Data.                                             contains on these subjects.

                                                                  3.1 The Live.com Query Log
3.   SAMPLING LINKED DATA VIA QUERY                                  There has been much work in query log analysis in order
     LOGS                                                         to discover how to best satisfy the information needs of
   The main problem facing any empirical analysis of the          users on the Web. Since most search query logs of any size
Semantic Web is one of sampling. As almost any database can       belong to search engine companies, it is often difficult for
easily be exported to RDF, any sample of the Semantic Web         researchers outside those companies to analyze these query
can be biased by the automated release of large, if ultimately    logs, and therefore most research in search query logs deals
useless, data-sets. This was demonstrated in an exemplary         with small or special-purpose query logs, such as the Web
fashion by the release of RSS 1.0 data. RDF vocabulary            track in the TREC competition [20]. A few employees of
terms that have little content, such as rss:item, quickly bias    large search corporations have released detailed studies of
the statistical analysis. With the advent of Linked Data, this    their search engine query logs. In particular Silverstein et
has to some extent already happened with large numbers of         al.’s analysis of a billion queries in the Altavista query log is
databases being released as Linked Data ranging from the          considered to be a large ‘gold-standard’ study of query logs
BBC’s John Peel recordings to the MusicBrainz audio CD            [29]. In order to extract concepts and entities, we analyze
collection [19]. How much of Linked Data is aimed at              the query log of approximately 15 million distinct queries
general use? Obviously, components like DBPedia, the export       from Microsoft Live Search, and all references to the ‘query
of Wikipedia to Linked Data, could be very useful [2]. The        log’ are to this Microsoft query log, which is provided by
vast majority of data released into the Semantic Web is of        Microsoft due to a 2007 ‘Beyond Search’ award. This query
appeal only to a niche audience, such as the large appeal of      log contains 14,921,285 queries. Of these queries, 7,095,302
Bio2RDF to health care and life-sciences. Just as RSS 1.0         (48%) were unique. Corrected for capitalization, 4,465,912
and the Livejournal export of FOAF biased sampling of the         (30%) were unique. Of all queries, only 228,593 (2%) queries
first-generation Semantic Web, the release of a large Linked      used some form of advanced keywords, while 709,102 (5%)
Data set such as the Bio2RDF, containing approximately            used boolean operators and 266,308 (2%) used quotation,
65 million triples and so rivaling the size of DBPedia, can       leading to a total of 1,204,003 queries (17% of unique queries, 8% of
bias any sampling of Linked Data [7]. For example, if one         all queries) using some advanced techniques provided by the search engine. The
just counted the number of URIs used on the Semantic Web,         average number of terms per query was 1.76. Note that these
one would quickly find that bio2rdf:xProteinLinks would           extremely brief queries are normal for hypertext Web search
prove to be, in sheer number, a very popular term despite         engines, with an average query length of 2.35 being reported
its relative lack of use outside the biomedical community. It     by Silverstein et al. for the Altavista query log [29]. Since
is a small step then to imagine ‘semantic spamming’ that re-      we did not want to deal with queries that were only typed
leases large amounts of bogus URIs into the Semantic Web.         once or a few times, as these may not be representative of
Furthermore, due to the open nature of the Web, it is difficult,  most users’ interests, we did not select for further use any
if not impossible, to determine how many actual separate          queries with a frequency less than 10, resulting in a
providers of Semantic Web data there are, so choosing seed        reduction of 37% from the total query log of 7,095,302.
samples or ‘weighting’ any sample a priori is difficult. Unlike
the original Web, which grew at least in an organic fashion       3.2 Extracting Queries for Entities and Concepts
for its first few years, the Web of Linked Data grows in very
   Automatically classifying informational queries is difficult.                      7311    david blaine
Rule-based approaches that claim to work over entire query                            4039    kelly blue book
logs like those of Jansen et al. [21] are dubious at best,                            3053    chase
since they work by applying very loose specifications such                            2997    jessica alba
as “query length greater than 2” and “any query using natu-                           2100    nick
ral language terms.” More promising work has applied both                             1415    office max
supervised and unsupervised machine-learning to discover                              1280    michael hayden
informational queries, but only achieved an accuracy of 50%                           1139    harley davidson
[3]. A number of machine-learning algorithms could be                                 1098    marcus vick
employed to learn named entities, but the sparse amount of                            1092    keith urban
linguistic context in query logs makes identifying named
entities difficult in an unsupervised manner, and there is virtually  Table 1: Top 10 Entity Queries in Query Log
no labeled data for supervised learning [33]. Even most rule-
based approaches for named entity recognition rely heavily
upon capitalization and punctuation, such as ‘I.B.M.’ and
‘Gustave Eiffel,’ features that are lacking from query logs        length one where the query had a hyponym and hypernym,
[23].                                                              due to the difficulty of WordNet dealing with some multi-
   We call queries that are automatically identified to be about   word queries. This assured that the query was for a class
physical entities in the query log entity queries. For the         that was suitably abstract (having a hyponym) but not so
discovery of entity queries, people and places are obvious         abstract as to be virtually meaningless (had a hypernym).
places to begin. An updated version of the system that             This resulted in a more restricted 16,698 concept queries
was the highest performer at MUC-7 [23], a straightforward         (.4% of total query log). The top 10 concept queries are
gazetteer-based and rule-based named entity recognizer, was        given in Table 2. Again, a number of clearly transactional
employed to discover the names of people and places. The           queries have managed to find themselves into the concept
gazetteer for names was based on a list of names maintained        queries, such as ‘chase’ and ‘drudge,’ as well as a number
by the Social Security Administration and the gazetteer for        of queries where the sense of a word has been taken over
place names was based on the gazetteer provided by the             by a proper name, such as ‘sprint’ and ‘aim.’ Again, this
Alexandria Digital Library Project. Although it could be           is due to the preponderance of navigational names towards
possible to separate out people and places, this was not           the top of the query distribution. Of a random sample of
done. First, both of these are types of entities. Second,          100 concept queries, a judge considered 98% to be correct.
the names of many locations, such as ‘Paris’ or ‘Georgia’,         The top ten concept queries are presented in Table 2. While
can also be used as personal names. This gazetteer-based           some of the queries could be considered somewhat
approach was chosen to provide high precision, even at the         navigational (such as those for maps and dictionaries), they could
cost of a dramatically reduced recall. This is an acceptable       all be considered informational queries about some abstract
trade-off as we are attempting only to sample the number of        concept.
queries that would be likely to have URIs on the Semantic
Web. A high-quality sample of the query log is more impor-                               11383    weather
tant than a large one for this purpose. Of a random sample                               10321    dictionary
of 100 entity queries, a judge considered 94% to be correctly                            3675     people
categorized as entities such as people or places.                                        3217     music
   From the pruned unique queries in the query log, totaling                             2192     autism
4,465,912 queries, a total of 509,659 queries (11%) were iden-                           1468     map
tified as either people or places by the named-entity recog-                             1198     travel
nizer. The top 10 entity queries are given in Table 1. Some                              1191     pregnancy
transactional and navigational queries, despite their rela-                              1104     news
tively lower frequency overall in the query log, are highly                              1052     charter
clustered towards the top of the query distribution. These
navigational queries such as ‘chase’ and ‘office max’ have
                                                                     Table 2: Top 10 Concept Queries in Query Log
clearly snuck into the top ten due to their use of common
names in their website names. A legitimate number of real
names, such as ‘jessica alba’ and ‘marcus vick’ were discov-
ered.
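
A minimal sketch of the gazetteer-based idea described above, written in Python. It is not the MUC-7 system used in the study: the two word lists are tiny hypothetical stand-ins for the Social Security Administration name list and the Alexandria Digital Library gazetteer, and the matching rule (first token is a known name, or the whole query is a known place) is an assumption made for illustration. It does show why a query such as ‘chase’ can be picked up when the string also appears in a name list.

```python
# Hypothetical, tiny gazetteers; the study used far larger lists.
FIRST_NAMES = {"jessica", "david", "keith", "marcus", "michael", "chase"}
PLACE_NAMES = {"paris", "georgia", "edinburgh", "madrid"}


def is_entity_query(query, names=FIRST_NAMES, places=PLACE_NAMES):
    """Return True if the query looks like a person or a place.

    People and places are deliberately not separated, mirroring the text.
    Query logs lack capitalization, so matching is done on lower-cased tokens.
    """
    tokens = query.lower().split()
    if not tokens:
        return False
    # Whole query is a known place name, e.g. 'paris'.
    if " ".join(tokens) in places:
        return True
    # First token is a known given name, e.g. 'jessica alba'.
    return tokens[0] in names


def entity_queries(queries):
    """Keep only the queries classified as entity queries, preserving order."""
    return [q for q in queries if is_entity_query(q)]
```

For example, `entity_queries(["jessica alba", "weather", "paris"])` keeps the first and last query. A rule this simple favors precision on listed names and will miss any entity that is absent from the lists.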
   A method for discovering abstract concepts in the query
                                                                   3.3 Power-Law Detection
log is more challenging. These queries are called concept             The frequency of queries, when rank-ordered, follows what
queries, queries that are automatically identified to be about     is known as a ‘power-law’ distribution, with a relatively
abstract concepts in the query log. Previous attempts at           small number of very popular queries and a long tail of
discovering abstract concepts have employed machine-learning       queries only occurring once or twice, where most of the mass
over truly massive query logs and document collections from        of the distribution is in the long tail and the ‘top’ of the dis-
Google [27]. Since this massive amount of data was not             tribution exponentially decreases. Since this distribution is
available, we employed WordNet instead. WordNet consists           common in search on the Web, we will define it precisely: A
of approximately 207,000 words with unique synsets. Our            power-law is a relationship between two scalar quantities
algorithm for discovering abstract concepts in query logs us-      x and y of the form:
ing WordNet was straightforward: we only chose queries of
                                                                                             $y = c x^{\alpha} + b$             (1)
where α and c are constants characterizing the given power-          the conservative p < .1. The Kolmogorov-Smirnov test is
law, and b being some constant or variable dependent on x            valid even for power-law distributions since Q’s cumulative
that becomes constant asymptotically. Typically it is                density function is asymptotically normally distributed and
applied to rank-ordered frequency diagrams, where the                this can be compared to the cumulative density function of
frequency of some measurement is given on the vertical axis          P.
while the rank order of the measurements in terms of their              The query frequencies for entity and concept queries are
frequency is given on the horizontal axis. The α exponent is         plotted in logarithmic space in Figure 1. Both entity and
the scaling exponent that determines the slope of the top            concept queries appear to be linear in log-space, and so can
of the distribution and provides the remarkable property of          be considered candidates for power-laws. Using the method
scale-invariance, such that if a true power-law is observed,         described above, the α of the queries for entities was cal-
as more samples are added to the distribution, the α re-             culated to be 2.31, with long tail behavior starting around
mains constant, i.e. the distribution is ‘scale-free’ [32]. It       a frequency of 17 and a Kolmogorov-Smirnov D-statistic
is crucial to note that a power-law distribution violates as-        of .0241, indicating a significant good fit. The α of the
sumptions of the normal Gaussian distribution, such that             queries for concept queries was calculated to be 2.12, with
routine statistics such as averages and standard deviations          long tail behavior starting around a frequency of 36 with a
can be and usually are misleading. In fact, one of the surest        Kolmogorov-Smirnov D-statistic of .0170, also indicating a
signs of a non-normal distribution like a power-law                  significant good fit for the power law. Given their two
distribution is a very large standard deviation. Is such a distribution   remarkably similar α statistics and high goodness of fit, one
evident from Linked Data? One important question is how              can safely conclude that these query logs do indeed follow
to detect power-law distributions in actual data. Equation           power-law distributions. This indicates our sample of
1 can also be written as:                                            entities and concepts is representative of larger query logs,
                                                                     which are well known to follow power-law distributions [4].
                    $\log y = \alpha \log x + \log c$        (2)
   When written in this form, a fundamental property of
power-laws becomes apparent: When plotted in log-log space,
power-laws are ‘straight’ lines. Thus, the most widely used
method to check whether a distribution follows a power-law
is to apply a logarithmic transformation, and then perform
linear regression, estimating the slope of the function in
logarithmic space to be α, as done by Ding and Finin [16].
However, standard least-square regression has been shown
to produce systematic bias, in particular due to fluctuations
of the long tail [14]. To determine a power-law accurately
requires minimizing the bias in the value of the scaling
exponent and the beginning of the long tail via maximum
likelihood estimation. See Newman [25] and Clauset et al. [14]
for the technical details.

[Figure 1 plot: log-log axes; vertical axis ‘Query Popularity’, horizontal axis ‘Popularity-ordered queries’.]

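
The maximum likelihood approach mentioned above can be sketched in a few lines of Python. This is a simplified illustration and not the code used for the reported results: it assumes the start of the long tail (`x_min`) is already known, it uses the continuous-data estimator for a density proportional to x^(-α) even though query frequencies are integers, and the function names and the synthetic data generator are hypothetical. The second function computes the Kolmogorov-Smirnov D-statistic as the largest gap between the empirical and the fitted cumulative distributions.

```python
import math
import random


def fit_alpha(samples, x_min):
    """Maximum likelihood estimate of the scaling exponent alpha.

    Assumes a continuous power law p(x) ~ x^(-alpha) for x >= x_min.
    """
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(samples, x_min, alpha):
    """Kolmogorov-Smirnov D-statistic between the data and the fitted power law."""
    tail = sorted(x for x in samples if x >= x_min)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return d


def draw_power_law(n, x_min, alpha, seed=0):
    """Synthetic power-law samples via inverse transform sampling (for testing only)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]
```

A typical use is to generate synthetic data with a known exponent, for instance `draw_power_law(20000, 1.0, 2.5)`, and confirm that `fit_alpha` recovers a value close to 2.5 with a small D-statistic before applying the same functions to real frequency counts. Choosing `x_min` itself, which the text describes as estimating the beginning of the long tail, is left out of this sketch.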
   Determining whether a particular distribution is a ‘good
fit’ for a power-law is difficult, as most ‘goodness-of-fit’ tests
employ normal Gaussian assumptions violated by poten-                Figure 1: The rank-ordered frequency distribution of
tial power-law distributions. Luckily, the non-parametric            extracted entity and concept queries, with the entity
Kolmogorov-Smirnov test can be employed for any distribu-            queries given by green and the concept queries by blue.
tion and so is thus ideal for use measuring ‘goodness-of-fit’
of a given finite distribution to a power-law function. While
the details are given at length in Clauset et al. [14], intu-        3.4 Querying Linked Data with FALCON-S
itively the Kolmogorov-Smirnov test can be thought of as               Both the concept queries and the entity queries are used
follows: Given a reference distribution P , such as an ideal         to query the Semantic Web. Since our goal was to discover
power-law distribution generating function, and a sample             how much of interest for ordinary users was present on the
distribution Q of size n suspected of being a power-law,             Semantic Web, one problem with using the entire query log
where one is testing the null hypothesis that Q is drawn             was that it would contain a vast amount of unique queries
from P, then the Kolmogorov-Smirnov test compares the                that would likely never be repeated. So, we excluded
cumulative frequency of both P and Q to discover the                 a portion of the long tail from the study by removing all
greatest discrepancy (the D-statistic) between the two               queries with a frequency of less than 10. The parameter 10 was
distributions. This D-statistic is then tested against the critical  chosen as it was the number that could reduce both entity
value of the D-statistic at n, which varies per function.            and concept queries to the same order of magnitude. Due to
The null hypothesis is rejected if the D-statistic is greater than   the power-law behavior of both entity and concept queries,
the critical value for n, p being the probability that the           this truncation consists of ‘removing’ a large amount of the
distribution was drawn from a power-law generating func-             long tail, while maintaining the entire ‘top’ of the power-
tion given the estimated parameters. In order to determine           law distribution, as well as some significant component of
how well the power-law method fits, whenever a power-law             the long tail. This procedure is justified insofar as the ‘long-
is reported, the D-statistic is also reported, and we will de-       tail’ likely consists of queries that are never or very rarely
termine whether or not the fit was significant according to          repeated, while the remaining queries represents queries that
are likely to be repeated. This pruning of low-frequency
queries from our sampling does exclude many ‘difficult’ or
‘specialist’ queries, but we are aiming for queries that are
general-purpose and popular. We call these queries with
more than 10 URIs returned from the Semantic Web the
crawled queries to distinguish them from the greater query
log. Likewise, crawled entity queries are entity queries
with more than 10 URIs returned from the Semantic Web,
and similarly for crawled concept queries.
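
The frequency-based truncation described in this section can be sketched as follows. This is an illustrative Python fragment, not the tooling used in the study: the function name is hypothetical, and lower-casing each query before counting is an assumption based on the earlier remark that the log was corrected for capitalization.

```python
from collections import Counter


def prune_query_log(queries, min_frequency=10):
    """Drop the long tail: keep queries seen at least `min_frequency` times.

    Queries are stripped and lower-cased before counting (an assumption),
    so 'Weather' and 'weather' are counted as the same query.
    Returns a dict mapping each surviving query to its frequency.
    """
    counts = Counter(q.strip().lower() for q in queries)
    return {q: c for q, c in counts.items() if c >= min_frequency}
```

With a threshold of 10, a log containing ‘weather’ twelve times, ‘Weather’ three times, and ‘rare query’ twice is reduced to the single entry `{"weather": 15}`.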
[Figure 2 plot: log-log axes; vertical axis ‘Frequency of Semantic Web URIs returned’, horizontal axis ‘Frequency-ordered Returned Semantic Web URIs’.]

   This truncation reduced the number of queries significantly,
from 587,283 to 7,848 queries, removing 99% of the
queries. It reduced the number of entity queries from 570,585
to 5,308 (a 99% reduction) and the number of concept
queries from 16,698 to 2,540 (an 85% reduction). This gap
in the result of pruning off the ‘long tail’ is interesting, as it
shows that while there are fewer concept queries
than entity queries overall, concept queries are repeated by a
order of magnitude or so more often than entity queries. The         Figure 2: The rank-ordered frequency distribution of
only caveat is that our identification of concept queries via        the number of URIs returned from entity and concept
WordNet is likely more stringent than our identification of          queries, with the entity queries given by green and the
entity queries, and thus leads to less concept queries overall.      concept queries by blue.
Furthermore, the vast majority of entity queries, as opposed
to concept queries, appear to be queries that are only once
or a very few times. This would make a certain amount
of sense, as many queries for people and places are not for          the insignificant .0077 (p > .05), while for concept queries,
famous people and places, but for infrequently-mentioned             the correlation was the still insignificant at .0125 (p > .05).
people and places, such as wayne way san mateo and sara              Just because a query is popular or unpopular does not mean
matthews. Some concepts that were as diverse as gastropod            the Semantic Web will be more or less likely to satisfy the
and accolade. Still, the crawled queries are still biased sig-       information need of the query. This makes sense, as the vast
nificantly in favor of entity queries, being composed of 68%         majority of queries are heavily dependent on current events
being entity queries and only 32% concept queries.                   and fashion, and the Linked Data data sources are not up-
   The FALCON-S Object Semantic Web search engine [13]               dated often enough to deal with this kind of information, so
was used to query the Semantic Web for selected entity and           there is an inevitable temporal lag between the time infor-
concept queries between August 3rd and 4th 2008. We rec-             mation appears in the world outside the Semantic Web and
ognize that this a major weakness of the study, as its index         its digitization on the Semantic Web. Yet as shown by Fig-
may not be a representative sample of the entire Linked              ure 2, the amount of possibly useful information for the vast
Data Web, but it is a significant sample regardless. At the          majority of queries is still surprisingly large, although how
time, FALCON-S seemed to have the best rankings, and a               many of the returned URIs are actually relevant to human
comparable index to other engines. The results of running            users is not yet known.
the crawled queries against a Semantic Web search engine
were surprisingly fruitful, although varying immensely. For
entity queries, there was an average of 1,339 URIs (S.D.                                                             5
                                                                                                                             5
                                                                                                                          x 10

8,000) returned per query. On the other hand, for concept
                                                                                                                    4.5
queries, there were an average of 26,294 URIs (S.D. 14,1580)
returned per query, with no queries returning zero docu-                                                             4


ments. Given the high standard deviation of these results,                                                          3.5

it is likely that there is either a power-law in the resulting                                                       3
URIs for the queries, or some other non-normal distribu-
                                                                             URIs




                                                                                                                    2.5
tion. As shown in Figure 2, when plotted in logarithmic
space, both entity queries and concept queries show a distri-                                                        2

bution that is heavily skewed towards a very large number of                                                        1.5

high-frequency results, with a steep drop-off to almost zero
                                                                                                                     1
results instead of the characteristic long tail of a power law.
Far from having no information that might be relevant to                                                            0.5


ordinary user queries, the Semantic Web search engines re-                                                           0
                                                                                                                             500   1000        1500            2000         2500       3000
turned either too many URIs possibly relevant to the query                                                                                     Popularity−ordered Queries


or none at all.
   Another question is whether or not there is any correlation
between the amount of URIs returned from the Semantic                Figure 3: The rank-ordered popularity of entity and
                                                                     concept queries is on the x-axis, with the y axis displaying
Web and the popularity of the query. As shown by Figure 3,
                                                                     the number of Semantic Web URIs returned, with the
there is no correlation between the amount of URIs returned
                                                                     entity queries given by green and the concept queries by
from the Semantic Web and the popularity of the query. For
                                                                     blue.
entity queries, the Spearman’s rank correlation statistic was
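The rank-correlation check between query popularity and the number of URIs returned can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names and the sample numbers are hypothetical, ties are given averaged ranks, and no significance test (the p-value) is computed.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: how often each query occurred in the log, and how many
# Semantic Web URIs the search engine returned for it.
query_popularity = [900, 400, 120, 60, 15]
uris_returned = [12, 30000, 450, 0, 2700]
print(spearman_rho(query_popularity, uris_returned))
```

A value of rho near zero, as reported above for both entity and concept queries, means that knowing a query's popularity rank says almost nothing about the rank of its result count.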
4.   EMPIRICAL ANALYSIS OF THE SEMANTIC WEB

   Surprisingly, there is a deluge of possible Semantic Web URIs for any given query. Due to
the high number of results for each query, we restricted our analysis to the top 10 Semantic
Web URI results for each query as given by FALCON-S’s ranking algorithm, and distinguish this
subset from all the URIs returned by the Semantic Web by calling this subset the crawled URIs.
Concept URIs are crawled URIs from the crawled concept queries while entity URIs are crawled
URIs from the crawled entity queries. Although crawled URIs are a small subset of the total
URIs retrieved, given that users in general inspect only the first ten URIs returned by a
search [18], it makes more sense to sample these ten URIs per query than to sample every URI
retrieved. The crawled URIs totaled 70,128 URIs, composed of 25,400 (36%) concept URIs and
44,728 (64%) entity URIs. These URIs were crawled using HTTP GET with a preference for the
media type application/rdf+xml in order to prefer RDF files served by content negotiation, and
any 303 redirection was followed.
   Of all crawled queries, a total of 6,673 (85%) had at least 10 crawled URIs. All concept
queries had at least 10 crawled URIs, as did 4,133 of the entity queries; the remaining 1,175
entity queries (22% of all crawled entity queries) did not have 10 crawled URIs. Inspecting
just the set of queries that did not have 10 crawled URIs, the average number of URIs returned
was 2.89 (S.D. 2.88). So, the trend observed earlier was repeated in this smaller data-set,
namely that while most of the time too many URIs are retrieved from the Semantic Web, sometimes
no URIs are retrieved from the Semantic Web for certain entity queries. Looking at the data
more closely, 357 (30%) of the crawled queries with fewer than 10 results returned no URIs,
while 138 (12%) returned a single URI and 113 (10%) returned two URIs. These queries with zero
results seem to be mostly for not well-known places such as playa linda (a hotel in Majorca),
fairly unknown people such as william ravies, or misspellings or popular truncations of names
for people such as steven colbertbush. This observation helps explain the sudden drop in
Semantic Web URIs returned for queries in Figure 3. There was little overlap between the
crawled URIs retrieved by different queries, with an overlap in entity queries of 546 URIs (1%)
and an overlap in concept queries of 1,031 URIs (4%). In other words, the various queries
weren’t just retrieving the same small group of URIs over and over again.

4.1 URI-based Statistics

   In this section, we inspect the various kinds of statistics we can detect on the
‘macro-level’ of the crawled URIs without actually accessing any Semantic Web documents from
the URIs.
   The HTTP status codes returned by attempting to access the various crawled URIs are given in
Table 3. In particular, the most revealing statistic is that the majority of the Semantic Web
sampled by the crawled URIs is served using the 303 convention, not the hash convention. In
fact, a total of 51,762 (73%) of crawled URIs use the 303 convention, while only 1,662 (2%) of
the crawled URIs use the hash convention. Of these URIs using the hash convention, manual
inspection showed many to be FOAF files. This shows the vast majority of Linked Data is
following the 303 convention and so obeying the W3C and the guide to publishing Linked
Data [11]. This statistic as regards usage of the 303 convention is misleading in the broad
sense, as most of the URIs are from a single source, DBpedia, as shown later in Table 4.
   The majority of URIs, 51,873 (74%), served a Semantic Web document via 303 redirection, and
so returned the 200 status code when the Semantic Web document was accessed after the
redirection. 200 status codes without 303 redirection still form a substantial fraction of
Semantic Web URIs. There are several reasons for this: all hash convention URIs would by
default still technically commit a redirect to be served by a 200 status code. However, this is
only a minority (27%) of those URIs returning a 200 status code. The rest are likely caused by
people serving RDF who do not have the access to the Web server configuration needed to serve
RDF using 303 redirection, while many others may have started serving RDF before the W3C TAG
decision [28] was made or are not aware of Linked Data best practices. For example, some
earlier RDF-enabled repositories like W3C WordNet did redirection by 300 redirection. A small
percentage may be ordinary web-pages, perhaps containing some meta-data as enabled by GRDDL,
that just happened to be indexed by the Semantic Web search engine [15]. Furthermore, of these
crawled URIs, 9,156 (13%) had no Semantic Web document that was accessible via HTTP, shown by
the use of a 4xx or a 5xx-level status code.

        URIs     Share      HTTP status code
        51,873   73.97%     303
        6,061    8.65%      200
        4,517    6.44%      404
        4,257    6.07%      500
        3,147    4.49%      300
        246      0.35%      406
        20       0.03%      403
        4        0.00%      302
        3        0.00%      502

Table 3: Top 10 HTTP Status Codes for crawled URIs

   The top 10 hosts of Semantic Web data in the crawled URIs are given in Table 4. DBpedia, the
export of Wikipedia to RDF, dominates the results with 83% of all URIs coming from either
Wikipedia or DBpedia [2]. The W3C itself is the third largest exporter of RDF with a share of
5%. Upon closer inspection, most of the URIs crawled from the W3C derive from the W3C-hosted
export of the linguistic database WordNet. The domain of the Freie Universität Berlin has a
significant 2% of all RDF data, which is due primarily to its Flickr photo export to RDF. An
RDF version of Cyc and the biomedical data hosting site Bio2RDF also host small but significant
amounts of Semantic Web data [22]. The Russian blog hosting site Liveinternet.ru carries on the
tradition of FOAF exporting of Livejournal. Truesense is another export of WordNet to RDF,
although not as frequently used as W3C WordNet. Towards the end of the ranking there is the RDF
version of Universität Trier’s widely used DBLP academic citation database and Ontoworld.org,
an RDF-enabled wiki for the Semantic Web research community [31].
   The average number of URIs hosted by a domain name
was 1,268 (S.D. 16,060), with the average number of entity URIs hosted by any domain being
1,236 (S.D. 15,458) and the average number of concept URIs hosted by a domain being 1,0327
(S.D. 6,650). The very high standard deviations are usually a sign of a power-law distribution,
as shown in Figure 4. Attempting to fit a power-law distribution, the α of the rank-ordered
domain list frequency distribution is 1.53, with long tail behavior starting around 175 and a
Kolmogorov-Smirnov D-statistic of .1414, indicating an insignificant fit for the power-law
distribution. In other words, while a few sources like DBpedia dominate the crawled URIs, with
a rapidly decreasing number of smaller sites such as Cyc and the W3C, the long tail of
individuals hosting their FOAF files on their personal websites is still rather insignificant
compared to the ‘top’ major sites hosting Linked Data. This is because the Linked Data is being
artificially generated in large ‘chunks’ by projects like W3C WordNet and DBpedia, and so does
not organically form the power-law distribution characteristic of naturally-evolving complex
systems.

Figure 4: The rank-ordered distribution of the domain names hosting Semantic Web data from the
crawled URIs ordered by number of URIs hosted. (Logarithmic axes; x-axis: URI frequency-ordered
domain names; y-axis: number of URIs crawled; series: entity URIs, concept URIs, total Semantic
Web URIs.)

        URIs     Share      Domain name
        54,698   78.00%     dbpedia.org
        3,584    5.11%      wikipedia.org
        3,448    4.92%      w3.org
        1,704    2.43%      fuberlin.de
        811      1.16%      cyc.com
        701      1.00%      bio2rdf.org
        599      0.85%      liveinternet.ru
        417      0.59%      truesense.net
        322      0.46%      dblp.unitrier.de
        314      0.47%      ontoworld.org

Table 4: Top 10 Domain Names for URIs for Crawled URIs

4.2 Triple-based Statistics

   In this section, we move our analysis down from the level of URIs to the level of the
triples accessible from the URIs. Since a number of crawled URIs were inaccessible, this
reduced the total number of accessible crawled URIs to 60,972, a reduction of 13% from the
crawled URIs. The accessible crawled URIs contained 24,074 accessible crawled concept URIs (95%
of all crawled concept URIs) and 36,898 accessible crawled entity URIs (82% of all crawled
entity URIs). Thus, the accessible crawled URIs maintained a bias towards entity URIs (61% of
all accessible crawled URIs) as compared to concept URIs (39% of all accessible crawled URIs).
Each of the crawled accessible URIs was accessed, and this resulted in a total of 59,228 Web
representations with only 48 URIs not allowing access to a Semantic Web document. These
non-Semantic Web documents were usually ordinary web-pages from which RDF triples could not be
extracted via GRDDL [15] or RDFa [1]. We will call the Semantic Web documents retrieved in this
way the crawled Semantic Web documents, and the total sum of triples in these documents the
crawled triples.
   There were a total of 411,574 RDF triples in the crawled triples, with 242,829 (59%) triples
for concepts and 168,745 (41%) triples for entity URIs. Concepts, despite being fewer in
number, seem to require more triples to describe than entities. The internal structure of these
triples is of surprising interest. Of these triples, there were a total of 1,051 triples
containing blank nodes, a measly .25% of all triples in the corpus, of which 772 (73%) were in
the subject position and only 279 (27%) were in the object position. This means that the use of
blank nodes, whose purpose is as syntactic placeholders in URIs for objects like lists and in
representing n-ary arguments in RDF, is almost non-existent in our sample. Removing blank
nodes, the composition was split between URI nodes (66%) and a surprisingly large minority of
RDF literal nodes (34%). These literals contain some form of information in either
‘unstructured’ natural language or some form of structured information in a formal language,
such as integer values.
   Of the literals, a total of 403,119 were RDF string literals, while only 2% were of some
other data type, with the top 10 most frequent data-types given in Table 5. The most frequent
data-types are from XML Schema [10], while others are customized for DBpedia. It appears that
the vast majority of RDF in the Semantic Web of interest to average users is simple URI-based
triples with rich information in natural language. This also goes against the intuition that
the vast majority of Semantic Web data that is of interest to ordinary users would be highly
structured data of exported databases [8]. Instead, what is of interest in Linked Data is
stored mainly in natural language, with RDF adding only a minimal structure to what are
essentially fragments of natural language. While it could be argued that this particular
finding is merely an artifact of DBpedia, given that our querying includes other data-sets,
this finding may well be generalizable. We are not studying the Semantic Web as some of its
designers would like to have it, but as it actually exists, and part of its existence is that
DBpedia forms a huge central cluster that for ordinary users is the most interesting and useful
part of Linked Data.
   One interesting question is the predominance of the various kinds of Semantic Web knowledge
representation terms on the Semantic Web, since this would show what kinds of inference could
actually be deployed on the Semantic Web. First, of the total 1,093,212 URIs in triples
harvested from the crawled accessible URIs, only 243,776 (22%) were from one of the primary W3C
Semantic Web knowledge representation languages, either RDF, RDF(S), or OWL.
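One way to attribute a term URI to RDF, RDF(S), or OWL is to test it against the three W3C namespace URIs. The sketch below assumes that approach; the paper does not say how the attribution was actually done, and the function names and the sample URI list are hypothetical.

```python
from collections import Counter

# Namespace URIs of the three W3C vocabularies named in the text.
NAMESPACES = {
    "RDF": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "RDF(S)": "http://www.w3.org/2000/01/rdf-schema#",
    "OWL": "http://www.w3.org/2002/07/owl#",
}


def vocabulary_of(uri):
    """Name the W3C vocabulary a URI belongs to, or 'other' if none matches."""
    for name, prefix in NAMESPACES.items():
        if uri.startswith(prefix):
            return name
    return "other"


def vocabulary_shares(uris):
    """Map each vocabulary name to (count, fraction of all URIs seen)."""
    counts = Counter(vocabulary_of(uri) for uri in uris)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


# Hypothetical URIs, standing in for the URIs found in harvested triples.
sample_uris = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://dbpedia.org/resource/Edinburgh",
    "http://xmlns.com/foaf/0.1/name",
]
for name, (count, share) in vocabulary_shares(sample_uris).items():
    print(f"{name}: {count} ({share:.0%})")
```

Prefix matching only identifies which vocabulary a term comes from; it does not check that the term is actually defined in that vocabulary.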
   403,119   97.95%    RDF plain literal                          controversial owl:sameAs term, which is used to declare some
   3,103     0.75%     w3c:/XMLSchema#integer                     sort of global equivalence between two URIs. While a tiny
   2,789     0.68%     w3c:/XMLSchema#string                      portion (.47%) of overall Semantic Web modelling term us-
   1,185     0.29%     w3c:/XMLSchema#double                      age, it is far from insignificant, with 1,157 occurrences. The
   522       0.13%     w3c:/XMLSchema#date                        use of owl:sameAs in the wild is far different than the role it
   248       0.06%     w3c:/XMLSchema#float                       plays in popular debate within the Semantic Web commu-
   136       0.03%     w3c:/XMLSchema#gYear                       nity would suppose. Logicians hold that owl:sameAs is only
   65        0.02%     w3c:/XMLSchema#gYearMonth                  for what is properly considered individuals in description
   59        0.01%     dbpedia:Rank                               logic, so that classes and properties should use the more re-
   46        0.01%     dbpedia:Dollar                             stricted and semantically correct owl:equivalentClass and
   14        0.00%     w3c:/XMLSchema#int                         owl:equivalentProperty. Yet this best practice in logic
   9         0.00%     dbpedia:Percent                            hasn’t reached the Linked Data community, as owl:equivalentClass
                                                                  has only 2 occurrences and there are none of
 Table 5: Common Data Types in Crawled Triples                    owl:equivalentProperty. Instead, the Linked Data movement
                                                                  uses owl:sameAs to simply “state that another data source
                                                                  also provides information about a specific non-information
                                                                  resource,” so leading owl:sameAs to tend to mean ‘more-or-
                                                                  less the same thing as’ [11]. This practice leads to the fear
Of these, the RDF vocabulary itself was the most popu-
                                                                  that the use of owl:sameAs would propagate too far, such
lar, with 109,300 URIs (45%), followed fairly closely by the
                                                                  that many URIs for the perhaps differing referents would be
RDF(S) vocabulary with 100,340 URIs (41%), and OWL
                                                                  declared identical [17].
being dwarfed by RDF and RDF(S) with only 34,136 URIs
                                                                     Both critiques of owl:sameAs appear to be wrong. Given
(14%). This does not mean that OWL is irrelevant to the
                                                                  the amount of Semantic Web URIs returned by the queries,
other corpus, as ontologies constructed with OWL could be
                                                                  while there is considerable use of owl:sameAs, it appears
deployed to model the concepts and entities employed in
                                                                  that the manual discovery and publication of co-referential
‘instance’ data. Yet while OWL has been an academic suc-
                                                                  URIs using owl:sameAs falls far behind the actual growth of
cess story, insofar as practical deployment, RDF terms and
                                                                  Linked Data. One could say that owl:sameAs is not being
RDF(S)-based inference seem to be the foundation of the
                                                                  used enough. The real problem is not that distinct things
Semantic Web in practice.
                                                                  are being given the same URI, but the reverse; namely that
   What precise URI-based terms are used in these knowl-
                                                                  it appears endemic that the same thing has multiple URIs.
edge representation languages? The top constructs in ei-
                                                                  So Berners-Lee’s hypothesis appears to be wrong: A single
ther RDF, RDF(S), or OWL in crawled triples are given in
                                                                  thing is likely identified by more than a single URI on the
Table 6. To summarize, RDF(S) class and sub-class rea-
                                                                  Semantic Web.
soning is very popular, with this construction consisting of
nearly half (48%) of knowledge representation use of the Se-
mantic Web. The second most popular use of knowledge                         73,451   30.31%    rdfs:Class
representation (22%) is for natural language annotation, de-                 47,044   19.30%    rdfs:comment
scribing a particular Semantic Web resource using natural                    44,113   18.10%    rdfs:subClassOf
language and connecting this natural language description to                 8,630    3.54%     owl:Ontology
the URI via the use of rdfs:comment or rdfs:label. There                     7,256    2.97%     rdfs:label
are surprisingly few (4%) actual ontologies in the crawled                   6,618    2.14%     rdf:Subject
Semantic Web resources. Furthermore, non-traditional fea-                    5,107    2.09%     owl:ObjectProperty
tures of RDF(S), such as the use of rdfs:property, are fre-                  3,642    1.49%     rdfs:subPropertyOf
quent occurrences. Even reification of RDF triples, officially               1,157    0.47%     owl:sameAs
discouraged by the Semantic Web community, accounts for                      535      0.29%     rdfs:range
only 95 triples, and there is also fairly heavy use of discour-
aged RDF constructs to represent different kinds of lists,        Table 6:    RDF and OWL Constructs in Crawled
such as rdf:Alt (349 occurrences) and rdf:Bag (344 oc-            Triples
currences). Lastly, while many Semantic Web researchers
originally hoped that the use of inverse functional proper-
ties would allow the merger of Semantic Web data, there
were zero explicitly declared usages of                              The top 10 Semantic Web vocabularies used in the crawled
owl:inverseFunctionalProperty. Overall, the usage of OWL,         triples, including those outside of the W3C-approved Seman-
RDF(S), and RDF terms in the corpus also follows to some          tic Web knowledge representation languages, are shown in
degree a power-law like distribution, where α equal to 1.5,       Table 7. The results should not be that surprising, in par-
with long tail behavior starting around 90, although the          ticular the vast dominance of DBPedia. Perhaps surprising
Kolmogorov-Smirnov D-statistic of .1911 reveals this to in-       is the surprising amount of usage of Cyc terms, as well as
significant. This is because while a few terms vastly dom-        terms from SKOS, the Simple Knowledge Organization Sys-
inate, the vast majority of other terms are not used at all.      tem of the W3C, whose primary source of deployment is the
This has reprecussions for both Semantic Web implementers         W3C’s export of WordNet to RDF [24]. FOAF is also signif-
and vocabulary specification within the W3C, since obvi-          icant, although not nearly as dominant as was found earlier
ously some level of concentration of effort upon the most         by Ding and Finin [16]. Also popular is YAGO (Yet Another
frequently-deployed terms would be reasonable.                    Global Ontology), a merger of WordNet and Wikipedia cat-
   One of the most popular OWL constructs is indeed the           egory hierarchies employed by DBPedia [30].
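
As a rough illustration of how vocabulary shares of this kind could be tallied, the following Python sketch groups URIs by namespace prefix and reports each group's count and percentage of the total. The prefix map, the helper names (vocabulary_of, vocabulary_shares), and the sample URIs are assumptions made for illustration; this is not the pipeline used in the study.

```python
from collections import Counter

# Assumed prefix map: the namespace URIs are the standard ones, but which
# prefixes the study grouped under each vocabulary is a guess.
VOCABULARIES = {
    "RDF": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "RDF(S)": "http://www.w3.org/2000/01/rdf-schema#",
    "OWL": "http://www.w3.org/2002/07/owl#",
    "FOAF": "http://xmlns.com/foaf/0.1/",
    "DBpedia": "http://dbpedia.org/",
}


def vocabulary_of(uri):
    """Return the name of the first vocabulary whose prefix matches the URI."""
    for name, prefix in VOCABULARIES.items():
        if uri.startswith(prefix):
            return name
    return "other"


def vocabulary_shares(uris):
    """Return (vocabulary, count, percentage) tuples, most frequent first."""
    counts = Counter(vocabulary_of(u) for u in uris)
    total = sum(counts.values())
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]


# Hypothetical sample standing in for the URIs found in crawled triples.
sample = [
    "http://dbpedia.org/resource/Edinburgh",
    "http://dbpedia.org/resource/Scotland",
    "http://dbpedia.org/ontology/City",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://example.org/thing",
]

for name, n, pct in vocabulary_shares(sample):
    print(f"{n:>8,}  {pct:6.2f}%  {name} URIs")
```

Matching on a fixed prefix keeps the tally simple, but it will place any URI from an unlisted namespace under "other", so the shares depend on how complete the prefix map is.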
            366,849   33.55%    DBpedia URIs
            109,300   9.99%     RDF URIs
            100,340   9.17%     RDF(S) URIs
            94,520    8.65%     Cyc URIs
            34,136    3.12%     OWL URIs
            6,563     0.60%     SKOS URIs
            4,728     0.43%     dblp.l3s.de
            3,263     0.29%     FOAF URIs
            2,170     0.20%     YAGO URIs
            1,836     0.16%     WordNet URIs

Table 7: Top Vocabulary URIs in Crawled Triples

5.   CONCLUSION
   The empirical analysis of Linked Data presented in this study is by no
means complete, for it is only a moderately small sample from one Semantic
Web search engine (and so may be hurt or helped by the idiosyncratic
search behavior of FALCON-S), although it is an important one, as this
sample is driven by Web search queries from actual users. The results of
this empirical analysis show a transformation from the first-generation
Semantic Web to the next-generation Web of Linked Data. The Semantic Web
as it existed in the first generation was a motley collection of RDF
triples, heavily dominated by a few exports of social networking data into
FOAF and a long tail of complex academically-produced ontologies. Linked
Data, at least the section of it that is of interest to users querying the
Web for information, is dominated heavily by DBpedia and consists
primarily of collections of triples that provide a minimal structure to
natural language [16].
   On the level of triples, there are some surprising conclusions. The
triples on the Semantic Web contain a vast range of data, and the exact
kinds of URIs used in the triples are somewhat unpredictable. However, the
vocabularies actually deployed are almost entirely a few large ones, such
as DBpedia, DBLP, WordNet, YAGO, and FOAF. This again points to a victory
of Berners-Lee’s idea that a few large vocabularies with well-defined
terms could dominate the Semantic Web [9]. The kinds of triples that
structured this data do not contain many OWL terms optimized for
inference, but consist almost entirely of relatively straightforward
RDF(S) expressions for sub-class relationships and for annotations in
natural language. Overall, Linked Data is primarily being used to provide
structured relationships between fragments of natural language, and not
for inference.
   One could argue that these results are more characteristic of FALCON-S
and DBpedia than of second-generation Linked Data as a whole. However, we
would respond that it is natural in decentralized information systems for
power-law distributions to evolve, in which one source of data massively
outweighs the others, and the ‘giant component’ of Linked Data is DBpedia
[5]. In fact, if such a ‘giant component’ and long tail were not observed,
it would be cause for suspicion. In conclusion, there is potentially a lot
of rich information of interest to ordinary Web search users available in
Linked Data form, and so one outcome of this analysis should be a greater
interest in Linked Data from even mainstream information retrieval
systems. However, for future work we wish to repeat this study over
Semantic Web search engines besides FALCON-S, since we recognize that
relying on a single engine is a major limiting factor. Second, there are
likely too many URIs in Linked Data for a given query, although to truly
substantiate this claim the URIs returned by the search engines should
ideally each be individually inspected, which is difficult in practice.
Yet even at this point it seems likely that there are many co-referential
URIs for the ‘same thing’ that are not explicitly modelled with
owl:sameAs, and unless action is taken this growth of URIs will continue
into the future. Unless there is URI re-usage, many of the data sources
for Linked Data are more like semantic islands than parts of
interconnected semantic continents.

6. ACKNOWLEDGEMENTS
   Harry Halpin was supported in part by a Microsoft “Beyond Search”
grant.

7. REFERENCES
 [1] B. Adida, M. Birbeck, S. McCarron, and S. Pemberton. RDFa in XHTML:
     Syntax and Processing. W3C Recommendation, W3C, 2008.
     http://www.w3.org/TR/rdfa-syntax/.
 [2] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak, and
     Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings
     of the International and Asian Semantic Web Conference
     (ISWC/ASWC2007), pages 718–728, Busan, Korea, 2007.
 [3] R. Baeza-Yates, L. Calderon-Benavides, and C. Gonzalez.
     Understanding user goals in web search. In Proceedings of String
     Processing and Information Retrieval (SPIRE), pages 98–109, 2006.
 [4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
     Addison Wesley-Longman, New York City, New York, USA, 1999.
 [5] A.-L. Barabasi, R. Albert, H. Jeong, and G. Bianconi. Power-law
     distribution of the World Wide Web. Science, 287:2115, 2000.
 [6] G. Beged-Dov, D. Brickley, R. Dornfest, I. Davis, L. Dodds,
     J. Eisenzopf, D. Galbraith, R. Guha, K. MacLeod, E. Miller,
     A. Swartz, and E. van der Vlist. RDF Site Summary (RSS) 1.0.
     Technical report, http://web.resource.org/rss/1.0/spec, 2001.
 [7] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and
     J. Morissette. Bio2rdf: Towards a mashup to build bioinformatics
     knowledge systems. Journal of Biomedical Informatics,
     41(5):706–716, 2008.
 [8] T. Berners-Lee. What the Semantic Web can represent, 1998. Informal
     Draft. http://www.w3.org/DesignIssues/rdfnot.html (Last accessed on
     Sept. 12th 2008).
 [9] T. Berners-Lee and L. Kagal. The fractal nature of the Semantic
     Web. AI Magazine, 29(3), 2004.
[10] P. Biron and A. Malhotra. XML Schema Part 2: Datatypes.
     Recommendation, W3C, 2004. http://www.w3.org/TR/xmlschema-2/.
[11] C. Bizer, R. Cyganiak, and T. Heath. How to publish Linked Data on
     the Web, 2007.
     http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ (Last
     accessed on May 28th 2008).
[12] C. Bizer and A. Seaborne. D2RQ: Treating non-RDF databases as
     virtual RDF graphs. In Proceedings of International Semantic Web
     Conference, 2004.
[13] G. Cheng, W. Ge, and Y. Qu. FALCONS: Searching and browsing entities
     on the semantic web. In Proceedings of the World Wide Web
     Conference, 2008.
[14] A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in
     empirical data, 2007. http://arxiv.org/abs/0706.1062v1 (Last
     accessed October 13th 2008).
[15] D. Connolly. Gleaning Resource Descriptions from Dialects of
     Languages (GRDDL). Technical report, W3C, 2007. Recommendation.
[16] L. Ding and T. Finin. Characterizing the Semantic Web on the Web. In
     Proceedings of the International Semantic Web Conference (ISWC),
     pages 242–257, 2006.
[17] A. Ginsberg. The big schema of things. In Proceedings of Identity,
     Reference, and the Web Workshop at the WWW Conference, 2006.
     http://www.ibiblio.org/hhalpin/irw2006/aginsberg2006.pdf.
[18] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user
     behavior in www search. In SIGIR ’04: Proceedings of the 27th annual
     international ACM SIGIR conference on Research and development in
     information retrieval, pages 478–479, New York, NY, USA, 2004. ACM.
[19] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the size
     of the Semantic Web? In Proceedings of Conference on Semantic
     Systems (iSemantics), Graz, Austria, 2008.
     http://tomheath.com/papers/hausenblas-isemantics2008-size-of-semantic-web.pdf.
[20] D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the
     TREC-8 web track. In Proceedings of the Text REtrieval Conference
     (TREC), pages 131–150. ACM, 2000.
[21] B. J. Jansen, D. L. Booth, and A. Spink. Determining the
     informational, navigational, and transactional intent of web
     queries. Information Processing and Management, 44(3):1251–1266,
     2008.
[22] D. Lenat. Cyc: Towards programs with common sense. Communications of
     the ACM, 8(33):30–49, 1990.
[23] A. Mikheev, C. Grover, and M. Moens. Description of the LTG system
     used for MUC. In Seventh Message Understanding Conference:
     Proceedings of a Conference, 1998.
[24] A. Miles and S. Bechhofer. SKOS Simple Knowledge Organization System
     reference. Working draft, W3C, 2008.
     http://www.w3.org/TR/skos-reference/.
[25] M. Newman. Power laws, Pareto distributions and Zipf’s law.
     Contemporary Physics, 46:323–351, 2005.
[26] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and
     G. Tummarello. Sindice.com: a document-oriented lookup index for
     open linked data. International Journal of Metadata, Semantics, and
     Ontologies, 3(1):37–52, 2008.
[27] M. Paşca. Weakly-supervised discovery of named entities using web
     search queries. In Proceedings of the sixteenth ACM conference on
     Conference on information and knowledge management (CIKM), pages
     683–690, New York, NY, USA, 2007. ACM.
[28] L. Sauermann and R. Cyganiak. Cool URIs for the Semantic Web.
     Technical report, W3C Semantic Web Interest Group Note, 2008.
     http://www.w3.org/TR/cooluris/.
[29] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of
     a very large web search engine query log. SIGIR Forum, 33(1):6–12,
     1999.
[30] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic
     knowledge. In Proceedings of the 16th International Conference on
     World Wide Web, pages 697–706, New York, NY, USA, 2007. ACM.
[31] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, and R. Studer.
     Semantic wikipedia. In Proceedings of the International Conference
     on World Wide Web (WWW), pages 585–594, New York, NY, USA, 2006.
     ACM.
[32] D. Watts and S. Strogatz. A review of ontology based query
     expansion. Nature, 6684(393):409–410, 1998.
[33] C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. H. Ungar. Web-scale
     named entity recognition. In Proceedings of Conference on
     Information and Knowledge Management, pages 123–132. ACM, 2008.