Weaving the Pedantic Web

Aidan Hogan†, Andreas Harth†‡, Alexandre Passant†, Stefan Decker†, Axel Polleres†
† Digital Enterprise Research Institute, National University of Ireland, Galway
‡ AIFB, Karlsruhe Institute of Technology, Germany

†{firstname.lastname}@deri.org, ‡harth@kit.edu


ABSTRACT

Over a decade after RDF was published as a W3C recommendation, publishing open and machine-readable content on the Web has recently received a lot more attention, including from corporate and governmental bodies; notably thanks to the Linked Open Data community, there now exists a rich vein of heterogeneous RDF data published on the Web (the so-called "Web of Data") accessible to all. However, RDF publishers are prone to making errors which compromise the effectiveness of applications leveraging the resulting data. In this paper, we discuss common errors in RDF publishing and their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.

1. INTRODUCTION

Based on the simple principle of using URIs to name and link things – not just documents – the Resource Description Framework (RDF) offers a standardised means of representing information on the Web such that: (i) structured data is available to all over the Web; (ii) data can be handled through standard APIs and applications; (iii) the meaning of the data is well-defined using lightweight ontologies (or vocabularies); and (iv) data is interoperable with other RDF on the Web and can be re-used and extended by other publishers and application developers.

Over the past few years, many Web publishers have turned to RDF as a means of disseminating information in an open and machine-interpretable way, resulting in a "Web of Data" which now includes interlinked content exported from corporate bodies (e.g., BBC, New York Times, Freebase), community efforts (e.g., Wikipedia, GeoNames), biomedical datasets (e.g., DrugBank, Linked Clinical Trials) – even UK governmental entities, where public sector organisations must now additionally disclose their consultations in RDF¹. Applications and search engines are now starting to exploit this rich vein of structured and linked data [9].

   ¹ http://coi.gov.uk/guidance.php?page=315

For example, Figure 1 shows the results returned by the VisiNav (http://visinav.deri.org/) system for the query "show me all American female models who have also won an Academy Award for Best Supporting Actress", constructed using facets in the user-interface; the results are automatically aggregated from thirteen distinct sources.

Figure 1: Results from VisiNav showing American models who have won Best Supporting Actress in the Academy Awards.

However, all has not been plain sailing: this new paradigm in Web publishing and interaction [11] has inevitably led to many teething problems. As we will discuss in this paper, there exists a lot of noise within the Web of Data which inhibits applications from effectively exploiting this rich lode of open, well-defined and structured information.

To illustrate, we introduce Alice: a hypothetical end-user of an application for searching and browsing the Web of Data. Alice loads some interesting data about herself and is immediately impressed by the integrated view of data from publication, blog, social network and workplace exporters; however, for every second resource she explores, the application cannot locate or parse any relevant data. She tries to load her publications into a calendar view, but one quarter of them are missing as the dates/times contain illegal values. She wants more information relating to properties and classes used to describe herself, but some do not exist; discouraged, she clicks on a friend of hers but finds that he has 1,169 names and email addresses (she knew him as "Bob"). She begins to notice that all resources she explores are instances of nine strange properties – and then the final straw: she now finds out that her professor is actually a document.

We will provide evidence in this paper as to how Alice could have had such an experience browsing the Web of Data. In so doing, we will take stock of some of the difficulties currently apparent in RDF publishing, and discuss how we – and the now decade-old Semantic Web community at large – can help to improve the current and future quality of RDF data published on the Web.

We would like to acknowledge and thank Richard Cyganiak, Michael Hausenblas, Stéphane Corlosquet and Antoine Zimmermann, with whom we co-founded the Pedantic Web Group. We would also like to thank anonymous reviewers of various incarnations of this paper for their valued feedback. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET Postgraduate scholarship.

Copyright is held by the author/owner(s).
LDOW2010, April 27, 2010, Raleigh, USA.
2. WEB OF DATA ANALYSIS

We herein present some analysis based on an RDF dataset retrieved from the Web in April 2009 using MultiCrawler [8]. We performed a seven-hop breadth-first crawl for RDF/XML documents where we enforced a maximum of 5,000 crawled documents per pay-level-domain (or PLD, viz.: a domain that requires payment, such as deri.ie or data.gov.uk) so as to ensure a diverse Web dataset covering a wide spectrum of publishers. Indeed, we only crawled for RDF/XML and not for other formats such as RDFa; RDF/XML is currently by far the most popular format, with RDFa growing in popularity. Still, one could expect a small percentage of documents to contain, e.g., RDFa metadata, which we admittedly overlook in our illustrative statistics.

The crawl accessed 149,057 URIs (including 39,439 redirects), 54,836 (36.8%) of which resulted in valid RDF/XML documents (almost precisely 50% excluding redirects). The final dataset contains 12,534,481 RDF statements mentioning 1,598,521 URIs – including 5,850 classes and 9,507 properties.

Based on this dataset, we present selected issues in RDF data published on the Web. We focus on errors that we can systematically detect, and thus one should not consider the following an exhaustive list; similarly, it is important to note – given the diminutive scale and perhaps even age of our dataset – that the statistics presented herein are intended to be illustrative, not exhaustive. That said, we still claim that the analysis of our dataset offers a valuable insight into current issues relating to RDF Web publishing: although extrapolating the exact prevalence of such problems to the entire Web of Data may not be sensible, our statistics should offer an indication as to the relative and approximate prevalence of such problems.

Throughout the paper, we endeavour to present examples of the various publishing errors by giving links to RDF Web documents exhibiting them. Note that the purpose of providing these examples is to: (i) give concrete and tangible examples of the errors, giving indications as to how they might have occurred, how they might be presently solved, and how they could be avoided in future; (ii) show that noise is present in a diverse range of sources, describing a diverse range of domains; and (iii) show that errors in RDF publishing are not only the result of inexperience – we show examples of errors in academic publishing, community-based publishing, popular vocabularies, and even documents published by the authors of this paper. The purpose of the examples is thus not to "point the finger", but to give an honest appraisal of such issues so as to identify possible directions forward.

For posterity, we provide snapshots of documents and enumerate the namespace prefixes referenced in this paper at http://aidanhogan.com/pedantic/.

In order to structure the highlighted issues, we identify four categories of symptoms:

   • incomplete: equatable to a dead-link in the current HTML web – a software agent will not be able to retrieve data relevant to a particular task;

   • incoherent: a software agent will not be able to correctly interpret some local piece of data as either the publisher or vocabulary maintainer would expect;

   • hijack: a software agent will not be able to correctly interpret some remote piece of data as would be expected;

   • inconsistent: a software agent will interpret a contradiction in the data.

Finally, we will also endeavour to provide discussion for each issue, both from the perspective of publishers and from the perspective of data consumers.

We begin with issues relating to how data is found and accessed; then discuss parsing and syntax issues; look at reasoning issues, including inconsistent data; and finally, introduce and discuss ontology hijacking.

2.1 URI/HTTP: accessibility and dereferencability

As previously alluded to, the Linked Open Data movement has been integral to RDF publishing on the Web, emphasising four basic principles [2]: (i) use URIs as names for things; (ii) use HTTP URIs so that those names can be looked up; (iii) provide useful information when a look-up on that URI is made; and (iv) include links using external URIs.

With regard to providing information about a resource upon an HTTP lookup of its URI – called dereferencing – emphasis is placed on providing information in RDF and disambiguating identification of information resources (document URIs) from non-information resources (entities described in those documents). Now, using statistics of our crawl, which consisted of lookups on URIs in the data, we can draw some initial conclusions relating to Linked Data practices on the Web of Data.

2.1.1 Dereferencability issues

Category: incomplete

In accordance with "use HTTP URIs so that those names can be looked up", dereferencing a URI consists of retrieving content as defined by RFC3986².

   ² http://labs.apache.org/webarch/uri/rfc/rfc3986.html

Firstly, 5.3% of URIs returned an error (4xx client error/5xx server error) response code, in conflict with the third Linked Data principle above: "provide useful information when a look-up on that URI is made". In most such – admittedly relatively rare – cases, simply nothing exists at that location and a 404 Not Found code is returned (4.3% overall, 81% of error codes).

Secondly, 26.5% of URI lookups resulted in a redirect (30x code). In fact, Linked Data principles encourage the use of redirects, particularly for identifying non-information resources (i.e., URIs which denote things rather than files): specifically, the 303 redirect is recommended. Of the redirection URIs, 55.1% (14.6% of total) offered a 303 redirect to another location as recommended; however, 30.2% (8% of total) used a 302 redirect and the remaining 14.7% (3.9% of total) used a 301 redirect.

In the machine-oriented world of Linked Data, publishers should be even more careful to avoid broken links and to make URIs dereferencable, thus enabling automatic data-access for Semantic Web applications and providing them – and ultimately end-users – a complete, coherent picture.

Publisher Recommendations: Publishers should carefully follow Linked Data best practices when "minting" URIs.

Consumer Recommendations: Applications should not expect high recall when dereferencing URIs found in RDF data: for high recall, applications may have to consider pre-fetching/data-warehousing approaches.

2.1.2 No structured data available

Category: incomplete

Excluding redirects, 92.8% of URIs return a 200 OK response code along with content; but what do these documents contain? Linked Data principles require that useful
data be returned upon lookup of a URI from the Web of Data, with particular emphasis on returning RDF. Thus, from our crawl requesting application/rdf+xml content, we would reasonably expect a high percentage of documents returning RDF/XML³.

   ³

Of the 101,709 URIs which returned content with response code 200 OK, we observed that only 45.4% of URIs report a content-type application/rdf+xml, with a further 34.8% reporting text/html. Commonly in RDF data, information resource URIs are used to identify themselves (or more problematically to identify related resources); for example, in RDF, HTML documents are naturally identified using their native URI. In almost all instances of a non-RDF content-type, the URI is simply a document without any supporting RDF metadata. Hence, as before, Semantic Web agents will not be able to properly exploit the content as expected by end-users.

Publisher Recommendations: HTML pages – especially those whose URIs are mentioned in RDF documents – could be embedded with RDFa.

Consumer Recommendations: A possible – and admittedly quick and dirty – solution to avoid dead-links would be to convert the header information of HTTP URIs into RDF using the terms from the W3C published "HTTP Vocabulary in RDF 1.0"⁴. More ambitiously, a system may consider extracting RDF from non-RDF content, such as the title of an HTML page or metadata for images. Such measures would ensure that, at the very least, some structured information can be retrieved for a wider variety of URIs, thus avoiding 'dead-links'.

   ⁴ http://www.w3.org/TR/HTTP-in-RDF10/

2.1.3 Misreported content-types

Category: incomplete

An HTTP response contains an optional header field stating the content type of the returned file. A consumer application can then decide from the header whether the content is suitable for consumption, and whether the content should be accessed. However, we observed that RDF/XML content is commonly returned with a reported content-type other than application/rdf+xml: from our crawl, 16.9% of valid RDF/XML documents were returned with an incompatible or more generic content type; e.g., text/xml (9.5%), application/xml (5.9%), text/plain (1%) and text/html (0.4%).

Publisher Recommendations: Publishers should ensure that the most specific available MIME-type is reported for their content.

Consumer Recommendations: Herein, a trade-off exists for consumer agents: an agent with emphasis on performance may still use the reported content-type to filter non-supported content formats, whereas an agent with more emphasis on recall should relax – or possibly ignore – filtering based on reported content-type.

2.2 Syntax errors

2.2.1 RDF/XML Syntax Errors

Category: incomplete

At the outset of the Semantic Web movement, publishers opted to employ the existing XML standard to encode RDF; RDF/XML is still the most popular means of publishing RDF today. Although its syntax is quite complex, we encountered relatively few syntax errors in RDF/XML documents accessed during our crawl. Of the 46,136 documents which return response code 200 and content-type application/rdf+xml, only 571 (1.2%) were invalid RDF/XML documents; these were usually caused by simple errors such as unescaped special characters, misuse of RDF/XML shortcuts, and omission of namespace. Again, such issues are relatively rare, presumably due to use of mature RDF/XML APIs for producing data and the popularity of the W3C RDF/XML validation service⁵.

   ⁵ http://www.w3.org/RDF/Validator/

Publisher Recommendations: Publishers should use an appropriate syntactic validator for their content, or only use trusted APIs to produce content.

Consumer Recommendations: Applications could possibly investigate the use of tools for fixing syntax errors: e.g., use standard XML syntax cleaning tools for XML-based RDF syntax. We have no experience in using such tools, and they would have to be evaluated in the given application scenario: again, in any case, syntax errors are admittedly rare and such concerns would only apply to applications with a large emphasis on high recall.

2.3 Reasoning: noise and inconsistency

Thus far, we have seen that about half of the URIs used to identify resources in the Web of Data resolve to some valid RDF/XML data. We now look at issues relating to the data contained within those documents: i.e., what they say and how the machine interprets the data.

Layered on top of RDF are the core RDF Schema (RDFS) and Web Ontology Language (OWL) standards, which allow for defining the semantics or meaning of RDF data through definitions of classes and properties in schemas/ontologies. For example, the Friend Of A Friend (FOAF)⁶ project publishes OWL definitions of a set of classes and properties which forms a structured and popular vocabulary for describing people in RDF.

   ⁶ http://foaf-project.org/

Classes represent a grouping of resources: e.g., FOAF defines the class foaf:Person and one can assign ex:Alice and ex:Bob as members of this class. Using RDFS and OWL, a publisher can then define characteristics of such classes (and, thus, of all of their members); e.g., by defining foaf:Person as a subclass of foaf:Agent, FOAF implies that ex:Alice, ex:Bob and all other foaf:Persons are also members of the class foaf:Agent.

Properties represent the definable attributes of resources, and also relationships that are possible between resources; e.g., FOAF defines foaf:knows as a relationship that can exist from one member of foaf:Person to another, or that members of foaf:Person can have the attribute foaf:surname which has a string value. Other publishers across the Web can then re-use and extend definitions of classes and properties – such as the ones from FOAF.

Thereafter, reasoning can use the semantics of these classes and properties to interpret the data, and to infer new knowledge (e.g., that ex:Alice is also a foaf:Agent, or that if ex:Alice foaf:knows ex:Bob, then ex:Alice and ex:Bob are foaf:Persons).

Some errors in RDF only reveal themselves after reasoning – e.g., some unforeseen incorrect inferences occur – and as such, can stay hidden from the publisher. In this section, we will look at issues relating to the interpretation of RDF data on the Web – in particular focussing on reasoning issues; in order to shed light on such issues, we applied reasoning over our crawl using the Scalable Authoritative OWL Reasoner (SAOR) [12], which we will discuss as pertinent.

2.3.1 Atypical use of collections, containers and reification
Category: incoherent

There is a set of URI names which are reserved by the RDF specification for special interpretation in a set of triples; although the RDF specification does not formally restrict usage of these reserved names, misuse is often inadvertent.

We firstly discuss the RDF collection vocabulary, which consists of four constructs: {List, first, rest, nil}. Indeed, few examples of atypical collection usage exist on the Web, probably attributable to widespread usage of the RDF/XML shortcut rdf:parseType="Collection" for specifying collections; this shortcut shields users from the underlying complexity of collections on the triple level and generally ensures typical collection use. The only atypical collection usage we found in our Web-crawl was one document which specified resources of type List without first or rest properties attached⁷.

   ⁷ http://scripts.mit.edu/~kennylu/myself.rdf

A related issue is that of atypical container usage, which is concerned with the following constructs: Alt, Bag, Seq, _1 ... _n and the syntactic keyword li. Again, atypical container usage is uncommon on the Web: we found one domain (viz. semanticweb.org⁸) which, in 229 documents, exports RDF containers without choosing a type of Alt, Bag or Seq.

   ⁸ cf. http://iswc2006.semanticweb.org/submissions/Harth2006dq_Harth_Andreas

Finally, there may exist atypical usage of the reification constructs: Statement, subject, predicate, object. However, in our dataset we only found one such example⁹, wherein predicate is assigned a blank node value and used alone without subject or object.

   ⁹ http://web.mit.edu/dsheets/www/foaf.rdf

Publisher Recommendations: Where possible, publishers should abide by the standard usage of such RDF terms to enable interoperability.

Consumer Recommendations: Although we found that atypical usage of the core RDF terms is relatively uncommon, consumer applications should be tolerant of such atypical usage; for example, developers of reasoning engines which operate over Web data and consider RDF collections as part of complex OWL class descriptions should implement simple checks to ensure that the respective engine is tolerant to cyclic, non-terminating and branching collection descriptions (even though we did not find such usage in our dataset).

2.3.2 Use of undefined classes and properties

Category: incoherent

Oftentimes on the Web of Data, properties and classes are used without any formal definition. For example, publishers might say that ex:Alice ex:colleague ex:Bob even though ex:colleague is not defined as a property. Again, although such practice is not prohibited, by using ad-hoc undefined classes and properties publishers make automatic integration of data less effective and forego the possibility of making inferences through reasoning.

From our crawl, 1.78M triples (14.3% of all triples) use undefined properties, appearing in 39.7k documents (72.4% of valid RDF/XML documents): Table 1 enumerates the top five.¹⁰

   ¹⁰ It is important to note that herein, when we mention "undefined" classes or properties, we loosely refer to classes or properties "not defined in our crawl". In any case, our crawl would contain any property- or class-descriptions published according to best practices (i.e., using dereferencable terms).

   Undefined Property       Triples Used
   foaf:member_name              148,251
   foaf:tagLine                  148,250
   foaf:image                    140,791
   cycann:label                  123,058
   qdoslf:neighbour              100,339

Table 1: Count of the top five properties used without a definition

For example, from our crawl, the livejournal.com domain uses the properties foaf:member_name and foaf:tagLine in a total of almost 300k triples¹¹ – the FOAF vocabulary does not contain these properties and they are not defined elsewhere; such a practice of deliberately inventing undefined properties within a related namespace is common on the Web. Sometimes publishers make simple spelling mistakes: again, the property foaf:image is incorrectly used instead of foaf:img in the livejournal.com domain; to take another example, the term qdoslf:neighbour is commonly used – in 100k triples – instead of the property qdoslf:neighbours defined in the namespace.¹²

   ¹¹ cf. http://danbri.livejournal.com/data/foaf
   ¹² cf. http://foafbuilder.qdos.com/people/danbri.org/foaf.rdf

Similarly, there were 1.01M triples (8.1%) mentioning undefined classes in 21.3k documents (38.8%); the top five such instantiated classes are enumerated in Table 2. Neither the first three classes nor the last class are defined in the dereferenced documents; for example, all of the sioc:UserGroup instances come from the apassant.net domain¹³. To take another example, the class politico:Term is generically described in the dereferenced document, but is neither implicitly nor explicitly typed as a class.

   ¹³ cf. http://apassant.net/home/2007/12/flickrdf/data/people/36887937@N00 – indeed the authors herein are also prone to making simple errors in their publishing.

   Undefined Class          Triples Used
   sioc:UserGroup                 21,395
   rss:item                       19,259
   linkedct:link                  17,356
   politico:Term                  14,490
   bibtex:inproceedings           11,975

Table 2: Count of the top five classes used without a definition

Publisher Recommendations: Many such errors are unintentional and due to spelling or syntactic mistakes resolvable through minor fixes to the respective ontologies or exporters. Where terms have been knowingly invented, we suggest that the term be recommended as an addition to the respective ontology – or defined in a separate namespace – to enable re-use.

Consumer Recommendations: Liberal consumer applications could, for example, use fuzzy string matching techniques – e.g., Levenshtein distance measures – between undefined classes and properties encountered in the data, and classes and properties defined in the vocabularies. Generally, however, consumer applications can overlook such mistakes and simply accept the consequence of incomplete reasoning for triples using such undefined terms.

2.3.3 Misplaced classes/properties

Category: incoherent

Sometimes, a URI defined as a class is used as a property (appears in the predicate position of a triple) or, conversely, a URI defined as a property is used as a class (appears in the object position of an rdf:type triple); although not prohibited, such usage is usually inadvertent and can ruin the machine-interpretation of the associated data.
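The check described in Section 2.3.3 can be sketched in a few lines. The following is a minimal illustration and not the procedure used for the crawl analysis: it assumes triples are available as plain (subject, predicate, object) tuples of prefixed names, and that the sets of defined classes and properties have already been collected from the vocabularies. The function name, the sample triples and the sample term sets are all hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def find_misplaced_terms(triples, defined_classes, defined_properties):
    """Count classes used as predicates and properties used as rdf:type objects.

    All three inputs are assumptions of this sketch: `triples` is an iterable
    of (subject, predicate, object) string tuples, and the two sets hold the
    terms that the vocabularies define as classes and as properties.
    """
    classes_as_predicates = Counter()
    properties_as_classes = Counter()
    for _subject, predicate, obj in triples:
        # A term defined as a class appears in the predicate position.
        if predicate in defined_classes:
            classes_as_predicates[predicate] += 1
        # A term defined as a property appears as the object of rdf:type.
        if predicate == RDF_TYPE and obj in defined_properties:
            properties_as_classes[obj] += 1
    return classes_as_predicates, properties_as_classes


# Hypothetical sample data, for illustration only.
triples = [
    ("ex:Alice", "rdf:type", "foaf:Person"),
    ("ex:Alice", "foaf:Image", "ex:alice.jpg"),  # class used as a property
    ("ex:Bob", "rdf:type", "foaf:knows"),        # property used as a class
    ("ex:Alice", "foaf:knows", "ex:Bob"),
]
defined_classes = {"foaf:Person", "foaf:Image"}
defined_properties = {"foaf:knows", "foaf:name"}

as_predicates, as_classes = find_misplaced_terms(
    triples, defined_classes, defined_properties
)
```

A term that some document defines as both a class and a property would be flagged on every use by this sketch, so a practical check would also need to decide which definitions to trust.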
   Class                    # Misplaced
   rdfs:range                     8,012
   foaf:Image                       639
   rdfs:Class                        94
   wot:PubKey                        18
   foaf:OnlineAccount                15

Table 3: Top five "classes" used in the predicate position of a triple

   Property                 # Misplaced
   foaf:knows                         4
   foaf:name                          4
   foaf:sha1                          2
   swrc:author                        1
   foaf:based_near                    1

Table 4: Top five properties found in the object position of an rdf:type triple

   D.type Prop.       # Non-literal   % Non-literal
   swrc:journal              19,853           97.8%
   swrc:series               14,963           97.3%
   ical:location                  4            2.6%
   foaf:name                      4              0%
   foaf:msnChatID                 3            0.4%

Table 5: Top five datatype-properties used with non-literal values

   Obj. Prop.             # Literal   % Literal
   affy:startsAt              6,234        100%
   affy:stopsAt               6,234        100%
   affy:cdsType               5,193        100%
   affy:frame                 4,882        100%
   affy:commonToAll           4,814        100%

Table 6: Top five object-properties used with literal values
Table 3 shows the top five classes used as a property in our crawl. In fact, rdfs:range is a core RDFS property, but is defined in one document¹⁴ as a class; hence the 8,012 occurrences are valid use of the property and the single declaration of rdfs:range as a class is at fault (this is also an instance of ontology hijacking, which we will discuss in Section 2.4). Most occurrences of the foaf:Image class used as a property stem from the sembase.at domain¹⁵; here the foaf:depiction property would be more suitable. Use of rdfs:Class as a property comes from the ajft.org and rdfweb.org domains¹⁶ where rdfs:Class is seemingly mistaken as rdf:type. The class wot:PubKey is mistakenly used instead of wot:hasKey¹⁷. Misuse of foaf:OnlineAccount stems from one document¹⁸ wherein the RDF/XML shortcut rdf:parseType="Resource" is used inappropriately, causing parsing of foaf:OnlineAccount elements as predicates.

After reasoning, more such errors were discovered, particularly in the affymetrix.com domain¹⁹ which describes genes and mistakenly uses rdfs:subClassOf to assert subsumption relations between properties (amongst many other issues); this resulted in properties – which, combined, were used in 37,454 triples – being typed as classes.

Conversely, the usage of properties in the class position – viz. the object position of an rdf:type triple – is much less common; Table 4 lists the results, with most errors stemming from one document²⁰.

Publisher Recommendations: Again, all such errors could easily be fixed by the publishers once they are made aware. Many of the above encountered errors were as a result of misuse of RDF syntactic terms, such as rdf:parseType="Resource", or more generally as syntactic mistakes in their documents: thus, publishers should not only ensure that their documents are syntactically valid, but also that they parse into the triples expected.

Consumer Recommendations: Applications which incorporate reasoning should consider foregoing standard inferences which rely on the position of a term in a triple to infer that the term is a class or property – for example, rule rdf1 in RDFS [10]. Aside from this, consumer applications will probably have to accept incomplete inferencing over such erroneous triples.

 2.3.4    Misuse of owl:DatatypeProperty/owl:ObjectProperty
Category: incoherent

The built-in term owl:DatatypeProperty describes properties which relate some resource to a literal value, i.e., an "attribute" property (in terms of Object-Oriented Programming); similarly, the OWL term owl:ObjectProperty describes properties which relate one resource to another (i.e., a "relation" property). Oftentimes, attribute properties are used between two resources, and relation properties are used with literal values.

From our crawl, we found a total of 34.8k triples (0.3%) with datatype-properties given non-literal objects (in 1,194 [2.2%] documents across 9 domains). Table 5 lists the top five; the only significant errors stem from l3s.de²¹ which exports RDF from the Digital Bibliography & Library Project (DBLP) – they define two datatype-properties in the swrc: namespace but only use the properties with non-literal objects.

Analogously, there were 41.7k triples (0.3%) with object-properties given literal values (in 4,438 [8%] documents from 91 domains). Table 6 lists the top five; many such occurrences come from the affymetrix.com domain which commonly uses five different object-properties with literal values (in a total of 27.4k triples from our crawl). However, there were many other such properties with significant misuse including miscellaneous properties from the opencyc.org domain (6,161), foaf:page (3,160), foaf:based_near (1,078), ical:organizer (456), amongst others; again, the errors were spread over 92 different domains. In fact, the property foaf:myersBriggs (in the popularly used FOAF specification itself) was until recently incorrectly defined as an owl:ObjectProperty with rdfs:range rdfs:Literal and had 35 literal values in our dataset.

Publisher Recommendations: Where datatype- or object-property constraints are erroneously specified – e.g., swrc:journal, swrc:series, foaf:myersBriggs – they can simply be reversed by the ontology maintainers. However, in many cases such constraints are purposefully defined to ensure consistent usage of the term; in this case, the onus is on publishers to abide by them.

¹⁴ http://www.w3.org/2000/10/swap/infoset/infoset-diagram.rdf
¹⁵ cf. http://wiki.sembase.at/index.php/Special:ExportRDF/Dieter_Fensel
¹⁶ cf. http://swordfish.rdfweb.org/discovery/2004/01/www2004/files/1101776794087.rdf
¹⁷ cf. http://www.snell-pym.org.uk/alaric/alaric-foaf.rdf
¹⁸ cf. http://tommorris.org/foaf
¹⁹ cf. http://affymetrix.com/community/publications/affymetrix/tmsplice/all_genes.1.rdf
²⁰ http://www.marconeumann.org/foaf.rdf
²¹ cf. http://dblp.l3s.de/d2r/data/publications/conf/aswc/HoganHP08
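The two kinds of misuse described in Section 2.3.4 can be detected with a simple pass over the data. The following is a minimal sketch under stated assumptions, not the analysis code used for the crawl: literals are modelled by a small hypothetical `Literal` class, resources by plain strings, and the sets of datatype- and object-properties are assumed to have been read from the vocabularies beforehand. The sample triples are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal; resources are plain strings."""
    lexical: str

def property_kind_clashes(triples, datatype_props, object_props):
    """Count datatype-properties with resource objects and object-properties with literals."""
    datatype_with_resource = Counter()
    object_with_literal = Counter()
    for _subject, predicate, obj in triples:
        is_literal = isinstance(obj, Literal)
        if predicate in datatype_props and not is_literal:
            datatype_with_resource[predicate] += 1
        elif predicate in object_props and is_literal:
            object_with_literal[predicate] += 1
    return datatype_with_resource, object_with_literal

# Invented sample data; only the property names are taken from the text.
triples = [
    ("ex:paper1", "swrc:journal", "ex:journal7"),            # datatype-property, resource object
    ("ex:alice", "foaf:page", Literal("http://example.org")),  # object-property, literal object
    ("ex:alice", "foaf:name", Literal("Alice")),             # consistent usage
    ("ex:alice", "foaf:knows", "ex:bob"),                    # consistent usage
]
datatype_props = {"swrc:journal", "foaf:name"}
object_props = {"foaf:page", "foaf:knows"}

dt_clashes, obj_clashes = property_kind_clashes(triples, datatype_props, object_props)
```

Dividing each count by the total number of triples using the property gives the percentage columns of Tables 5 and 6.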
Consumer Recommendations: Applications would typically use such constraints for form generation in the context of instance data creation. Liberal versions of such applications may decide to automatically reverse such constraints, where – in examples such as the affy: properties above – all usage is contrary to the specified constraint. Indeed, some weighting scheme may be adopted for examples – such as the swrc: properties – where most usage is contrary to the vocabulary constraint. Again, such approaches would admittedly need evaluation in the setting of the given application.

 2.3.5    Members of deprecated classes/properties
Category: incoherent

Briefly, the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty are used to indicate classes or properties that are no longer recommended for use: vocabulary publishers usually assert deprecation for classes or properties which have been considered to be obsoleted by more popular terms in local or remote vocabularies, or perhaps even where the original term is contrary to some naming scheme or considered outside of the scope of the vocabulary. In our dataset, we did not find any members of a deprecated class; however, we found 290 instances (in 115 documents) of four deprecated properties: wordmap:subCategory (260), sioc:has_group (15), sioc:content_encoded (10) and sioc:description (5).

Publisher Recommendations: Publishers of instance data should intermittently verify that no terms used have since been considered deprecated by the vocabulary maintainer, and should take appropriate action to use – possibly novel – recommended terms where possible.

Consumer Recommendations: Applications could consider specifying manual mappings from deprecated terms to compatible terms now recommended for use. Less liberal applications may consider omitting triples which use deprecated terms. Generally, however, usage of deprecated terms does not require special treatment.

 2.3.6    Bogus owl:InverseFunctionalProperty values
Category: incoherent/hijack

Aside from URIs – which can be hard to agree upon – resources are also commonly identified by values for properties which uniquely identify a resource; such keys are pre-existing and easier to agree upon. These properties are termed "inverse-functional" and are identified in OWL with the term owl:InverseFunctionalProperty. If two resources share a common value for one of these properties, reasoning will view these resources as equivalent (referring to the same resource). For example, the FOAF ontology has defined a number of inverse-functional properties for identifying people; these include foaf:homepage, foaf:mbox (email), foaf:mbox_sha1sum (sha1 encoded email to prevent spamming), amongst others. Herein, FOAF holds the intuition that the values for such properties should be unique to an individual, and that the usage of such properties should reflect that (i.e., foaf:mbox should only be used for personal and unshared email-addresses).

However, FOAF exporters commonly do not respect the semantics of these inverse-functional properties and export 'void' values given partial user-input. The most widespread example is 08445a31a78661b5c746feff39a9db6e4e2cc5cf, which is the SHA1 hash of 'mailto:' and is commonly assigned by FOAF exporters – as values for foaf:mbox_sha1sum – to users who don't specify an email in some input form.²²

Now, all such users can be interpreted as equivalent – i.e., representing the same real-world person – according to the semantics of the foaf:mbox_sha1sum property. This problem is quite widespread: even in our diminutive crawl, 52 hosts contribute 1,169 different bogus values in 1,041 documents. For example, 194 errors come from the bleeper.de domain²³, 189 from identi.ca²⁴, 166 from uni-karlsruhe.de²⁵, 163 from twit.tv²⁶ and 92 from tweet.ie²⁷; Table 7 details the top five void values for inverse-functional properties which we found in our dataset.

According to the standard reflexive, symmetric and transitive semantics of equality (represented in RDF by the equality relation owl:sameAs), if we take for example the 986 entries with the same null sha1 value, 986² ≈ 972k owl:sameAs relations would be inferred. Further, assuming, for example, an average of eight triples mentioning each equivalent resource, 972k × 8 ≈ 7.8M statements would be inferred by substituting each equivalent identifier into each statement. In other words, such chains of equality cause a quadratic explosion of inferences; when one considers larger Web-crawls, the problem becomes quite critical.

Publisher Recommendations: For publishers, the issue is easily resolved by, for example, validating user input and checking the uniqueness and validity of inverse-functional values. Conversely, vocabulary maintainers should be careful to clearly state that a property is inverse-functional in the human-readable specification, and select labels for property URIs which give an indication of the inverse-functional nature of the property – for example, choose the label ex:personalMbox over ex:mbox.

Consumer Recommendations: A simple solution commonly used by reasoning agents is to blacklist void values. Although an exhaustive list of blacklist candidates may be difficult to derive, the above values would – in our experience – constitute most of the void values. Other heuristics may be employed to ensure correct equality reasoning – for example, use of a disambiguation step to quickly remove obviously incorrect equality inferences.

 2.3.7    Malformed datatype literals
Category: incoherent

In RDF, a subset of well-defined XML datatypes are used to provide structure and semantics to literal (string) values. For example, string date values can be specified using the xsd:date datatype, which provides a lexical syntax for date strings and a mapping from date strings to date values interpretable by an application. From the content of the crawl, we found 3,666,840 literals of which 170,351 (4.6%) used a datatype. Of these, the top five most popular datatypes were xsd:string (53,879), xsd:nonNegativeInteger (38,501), xsd:integer (15,826), xsd:dateTime (15,824), and xsd:unsignedLong (12,318).

Unfortunately, incorrect use of datatypes is relatively common in the Web of Data. Firstly, datatype literals can be malformed: i.e., ill-typed literals which do not abide by the lexical syntax for their respective datatype. There were 4,650 malformed datatype literals (2.7% of all typed literals) in our crawl: Table 8 summarises the top five datatypes to be instantiated with malformed values.

The two most common errors for xsd:dateTime stem from

²² In fact, at the time of writing, a Google search for this SHA1 string will result in nearly two million hits – seemingly almost all of which are FOAF RDF documents.
²³ cf. http://bleeper.de/powerboy/foaf
²⁴ cf. http://identi.ca/whataboutbob/foaf
²⁵ cf. http://www.aifb.uni-karlsruhe.de/Personen/viewPersonFOAF/foaf_1876.rdf
²⁶ cf. http://army.twit.tv/takeit2/foaf
²⁷ cf. http://tweet.ie/seank/foaf
Inverse-Functional Property   Void Value                                   Count
foaf:mbox_sha1sum             "08445a31a78661b5c746feff39a9db6e4e2cc5cf"     986
foaf:mbox_sha1sum             "da39a3ee5e6b4b0d3255bfef95601890afd80709"     167
foaf:homepage                 <http://>                                       11
foaf:mbox_sha1sum             ""                                               5
foaf:isPrimaryTopicOf         <http://>                                        2
Table 7: Count of the five most common void inverse-functional property values
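The effect of the void values in Table 7 on equality reasoning, and of blacklisting them, can be illustrated with a short sketch. It is a hedged illustration rather than the reasoner used for the analysis: triples are assumed to be `(subject, property, value)` tuples of strings, the names `equivalence_groups` and `same_as_pairs` are hypothetical, and the sample triples are invented. The second blacklist entry is computed rather than typed in, since that value is the SHA1 digest of the empty string.

```python
import hashlib
from collections import defaultdict

# Void values from Table 7, keyed by (property, value).
VOID_VALUES = frozenset({
    ("foaf:mbox_sha1sum", "08445a31a78661b5c746feff39a9db6e4e2cc5cf"),
    ("foaf:mbox_sha1sum", hashlib.sha1(b"").hexdigest()),  # da39a3ee...: SHA1 of ""
    ("foaf:mbox_sha1sum", ""),
    ("foaf:homepage", "http://"),
    ("foaf:isPrimaryTopicOf", "http://"),
})

def equivalence_groups(triples, ifps, blacklist=VOID_VALUES):
    """Group subjects that share a value for an inverse-functional property."""
    groups = defaultdict(set)
    for subject, prop, value in triples:
        if prop in ifps and (prop, value) not in blacklist:
            groups[(prop, value)].add(subject)
    # Only groups with at least two members give rise to owl:sameAs inferences.
    return {key: members for key, members in groups.items() if len(members) > 1}

def same_as_pairs(n):
    """Ordered owl:sameAs pairs, reflexive ones included, among n equal identifiers."""
    return n * n

IFPS = {"foaf:mbox_sha1sum", "foaf:homepage"}
VOID = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"
triples = [  # invented sample data
    ("ex:u1", "foaf:mbox_sha1sum", VOID),
    ("ex:u2", "foaf:mbox_sha1sum", VOID),
    ("ex:a", "foaf:homepage", "http://example.org/alice"),
    ("ex:b", "foaf:homepage", "http://example.org/alice"),
]

naive = equivalence_groups(triples, IFPS, blacklist=frozenset())  # merges ex:u1 and ex:u2
filtered = equivalence_groups(triples, IFPS)                      # void value ignored

pairs = same_as_pairs(986)  # 972196, the 972k figure for the 986 void entries
statements = pairs * 8      # 7777568, roughly 7.8M at eight triples per resource
```

With the blacklist applied, only the two resources sharing a genuine homepage are merged; without it, the two unrelated users with the void hash are merged as well.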
Datatype                  # Malformed   % Malformed
xsd:dateTime                    4,042         26.4%
xsd:int                           250          2.1%
xsd:nonNegativeInteger            232          0.6%
xsd:gYearMonth                     67          100%
xsd:gYear                          27          1.4%
Table 8: Top five datatypes having malformed values and percentage of all values which are malformed

Datatype Property         # Clashes   % Clashes
sl:creationDate               9,212        100%
scot:ownAFrequency              529        100%
owl:cardinality                 464       65.2%
ical:description                262       21.8%
wn20schema:tagCount             204        100%
Table 9: Top five properties with datatype-clashes and percentage of all values which cause clashes
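A check for malformed typed literals of the kind counted in Table 8 can be sketched with regular expressions over the lexical form. The patterns below are simplified approximations written for this illustration, not the full XML Schema lexical grammars (for example, they do not range-check months or days), and the function name `is_malformed` and the sample literals are hypothetical.

```python
import re

# Optional time-zone suffix: "Z" or a signed hh:mm offset (the colon is required).
TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Simplified lexical patterns for the datatypes listed in Table 8.
LEXICAL = {
    "xsd:dateTime": re.compile(
        r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + TZ, re.ASCII),
    "xsd:gYear": re.compile(r"-?\d{4,}" + TZ, re.ASCII),
    "xsd:gYearMonth": re.compile(r"-?\d{4,}-\d{2}" + TZ, re.ASCII),
    "xsd:int": re.compile(r"[+-]?\d+", re.ASCII),
    "xsd:nonNegativeInteger": re.compile(r"\+?\d+", re.ASCII),
}

def is_malformed(lexical, datatype):
    """Return True if the lexical form violates the (simplified) syntax of the datatype."""
    pattern = LEXICAL.get(datatype)
    if pattern is None:
        return False  # datatype not covered by this sketch: no judgement made
    if not pattern.fullmatch(lexical):
        return True
    if datatype == "xsd:int":
        # xsd:int is additionally restricted to the 32-bit signed range.
        return not -2**31 <= int(lexical) <= 2**31 - 1
    return False

# Invented sample literals.
samples = [
    ("2009-05-17T10:30:00Z", "xsd:dateTime"),      # well-formed
    ("2009-05-17T10:30:00+0100", "xsd:dateTime"),  # time-zone lacks ':'
    ("2009-05-17T10:30Z", "xsd:dateTime"),         # seconds missing
    ("True", "xsd:int"),                           # boolean given as int
    ("1994-05-01T00:00:00", "xsd:gYear"),          # full dateTime given as gYear
]
report = [is_malformed(lexical, datatype) for lexical, datatype in samples]
```

A consumer could fall back to treating a literal as plain whenever `is_malformed` returns True, while a publisher could run the same check before export.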
(i) the wasab.dk domain²⁸ whereby time-zones are missing the required ':' separator; and (ii) the soton.ac.uk domain²⁹ wherein the mandatory seconds-field is not specified. For xsd:int, almost all errors stem from the freebase.com domain whereby boolean values True and False are found³⁰. For xsd:nonNegativeInteger, all stem from the deri.ie domain³¹ where non-numeric strings are incorrectly used. Finally, for xsd:gYearMonth and xsd:gYear, all illegal usage comes from the dbpedia.org domain³² where full xsd:dateTime literals are used instead.

Publisher Recommendations: Clearly, malformed literals are quite common. In all examples, the errors can be resolved by simple syntactic fixes to the publishing framework, or removing or changing the datatype on the literal; one can conclude – especially in the absence of a popular validator for datatype syntax – that publishers are simply not aware of such issues.

Consumer Recommendations: Although datatype-aware agents could incorporate heuristics to shoulder common mistakes – e.g., publishers commonly omit the mandatory seconds field from date-time literals – not all such mistakes can feasibly be accounted for. Again – and in cases where the issue next discussed does not apply – such literals can simply be interpreted as plain literals.

 2.3.8    Literals incompatible with datatype range
Category: incoherent/inconsistent

Aside from explicitly typed literals, the range of properties may also be constrained to be a certain datatype, mandating respectively typed values for that property; e.g., one can say that the attribute property ex:bornOnDate has xsd:date values. A datatype clash can then occur if the property is given a value (i) that is malformed, or (ii) that is a member of an incompatible datatype. Table 9 provides counts of datatype clashes for the top five such properties.

The property sl:creationDate has the range xsd:date but all triples with sl:creationDate in the predicate position have plain-literal objects – all such usage originates from the semanlink.net tagging system³³; please note that plain literals without language tags are considered as xsd:strings [10] and so are disjoint with xsd:date. The property scot:ownAFrequency is given range xsd:float but only ever used in the domain linkeddata.org³⁴ with xsd:integer objects; xsd:integer is a sub-type of xsd:decimal and is disjoint with xsd:float [4]. owl:cardinality is often used with plain-literal objects³⁵ contrary to the defined range xsd:nonNegativeInteger. The property ical:description – defined as having range xsd:string – is almost always instantiated with a plain-literal object (99.8%); however, only the 21.8% which use language tags constitute an inconsistency³⁶. Finally, wn20schema:tagCount has range xsd:nonNegativeInteger but is only used with plain literals in the w3.org domain³⁷.

Publisher Recommendations: In all such cases, the root problem could be resolved if the vocabulary publisher removes the range on the property; in many cases such an approach may even be suitable: properties such as ical:description which are intended to have prose values should remove xsd:string constraints – optionally setting the range as the more inclusive rdf:PlainLiteral datatype to encourage literal values – and thus allow use of language tags. However, the majority of such datatype domain constraints are validly used to restrict possible values for the property and the onus is on data-publishers to abide by them.

Consumer Recommendations: Again, liberal agents could consider changing the defined range of the property to reflect some notion of "common" usage. Also, although the usage of properties often does not reflect the defined datatype range, in our dataset we found that the literal strings were almost always within the lexical space of the range datatype and that they were just poorly typed. We only found two properties which were given objects malformed according to the range datatype (before, we were concerned with malformed literals given an explicit datatype): viz. exif:exposureTime with range xsd:decimal (given 49 plain literals with malformed decimal values in one document³⁸) and cfp:deadline with range xsd:dateTime (given 3 plain literals with malformed date-time values in 3 documents³⁹). Thus, in all but the latter cases, liberal software agents could ignore mismatches

²⁸ cf. http://www.wasab.dk/morten/2004/08/photos/1/index.rdf
²⁹ cf. http://rdf.ecs.soton.ac.uk/publication/10006
³⁰ cf. http://rdf.freebase.com/rdf/aviation/aircraft_ownership_count
³¹ cf. http://www.deri.ie/fileadmin/scripts/foaf.php?id=320
³² cf. http://dbpedia.org/data/1994_San_Marino_Grand_Prix.xml
³³ cf. http://www.semanlink.net/tag/rdf.rdf
³⁴ cf. http://community.linkeddata.org/dataspace/kidehen2/subscriptions/Kingsley_Feed_Collection/tag/rdf
³⁵ 425 of 464 such examples stem from http://bioinfo.icapture.ubc.ca/subversion/Cartik/Object-OWLDL2.owl
³⁶ cf. http://www.ivan-herman.net/professional/CV/W3CTalks.rdf
³⁷ cf. http://www.w3.org/2006/03/wn/wn20/instances/wordsense-act-verb-3.rdf
³⁸ http://kasei.us/pictures/2005/20050422-WCCS_Dinner/index.rdf
³⁹ cf. http://sw.deri.org/2005/08/conf/ssws2006.rdf – an example of errors admittedly generated by an author of this paper.
between an object's datatype and that specified by the property's range, parsing the literal string into the value space of the range datatype; however, caution is required when considering non-standard datatypes: consider if a property ex:temp has the datatype ex:celcius as range and is used with an ex:fahrenheit value – clearly the value should not be parsed as ex:celcius although in its lexical space.

 2.3.9    OWL inconsistencies
Category: inconsistent

The Web Ontology Language (OWL) includes features – such as defining disjoint classes, inequality between resources, etc. – which can additionally be used to check if some data agrees with the underlying ontology; i.e., that the data is consistent.

To begin with, we quickly mention inconsistency checks which we performed, but which did not detect anything in the crawl. Firstly, the class owl:Nothing is intended to represent the empty class, and, as such, should not contain any members; in our dataset, we found no directly asserted members of owl:Nothing. Also, an inconsistency can occur when owl:sameAs and owl:differentFrom overlap; again, however, we found no such examples in our crawl – in fact, we found no usage of owl:differentFrom in the predicate position of a triple. Similarly, although we found two instances of owl:AllDifferent/owl:distinctMembers usage, none resulted in an inconsistency. Continuing, we also performed similar checks for instances of classes which were defined as complements of each other using owl:complementOf; however, again we found no owl:complementOf relations in our dataset. Briefly, we also performed simple checks for unsatisfiable concepts whereby, for example, one class is (possibly indirectly) both a subclass-of and disjoint-with another class: for each class found, we performed reasoning on an arbitrary membership of that class and checked whether any of the inferred memberships were of disjoint classes; however, we found no such concepts on the Web.

In fact, all inconsistencies we found in our crawl were related to memberships of disjoint classes. The OWL property owl:disjointWith is used to relate classes which cannot share members; disjoint classes are used in popular Web ontologies as an indicator of inconsistent information. For example, in FOAF the classes foaf:Person and foaf:Document are defined as being disjoint: something cannot be both. Resources can be asserted to be members of disjoint classes either directly by document owners, or inferred through reasoning. We only detected a small number of such direct assertions in our crawl – generally, a resource is asserted to be a member of one class in one document and a disjoint class in a remote document.⁴⁰

However, after reasoning on our dataset, there were 1,329 occurrences of inconsistencies caused by disjoint classes; Table 10 enumerates the top five.

Disjoint Classes                      # Instances
foaf:Agent ⊓ foaf:Document                    502
foaf:Organization ⊓ foaf:Person               328
foaf:Document ⊓ foaf:Person                   232
sioc:Container ⊓ sioc:Item                    194
sioc:Item ⊓ sioc:User                          35
Table 10: Top five instantiated pairs of disjoint classes

The most prominent cause of such problems stems from two incompatible FOAF exporters for LastFM data: the same resources are simultaneously defined as being of type foaf:Person in the opiumfield.com domain⁴¹ and inferred to be members of foaf:Document in the dbtune.org domain⁴². Again, there are many other exporters and domains which contribute; for example, an exporter of Wikipedia data in the sioc-project.org domain⁴³ uses the same URI to identify users and the users' Wikipedia profile page.

Publisher Recommendations: Such problems with inconsistent data – especially those arising from multiple sources – may be quite difficult to solve. The obvious and lazy solution is to remove the disjointness constraints from the relevant ontologies; however, these constraints are intended to flag nonsensical or conflicting information and removing them clearly does not solve the root cause. Currently, the main observed cause for such inconsistencies is the use of incompatible naming schemes – using URIs to identify two completely different things – most often across different domains; agreement must be reached on what is an appropriate identifier for the contentious resource.

Consumer Recommendations: There are two standard approaches for handling inconsistencies in agents incorporating reasoning: resolve or overlook; the former approach – which requires 'defeating' the 'marginal view' – may not be so in tune with the open philosophy of the Web, where contradiction could be considered a 'healthy' symptom of differing opinions. Rule-based reasoning approaches have the luxury of optionally overlooking inconsistencies, where inconsistent data can simply be flagged (e.g., see OWL 2 RL rules in [7] with false consequences). However, tableaux algorithms are less resistant to inconsistencies and are tied by the principle of explosion: ex contradictione quodlibet (from contradiction follows anything); some works focus on paraconsistent reasoning – tableaux reasoning tolerant to inconsistency – although such approaches are expensive in practice (cf. [14]). In any case, in either rule- or tableaux-based approaches – and depending on the application scenario – inconsistent data may be pre-processed with those triples causing inconsistencies dropped according to some heuristic measures.

2.4      Non-authoritative contributions

 2.4.1    Ontology-hijacking
Category: incoherent/hijack

In previous work, we encountered a behaviour which we termed "ontology hijacking" [12]: the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected: herein – and loosely – we define the notion of an authoritative document for a term as the document resolved by dereferencing the term, and consider all other (non-authoritative) documents as third-party documents (please see [12] for a more exhaustive discussion). Web ontologies/vocabularies published according to best-practices are thereby the only document authoritative for the terms in their namespace.

In our dataset, we found that 5,211 documents engaged in some form of ontology hijacking – most such occurrences were due to third party sources 'echoing' the authoritative definition of a class or property in their local ontology. However, we also encountered examples of third-parties redefining classes/properties. As an example, we found one document which redefines the core property rdf:type – defining nine of its properties as being the domain of rdf:type – effectively

⁴⁰ http://apassant.net/blog/2009/05/17/inconsistencies-lod-cloud
⁴¹ cf. http://rdf.opiumfield.com/lastfm/profile/danbri
⁴² cf. http://dbtune.org/last-fm/danbri.rdf
⁴³ cf. http://ws.sioc-project.org/mediawiki/mediawiki.php?wiki=http://en.wikipedia.org/wiki/User:Andy_Dingley
leading to every entity described on the Web being inferred as a member of those nine properties.44 Again, for example, we found 219 statements declaring foaf:Image (authoritatively defined as a class) to be a property; these were from the sembase.at domain (again see Footnote 15).

44 http://www.eiao.net/rdf/1.0

Publisher Recommendations: This particular issue focuses on how vocabulary publishers re-use existing vocabularies: we would thus particularly encourage vocabularies to extend external terms, and not redefine them. Such usage is more generally related to the principle of modularity, encouraging the modular design of Web vocabularies and avoiding the mess implied by the cross-definition of terms over the Web.

Consumer Recommendations: Clearly, on the Web, people should not be constrained in what they express and where they express it; however, to do useful reasoning, developers must take contextual information into account and provide some means of insulating ontologies from wayward external contributions. Again, in previous work we have described our system for performing reasoning over RDF Web data called SAOR [12], and found it essential to introduce our notion of authority when doing reasoning: in particular, we define our notion of an "authoritative rule application" which will not produce inferences from non-authoritative triples which redefine external terms. An orthogonal approach to the same problem is that of "quarantined reasoning" described in [5], which loosely constitutes "per-document" reasoning, and scopes inferences based on a closed notion of context derived from the implicit and explicit imports of each input document, thus excluding third-party contributions (please see [12] for a more in-depth comparison).

3. RELATED WORK

   Earlier papers analysing problems in RDF Web data and the uptake of standards mainly focus on the categorisation and validation of documents with respect to the various OWL species. In [1], the authors performed validation, based on OWL-DL constraints, for a sample group of 201 OWL ontologies which were all found to be OWL Full for mainly trivial reasons; the authors then suggested means of patching the ontologies to be OWL-DL conformant. A similar but more extensive survey was conducted in [19] over 1,275 ontologies; the authors provided a categorisation of the expressivity and species, and a discussion related to patching of the ontologies. At the moment, we do not offer species validation for RDFS/OWL and our scope is much broader with respect to validation.

   In [15], the authors describe common user errors in modelling OWL-DL ontologies. In [17], the authors describe some error checking for OWL ontologies using integrity constraints involving the Unique Name Assumption (UNA) and also the Closed World Assumption (CWA). Similarly, in [18], various errors and constraints are introduced for error checking; the primary contribution is the introduction of five 'incongruencies' (e.g., an individual not satisfying a cardinality constraint according to UNA/CWA) with cases, causes and methods of detection. However, all of these papers have a decidedly more OWL-centric focus than our work and provide no analysis or discussion of Web data.

   In [6], the authors provided an in-depth analysis of the landscape of RDF Web data in a crawl of 300M triples. They also identified some statistics about classes and properties (SWTs) in RDF data; e.g., they found that 2.2% of classes and properties had no definition and that 0.08% of terms had both class and property meta-usage. However, again our focus is much broader in characterising errors in RDF Web data.

4. WHAT ABOUT ALICE?

   We can now see that although our protagonist Alice is purely hypothetical, her adventures in Linked Data wonderland are disappointingly less so; in our analysis, we have shown the types of issues in RDF data on the Web that have made her journey so disconcerting. We have presented, provided statistics and examples for, and discussed a plethora of different types of errors, hopefully raising awareness of such issues amongst data publishers and developers of agents who wish to access and interpret such data. As typified by Alice, such issues can dramatically lower the quality of some applications, and consequently their end-user appeal; the errors do not come from the engine but from the underlying data, and thus reasonable efforts to resolve data issues are as important as developing tolerant applications.

   So, how can we help Alice?

   We have already determined that many such issues are easily resolvable by the publisher and therefore concluded that publishers are unaware of the problems resident in their data. One solution would be to provide a system for validating RDF data being published to the Web: several systems exist but do not cover the broad range of issues discussed in this paper. From a syntactic point of view, the first validator available was the W3C RDF Validator45, which can check the syntax of any RDF/XML document (however, not datatype syntax). The DAML validator46 provides checking of a large number of issues; however, the validator is out of date (it does not support OWL) and, at the time of writing, does not work. With regard to the protocol issues, the online Vapour validator47 [3] aims at validating the compliance of published RDF data (either vocabularies or instance data) with Linked Data principles [2]. The online Pellet [16] validator48 enables species validation as well as other criteria we identified, such as checking ontology consistency and finding unsatisfiable concepts.

   There are also a number of command-line validators. The Validating RDF Parser (VRP)49 operates on specified RDF Schema constraints, with some support for datatypes. The Eyeball50 project provides command-line validation of RDF data for common problems including use of undefined properties and classes, poorly formed namespaces, problematic prefixes, literal syntax validation and other optional heuristics.

   However, none of the above validators cover the plethora of issues we have encountered; thus, we have developed and now provide RDF:Alerts: http://swse.deri.org/RDFAlerts/. Given a URI, the system provides validation for many of the issues enumerated in this paper; Figure 2 shows a screenshot of feedback for an erroneous document. We further intend to extend the tool (to include all of the presented issues and suggestions from the community) and to improve usability; we may also consider extending such a tool to provide intermittent automatic reporting to publishers who opt in, depending on the perceived demand for such a service.

45 http://www.w3.org/RDF/Validator/
46 http://www.daml.org/validator/
47 http://validator.linkeddata.org
48 http://www.mindswap.org/2003/pellet/demo.shtml
49 http://139.91.183.30:9090/RDF/
50 http://jena.sourceforge.net/Eyeball/

   Still, other issues, particularly those relating to inter-dataset incompatibility, naming, and inconsistent use of vocabulary terms, may be more difficult to resolve. Indeed, we have
also not properly discussed issues introduced by versioning, where, for example, a vocabulary maintainer makes changes to the definition of a term breaking backwards-compatibility with legacy usage of that term; indeed, we recognise that casual versioning may explain some of the discrepancies we have encountered in this paper, though systematic detection of such errors is difficult given our static snapshot dataset.

Figure 2: Screenshot of validation results from RDF:Alerts system.

   The resolution of such errors may sometimes require compromise between maintainers of ontologies and maintainers of exporters which populate the ontologies' terms, reflecting the current social and community-driven nature of Web publishing. Reflecting such community-driven efforts, consideration is being given to more open ontology editing and creation. In VoCamp events51, people from different backgrounds and with different perspectives meet to work on modelling lightweight ontologies for immediate use. In order to allow ontologies to evolve according to user needs, initiatives such as semantic wikis for ontology management [13] and services such as OpenVocab52 allow users to more freely interact with the ontology terms they wish to use and share. Although such approaches may again suffer from human error and disagreement (and have many open issues such as versioning and editing privileges), such community-driven efforts could lead to a more extensive vocabulary of terms for use on the Web.

   We have also initiated a community-driven effort which we call "The Pedantic Web Group"53, which aims to engage with publishers and help them improve the quality of their data. Firstly, we have provided some pragmatic educational material for publishers, including a list of validation tools and of frequently observed problems in RDF publishing. Secondly, we have created a mailing list for actively contacting publishers about their mistakes and for various discussions on the quality of the Web of Data; subscription is open to the community. Indeed, such efforts may be the only means to resolve issues which require the co-ordination of multiple publishers. As such, we see the Pedantic Web Group as a go-to point for tackling publishing-related issues on the Web of Data, and as a community-driven means of promoting better quality publishing for the Web of Data.

   To conclude, we would like to replace the present hypothetical Alice with a possible future Alice who is again browsing the Web of Data, this time using an application which has been tempered for noisy data, where the documents have been validated, consistent identifiers used, and resources described using a rich vocabulary of community-endorsed terms. We hope that such an Alice might be amazed, this time for the right reasons.

51 http://vocamp.org/wiki/Main_Page
52 http://open.vocab.org/
53 http://pedantic-web.org/

5. REFERENCES

[1] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 668–682. Springer, November 2004.
[2] T. Berners-Lee. Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. http://www.w3.org/DesignIssues/LinkedData.html.
[3] D. Berrueta, S. Fernández, and I. Frade. Cooking HTTP content negotiation with Vapour. In Proceedings of 4th Workshop on Scripting for the Semantic Web (SFSW2008), June 2008.
[4] P. V. Biron and A. Malhotra. XML Schema part 2: Datatypes second edition. W3C Recommendation, Oct. 2004. http://www.w3.org/TR/xmlschema-2/.
[5] R. Delbru, A. Polleres, G. Tummarello, and S. Decker. Context dependent reasoning for semantic documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), Karlsruhe, Germany, Oct. 2008.
[6] L. Ding and T. Finin. Characterizing the Semantic Web on the Web. In Proceedings of the 5th International Semantic Web Conference, November 2006.
[7] B. C. Grau, B. Motik, Z. Wu, A. Fokoue, and C. Lutz. OWL 2 Web Ontology Language: Profiles. W3C Working Draft, Apr. 2008. http://www.w3.org/TR/owl2-profiles/.
[8] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In 5th International Semantic Web Conference, pages 258–271, 2006.
[9] M. Hausenblas. Exploiting linked data to build applications. IEEE Internet Computing, 13(4):68–73, 2009.
[10] P. Hayes. RDF semantics. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/rdf-mt/.
[11] T. Heath. How will we interact with the web of data? IEEE Internet Computing, 12(5):88–91, 2008.
[12] A. Hogan, A. Harth, and A. Polleres. Scalable Authoritative OWL Reasoning for the Web. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
[13] M. Krötzsch, S. Schaffert, and D. Vrandečić. Reasoning in semantic wikis. In Reasoning Web, pages 310–329, 2007.
[14] Y. Ma, P. Hitzler, and Z. Lin. Algorithms for Paraconsistent Reasoning with OWL. In ESWC, pages 399–413, 2007.
[15] A. L. Rector, N. Drummond, M. Horridge, J. Rogers, H. Knublauch, R. Stevens, H. Wang, and C. Wroe. OWL pizzas: Practical experience of teaching OWL-DL: Common errors & common patterns. In EKAW, pages 63–81, 2004.
[16] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5(2):51–53, 2007.
[17] E. Sirin, M. Smith, and E. Wallace. Opening, closing worlds - on integrity constraints. In OWLED, 2008.
[18] J. Tao, L. Ding, and D. L. McGuinness. Instance data evaluation for semantic web-based knowledge management systems. In HICSS, pages 1–10, 2009.
[19] T. D. Wang, B. Parsia, and J. A. Hendler. A survey of the web ontology landscape. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), pages 682–694, Athens, GA, USA, Nov. 2006.
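
As a supplementary illustration of the "authoritative rule application" idea described under Consumer Recommendations, the following is a minimal sketch and not the SAOR implementation from [12]. It assumes a deliberately simplified notion of authority: a source document counts as authoritative for a term only when the term's URI, with its fragment removed, equals the document's URI. The precise definition is given in [12]; this simplification, the function names, and all example.org / example.net URIs are hypothetical and chosen only for illustration. The sketch applies the rdfs:domain rule while ignoring domain axioms that a third-party document states about a property it does not own.

```python
from urllib.parse import urldefrag

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"


def is_authoritative(source, term):
    """Simplified assumption: a document is authoritative for a term when
    the term's URI without its fragment is the document's own URI."""
    return urldefrag(term).url == urldefrag(source).url


def authoritative_domain_inferences(quads):
    """Apply the rdfs:domain rule over (s, p, o, source) quads, using only
    domain axioms stated in a document authoritative for the property."""
    domains = {}
    for s, p, o, source in quads:
        if p == RDFS_DOMAIN and is_authoritative(source, s):
            domains.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o, _source in quads:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


# Hypothetical vocabulary, third-party document and instance data.
VOCAB_DOC = "http://example.org/vocab"
ROGUE_DOC = "http://rogue.example.net/data"
KNOWS = VOCAB_DOC + "#knows"
PERSON = VOCAB_DOC + "#Person"
SPAMMER = ROGUE_DOC + "#Spammer"
ALICE = "http://alice.example.org/foaf#me"
BOB = "http://bob.example.org/foaf#me"

quads = [
    (KNOWS, RDFS_DOMAIN, PERSON, VOCAB_DOC),   # stated by the vocabulary itself
    (KNOWS, RDFS_DOMAIN, SPAMMER, ROGUE_DOC),  # third-party redefinition
    (ALICE, KNOWS, BOB, "http://alice.example.org/foaf"),
]

inferred = authoritative_domain_inferences(quads)
```

With this input, the only inference drawn for Alice uses the domain stated in the vocabulary's own document; the third-party domain axiom is never applied, so it cannot cause every user of the property to be typed with an unrelated class.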