Weaving the Pedantic Web
Aidan Hogan†, Andreas Harth†‡, Alexandre Passant†, Stefan Decker†, Axel Polleres†
† Digital Enterprise Research Institute, National University of Ireland, Galway
‡ AIFB, Karlsruhe Institute of Technology, Germany
†{firstname.lastname}@deri.org, ‡harth@kit.edu
ABSTRACT
Over a decade after RDF has been published as a W3C recommendation, publishing open and machine-readable content on the Web has recently received a lot more attention, including from corporate and governmental bodies; notably thanks to the Linked Open Data community, there now exists a rich vein of heterogeneous RDF data published on the Web (the so-called "Web of Data") accessible to all. However, RDF publishers are prone to making errors which compromise the effectiveness of applications leveraging the resulting data. In this paper, we discuss common errors in RDF publishing, their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.

1. INTRODUCTION
Based on the simple principle of using URIs to name and link things - not just documents - the Resource Description Framework (RDF) offers a standardised means of representing information on the Web such that: (i) structured data is available to all over the Web; (ii) data can be handled through standard APIs and applications; (iii) the meaning of the data is well-defined using lightweight ontologies (or vocabularies); and (iv) data is interoperable with other RDF on the Web and can be re-used and extended by other publishers and application developers.

Over the past few years, many Web publishers have turned to RDF as a means of disseminating information in an open and machine-interpretable way, resulting in a "Web of Data" which now includes interlinked content exported from corporate bodies (e.g., BBC, New York Times, Freebase), community efforts (e.g., Wikipedia, GeoNames), biomedical datasets (e.g., DrugBank, Linked Clinical Trials) - even UK governmental entities, where public sector organisations must now additionally disclose their consultations in RDF¹. Applications and search engines are now starting to exploit this rich vein of structured and linked data [9].

For example, Figure 1 shows the results returned by the VisiNav (http://visinav.deri.org/) system for the query "show me all American female models who have also won an Academy Award for Best Supporting Actress", constructed using facets in the user-interface; the results are automatically aggregated from thirteen distinct sources.

However, all has not been plain sailing: this new paradigm in Web publishing and interaction [11] has inevitably led to many teething problems. As we will discuss in this paper, there exists a lot of noise within the Web of Data which inhibits applications from effectively exploiting this rich lode of open, well-defined and structured information.

To illustrate, we introduce Alice: a hypothetical end-user of an application for searching and browsing the Web of Data. Alice loads some interesting data about herself and is immediately impressed by the integrated view of data from publication, blog, social network and workplace exporters; however, for every second resource she explores, the application cannot locate or parse any relevant data. She tries to load her publications into a calendar view, but one quarter of them are missing as the dates/times contain illegal values. She wants more information relating to properties and classes used to describe herself, but some do not exist; discouraged, she clicks on a friend of hers but finds that he has 1,169 names and email addresses (she knew him as "Bob"). She begins to notice that all resources she explores are instances of nine strange properties - and then the final straw: she now finds out that her professor is actually a document.

We will provide evidence in this paper as to how Alice could have had such an experience browsing the Web of Data. In so doing, we will take stock of some of the difficulties currently apparent in RDF publishing, and discuss how we - and the now decade old Semantic Web community at large - can help to improve the current and future quality of RDF data published on the Web.
We would like to acknowledge and thank Richard Cyganiak, Michael Hausenblas, Stéphane Corlosquet and Antoine Zimmermann with whom we co-founded the Pedantic Web Group. We would also like to thank anonymous reviewers of various incarnations of this paper for their valued feedback. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET Postgraduate scholarship.
¹ http://coi.gov.uk/guidance.php?page=315
Figure 1: Results from VisiNav showing American models who have won Best Supporting Actress in the Academy Awards.

Copyright is held by the author/owner(s).
LDOW2010, April 27, 2010, Raleigh, USA.
2. WEB OF DATA ANALYSIS
We herein present some analysis based on an RDF dataset retrieved from the Web in April 2009 using MultiCrawler [8]. We performed a seven-hop breadth-first crawl for RDF/XML documents where we enforced a maximum of 5,000 crawled documents per pay-level-domain (or PLD, viz.: a domain that requires payment, such as deri.ie or data.gov.uk) so as to ensure a diverse Web dataset covering a wide spectrum of publishers. Indeed, we only crawled for RDF/XML and not for other formats such as RDFa; RDF/XML is currently by far the most popular format with RDFa growing in popularity. Still, one could expect a small percentage of documents to contain - e.g., RDFa metadata - which we admittedly overlook in our illustrative statistics.

The crawl accessed 149,057 URIs (including 39,439 redirects), 54,836 (36.8%) of which resulted in valid RDF/XML documents (almost precisely 50% excluding redirects). The final dataset contains 12,534,481 RDF statements mentioning 1,598,521 URIs - including 5,850 classes and 9,507 properties.

Based on this dataset, we present selected issues in RDF data published on the Web. We focus on errors that we can systematically detect, and thus one should not consider the following an exhaustive list; similarly, it is important to note - given the diminutive scale and perhaps even age of our dataset - that the statistics presented herein are intended to be illustrative, not exhaustive. That said, we still claim that the analysis of our dataset offers a valuable insight into current issues relating to RDF Web publishing: although interpolating the exact prevalence of such problems to the entire Web of Data may not be sensible, our statistics should offer an indication as to the relative and approximate prevalence of such problems.

Throughout the paper, we endeavour to present examples of the various publishing errors by giving links to RDF Web documents exhibiting such. Note that the purpose of providing these examples is to: (i) give concrete and tangible examples of the errors, giving indications as to how they might have occurred, how they might be presently solved, and how they could be avoided in future; (ii) show that noise is present in a diverse range of sources, describing a diverse range of domains; and (iii) show that errors in RDF publishing are not only the result of inexperience - we show examples of errors in academic publishing, community-based publishing, popular vocabularies, and even documents published by the authors of this paper. The purpose of the examples is thus not to "point the finger", but to give an honest appraisal of such issues so as to identify possible directions forward.

For posterity, we provide snapshots of documents and enumerate the namespace prefixes referenced in this paper at http://aidanhogan.com/pedantic/.

In order to structure the highlighted issues, we identify four categories of symptoms:

• incomplete: equatable to a dead-link in the current HTML web - a software agent will not be able to retrieve data relevant to a particular task;
• incoherent: a software agent will not be able to correctly interpret some local piece of data as either the publisher or vocabulary maintainer would expect;
• hijack: a software agent will not be able to correctly interpret some remote piece of data as would be expected;
• inconsistent: a software agent will interpret a contradiction in the data.

Finally, we will also endeavour to provide discussion for each issue, both from the perspective of publishers and from the perspective of data consumers.

We begin with issues relating to how data is found and accessed; then discuss parsing and syntax issues; look at reasoning issues, including inconsistent data; and finally, introduce and discuss ontology hijacking.

2.1 URI/HTTP: accessibility and dereferencability
As previously alluded to, the Linked Open Data movement has been integral to RDF publishing on the Web, emphasising four basic principles [2]: (i) use URIs as names for things; (ii) use HTTP URIs so that those names can be looked up; (iii) provide useful information when a look-up on that URI is made; and (iv) include links using external URIs.

With regard to providing information about a resource upon an HTTP lookup of its URI - called dereferencing - emphasis is placed on providing information in RDF and disambiguating identification of information resources (document URIs) from non-information resources (entities described in those documents). Now, using statistics of our crawl which consisted of lookups on URIs in the data, we can draw some initial conclusions relating to Linked Data practices on the Web of Data.

2.1.1 Dereferencability issues
Category: incomplete
In accordance with "use HTTP URIs so that those names can be looked up", dereferencing a URI consists of retrieving content as defined by RFC3986².

Firstly, 5.3% of URIs returned an error (4xx client error/5xx server error) response code, in conflict with the third Linked Data principle above: "provide useful information when a look-up on that URI is made". In most such - admittedly relatively rare - cases simply nothing exists at that location and a 404 Not Found code is returned (4.3% overall, 81% of error codes).

Secondly, 26.5% of URI lookups resulted in a redirect (30x code). In fact, Linked Data principles encourage the use of redirects, particularly for identifying non-information resources (i.e. URIs which denote things rather than files): specifically, the 303 redirect is recommended. Of the redirection URIs, 55.1% (14.6% of total) offered a 303 redirect to another location as recommended; however, 30.2% (8% of total) used a 302 redirect and the remaining 14.7% (3.9% of total) used a 301 redirect.

In the machine-oriented world of Linked Data, publishers should be even more careful to avoid broken links and to make URIs dereferencable, thus enabling automatic data-access for Semantic Web applications and providing them - and ultimately end-users - a complete, coherent picture.

Publisher Recommendations: Publishers should carefully follow Linked Data best practices when "minting" URIs.

Consumer Recommendations: Applications should not expect high recall when dereferencing URIs found in RDF data: for high recall, applications may have to consider pre-fetching/data-warehousing approaches.

² http://labs.apache.org/webarch/uri/rfc/rfc3986.html

2.1.2 No structured data available
Category: incomplete
Excluding redirects, 92.8% of URIs return a 200 OK response code along with content; but what do these documents contain? Linked Data principles require that useful
data be returned upon lookup of a URI from the Web of Data, with particular emphasis on returning RDF. Thus, from our crawl requesting application/rdf+xml content, we would reasonably expect a high percentage of documents returning RDF/XML³.

Of the 101,709 URIs which returned content with response code 200 OK, we observed that only 45.4% of URIs report a content-type application/rdf+xml, with a further 34.8% reporting text/html. Commonly in RDF data, information resource URIs are used to identify themselves (or more problematically to identify related resources); for example, in RDF, HTML documents are naturally identified using their native URI. In almost all instances of a non-RDF content-type, the URI is simply a document without any supporting RDF metadata. Hence, as before, Semantic Web agents will not be able to properly exploit the content as expected by end-users.

Publisher Recommendations: HTML pages - especially those whose URIs are mentioned in RDF documents - could be embedded with RDFa.

Consumer Recommendations: A possible - and admittedly quick and dirty - solution to avoid dead-links would be to convert the header information of HTTP URIs into RDF using the terms from the W3C published "HTTP Vocabulary in RDF 1.0"⁴. More ambitiously, a system may consider extracting RDF from non-RDF content, such as the title of an HTML page or metadata for images. Such measures would ensure that at the very least, some structured information can be retrieved for a wider variety of URIs, thus avoiding 'dead-links'.

⁴ http://www.w3.org/TR/HTTP-in-RDF10/

2.1.3 Misreported content-types
Category: incomplete
An HTTP response contains an optional header field stating the content type of the returned file. A consumer application can then decide from the header whether the content is suitable for consumption, and whether the content should be accessed. However, we observed that RDF/XML content is commonly returned with a reported content-type other than application/rdf+xml: from our crawl, 16.9% of valid RDF/XML documents were returned with an incompatible or more generic content type; e.g., text/xml (9.5%), application/xml (5.9%), text/plain (1%) and text/html (0.4%).

Publisher Recommendations: Publishers should ensure that the most specific available MIME-type is reported for their content.

Consumer Recommendations: Herein, a trade-off exists for consumer agents: an agent with emphasis on performance may still use the reported content-type to filter non-supported content formats, whereas an agent with more emphasis on recall should relax - or possibly ignore - filtering based on reported content-type.

2.2 Syntax errors

2.2.1 RDF/XML Syntax Errors
Category: incomplete
At the outset of the Semantic Web movement, publishers opted to employ the existing XML standard to encode RDF; RDF/XML is still the most popular means of publishing RDF today. Although its syntax is quite complex, we encountered relatively few syntax errors in RDF/XML documents accessed during our crawl. Of the 46,136 documents which return response code 200 and content-type application/rdf+xml, only 571 (1.2%) were invalid RDF/XML documents; usually caused by simple errors such as unescaped special characters, misuse of RDF/XML shortcuts, and omission of namespace. Again, such issues are relatively rare, presumably due to use of mature RDF/XML APIs for producing data and the popularity of the W3C RDF/XML validation service⁵.

Publisher Recommendations: Publishers should use an appropriate syntactic validator for their content, or only use trusted APIs to produce content.

Consumer Recommendations: Applications could possibly investigate the use of tools for fixing syntax errors: e.g., use standard XML syntax cleaning tools for XML-based RDF syntax. We have no experience in using such tools, and they would have to be evaluated in the given application scenario: again in any case, syntax errors are admittedly rare and such concerns would only apply to applications with a large emphasis on high recall.

⁵ http://www.w3.org/RDF/Validator/

2.3 Reasoning: noise and inconsistency
Thus far, we have seen that about half of the URIs used to identify resources in the Web of Data resolve to some valid RDF/XML data. We now look at issues relating to the data contained within those documents: i.e., what they say and how the machine interprets the data.

Layered on top of RDF are the core RDF Schema (RDFS) and Web Ontology Language (OWL) standards, which allow for defining the semantics or meaning of RDF data through definitions of classes and properties in schemas/ontologies. For example, the Friend Of A Friend (FOAF)⁶ project publishes OWL definitions of a set of classes and properties which forms a structured and popular vocabulary for describing people in RDF.

Classes represent a grouping of resources: e.g., FOAF defines the class foaf:Person and one can assign ex:Alice and ex:Bob as members of this class. Using RDFS and OWL, a publisher can then define characteristics of such classes (and, thus, of all of its members); e.g., by defining foaf:Person as a subclass of foaf:Agent, FOAF implies that ex:Alice, ex:Bob and all other foaf:Persons are also members of the class foaf:Agent.

Properties represent the definable attributes of resources, and also relationships that are possible between resources; e.g., FOAF defines foaf:knows as a relationship that can exist from one member of foaf:Person to another, or that members of foaf:Person can have the attribute foaf:surname which has a string value. Other publishers across the Web can then re-use and extend definitions of classes and properties - such as the ones from FOAF.

Thereafter, reasoning can use the semantics of these classes and properties to interpret the data, and to infer new knowledge (e.g., that ex:Alice is also a foaf:Agent, or that if ex:Alice foaf:knows ex:Bob, then ex:Alice and ex:Bob are foaf:Persons).

Some errors in RDF only reveal themselves after reasoning - e.g., some unforeseen incorrect inferences occur - and as such, can stay hidden from the publisher. In this section, we will look at issues relating to the interpretation of RDF data on the Web - in particular focussing on reasoning issues; in order to shed light on such issues, we applied reasoning over our crawl using the Scalable Authoritative OWL Reasoner (SAOR) [12], which we will discuss as pertinent.

⁶ http://foaf-project.org/

2.3.1 Atypical use of collections, containers and reification
Category: incoherent
There is a set of URI names which are reserved by the RDF specification for special interpretation in a set of triples; although the RDF specification does not formally restrict usage of these reserved names, misuse is often inadvertent.

We firstly discuss the RDF collection vocabulary, which consists of four constructs: {List, first, rest, nil}. Indeed, few examples of atypical collection usage exist on the Web, probably attributable to widespread usage of the RDF/XML shortcut rdf:parseType="Collection" for specifying collections; this shortcut shields users from the underlying complexity of collections on the triple level and generally ensures typical collection use. The only atypical collection usage we found in our Web-crawl was one document which specified resources of type List without first or rest properties attached⁷.

A related issue is that of atypical container usage, which is concerned with the following constructs: Alt, Bag, Seq, _1 ... _n and the syntactic keyword li. Again, atypical container usage is uncommon on the Web: we found one domain (viz. semanticweb.org⁸) which, in 229 documents, exports RDF containers without choosing a type of Alt, Bag or Seq.

Finally, there may exist atypical usage of the reification constructs: Statement, subject, predicate, object. However, in our dataset we only found one such example⁹ wherein predicate is assigned a blank node value and used alone without subject or object.

Publisher Recommendations: Where possible, publishers should abide by the standard usage of such RDF terms to enable interoperability.

Consumer Recommendations: Although we found that atypical usage of the core RDF terms is relatively uncommon, consumer applications should be tolerant of such atypical usage; for example, developers of reasoning engines which operate over Web data and consider RDF collections as part of complex OWL class descriptions - and even though we did not find such usage in our dataset - should implement simple checks to ensure that the respective engine is tolerant to cyclic, non-terminating and branching collection descriptions.

⁷ http://scripts.mit.edu/~kennylu/myself.rdf
⁸ cf. http://iswc2006.semanticweb.org/submissions/Harth2006dq_Harth_Andreas
⁹ http://web.mit.edu/dsheets/www/foaf.rdf

2.3.2 Use of undefined classes and properties
Category: incoherent
Oftentimes on the Web of Data, properties and classes are used without any formal definition. For example, publishers might say that ex:Alice ex:colleague ex:Bob even though ex:colleague is not defined as a property. Again, although such practice is not prohibited, by using ad-hoc undefined classes and properties publishers make automatic integration of data less effective and forego the possibility of making inferences through reasoning.

From our crawl, 1.78M triples (14.3% of all triples) use undefined properties, appearing in 39.7k documents (72.4% of valid RDF/XML documents): Table 1 enumerates the top five.¹⁰

Undefined Property      Triples Used
foaf:member_name        148,251
foaf:tagLine            148,250
foaf:image              140,791
cycann:label            123,058
qdoslf:neighbour        100,339
Table 1: Count of the top five properties used without a definition

For example, from our crawl, the livejournal.com domain uses the properties foaf:member_name and foaf:tagLine in a total of almost 300k triples¹¹ - the FOAF vocabulary does not contain these properties and they are not defined elsewhere; such a practice of deliberately inventing undefined properties within a related namespace is common on the Web. Sometimes publishers make simple spelling mistakes: again, the property foaf:image is incorrectly used instead of foaf:img in the livejournal.com domain; to take another example, the term qdoslf:neighbour is commonly used - in 100k triples - instead of the property qdoslf:neighbours defined in the namespace.¹²

Similarly, there were 1.01M triples (8.1%) mentioning undefined classes in 21.3k documents (38.8%); the top five instantiated such classes are enumerated in Table 2. None of the first three classes, nor the last class, is defined in the dereferenced documents; for example, all of the sioc:UserGroup instances come from the apassant.net domain¹³. To take another example, the class politico:Term is generically described in the dereferenced document, but is neither implicitly nor explicitly typed as a class.

Undefined Class         Triples Used
sioc:UserGroup          21,395
rss:item                19,259
linkedct:link           17,356
politico:Term           14,490
bibtex:inproceedings    11,975
Table 2: Count of the top five classes used without a definition

Publisher Recommendations: Many such errors are indeliberate and due to spelling or syntactic mistakes resolvable through minor fixes to the respective ontologies or exporters. Where terms have been knowingly invented, we suggest that the term be recommended as an addition to the respective ontology - or defined in a separate namespace - to enable re-use.

Consumer Recommendations: Liberal consumer applications could, for example, use fuzzy string matching techniques - e.g., Levenshtein distance measures - between undefined classes and properties encountered in the data, and classes and properties defined in the vocabularies. Generally however, consumer applications can usually overlook such mistakes and simply accept the consequence of incomplete reasoning for triples using such undefined terms.

¹⁰ It is important to note that herein, when we mention "undefined" classes or properties, we loosely refer to classes or properties "not defined in our crawl". In any case, our crawl would contain any property- or class-descriptions published according to best practices (i.e., using dereferencable terms).
¹¹ cf. http://danbri.livejournal.com/data/foaf
¹² cf. http://foafbuilder.qdos.com/people/danbri.org/foaf.rdf
¹³ cf. http://apassant.net/home/2007/12/flickrdf/data/people/36887937@N00 - indeed the authors herein are also prone to making simple errors in their publishing.

2.3.3 Misplaced classes/properties
Category: incoherent
Sometimes, a URI defined as a class is used as a property (appears in the predicate position of a triple) or, conversely, a URI defined as a property is used as a class (appears in the object position of an rdf:type triple); although not prohibited, such usage is usually inadvertent and can ruin the machine-interpretation of the associated data.
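As a rough illustration of the two kinds of misplacement described in Section 2.3.3, the following minimal Python sketch counts classes found in the predicate position and properties found in the object position of an rdf:type triple. It is not the method used for the crawl analysis: the representation of triples as tuples of prefixed-name strings, the function name find_misplaced_terms, and the example data are all assumptions made for illustration, and only explicit rdf:type declarations are considered (no reasoning).

```python
from collections import Counter

RDF_TYPE = "rdf:type"
# Assumed sets of terms whose instances count as declared classes/properties.
CLASS_TYPES = {"rdfs:Class", "owl:Class"}
PROPERTY_TYPES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}


def find_misplaced_terms(triples):
    """Count classes used as predicates and properties used as rdf:type objects."""
    triples = list(triples)
    # Terms explicitly declared as classes or as properties somewhere in the data.
    classes = {s for s, p, o in triples if p == RDF_TYPE and o in CLASS_TYPES}
    properties = {s for s, p, o in triples if p == RDF_TYPE and o in PROPERTY_TYPES}

    # A declared class appearing in the predicate position of any triple.
    class_as_predicate = Counter(p for _, p, _ in triples if p in classes)
    # A declared property appearing as the object of an rdf:type triple.
    property_as_class = Counter(
        o for _, p, o in triples if p == RDF_TYPE and o in properties
    )
    return class_as_predicate, property_as_class


# Hypothetical example data (not taken from the crawl).
example = [
    ("foaf:Image", RDF_TYPE, "rdfs:Class"),
    ("foaf:knows", RDF_TYPE, "owl:ObjectProperty"),
    ("ex:Alice", "foaf:Image", "ex:alice.jpg"),  # class in predicate position
    ("ex:Bob", RDF_TYPE, "foaf:knows"),          # property in class position
    ("ex:Alice", "foaf:knows", "ex:Bob"),        # typical usage
]

as_predicate, as_class = find_misplaced_terms(example)
print(dict(as_predicate))  # {'foaf:Image': 1}
print(dict(as_class))      # {'foaf:knows': 1}
```

Note that such a check only reports that a term is both declared one way and used the other; it cannot tell whether the declaration or the usage is the one at fault, which still requires inspecting the source documents.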
Table 3 shows the top five classes used as a property in our crawl. In fact, rdfs:range is a core RDFS property, but is defined in one document¹⁴ as a class; hence the 8,012 occurrences are valid use of the property and the single declaration of rdfs:range as a class is at fault (this is also an instance of ontology hijacking, which we will discuss in Section 2.4). Most occurrences of the foaf:Image class used as a property stem from the sembase.at domain¹⁵; here the foaf:depiction property would be more suitable. Use of rdfs:Class as a property comes from the ajft.org and rdfweb.org domains¹⁶ where rdfs:Class is seemingly mistaken as rdf:type. The class wot:PubKey is mistakenly used instead of wot:hasKey¹⁷. Misuse of foaf:OnlineAccount stems from one document¹⁸ wherein the RDF/XML shortcut rdf:parseType="Resource" is used inappropriately, causing parsing of foaf:OnlineAccount elements as predicates.

Class                 # Misplaced
rdfs:range            8,012
foaf:Image            639
rdfs:Class            94
wot:PubKey            18
foaf:OnlineAccount    15
Table 3: Top five "classes" used in the predicate position of a triple

After reasoning, more such errors were discovered, particularly in the affymetrix.com domain¹⁹ which describes genes and mistakenly uses rdfs:subClassOf to assert subsumption relations between properties (amongst many other issues); this resulted in properties - which, combined, were used in 37,454 triples - being typed as classes.

Conversely, the usage of properties in the class position - viz. the object position of an rdf:type triple - is much less common; Table 4 lists the results, with most errors stemming from one document²⁰.

Property              # Misplaced
foaf:knows            4
foaf:name             4
foaf:sha1             2
swrc:author           1
foaf:based_near       1
Table 4: Top five properties found in the object position of an rdf:type triple

¹⁴ http://www.w3.org/2000/10/swap/infoset/infoset-diagram.rdf
¹⁵ cf. http://wiki.sembase.at/index.php/Special:ExportRDF/Dieter_Fensel
¹⁶ cf. http://swordfish.rdfweb.org/discovery/2004/01/www2004/files/1101776794087.rdf
¹⁷ cf. http://www.snell-pym.org.uk/alaric/alaric-foaf.rdf
¹⁸ cf. http://tommorris.org/foaf
¹⁹ cf. http://affymetrix.com/community/publications/affymetrix/tmsplice/all_genes.1.rdf
²⁰ http://www.marconeumann.org/foaf.rdf

Publisher Recommendations: Again, all such errors could easily be fixed by the publishers once they are made aware. Many of the above encountered errors were as a result of misuse of RDF syntactic terms, such as rdf:parseType="Resource", or more generally as syntactic mistakes in their documents: thus, publishers should not only ensure that their documents are syntactically valid, but also that they parse into the triples expected.

Consumer Recommendations: Applications which incorporate reasoning should consider foregoing standard inferences which rely on the position of a term in a triple to infer that that term is a class or property - for example, rule rdf1 in RDFS [10]. Aside from this, consumer applications will probably have to accept incomplete inferencing over such erroneous triples.

2.3.4 Misuse of owl:DatatypeProperty/owl:ObjectProperty
Category: incoherent
The built-in term owl:DatatypeProperty describes properties which relate some resource to a literal value, i.e., an "attribute" property (in terms of Object-Oriented Programming); similarly, the OWL term owl:ObjectProperty describes properties which relate one resource to another (i.e., a "relation" property). Oftentimes, attribute properties are used between two resources, and relation properties are used with literal values.

From our crawl, we found a total of 34.8k triples (0.3%) with datatype-properties given non-literal objects (in 1,194 [2.2%] documents across 9 domains). Table 5 lists the top five; the only significant errors stem from l3s.de²¹ which exports RDF from the Digital Bibliography & Library Project (DBLP) - they define two datatype-properties in the swrc: namespace but only use the properties with non-literal objects.

D.type Prop.          # Non-literal    % Non-literal
swrc:journal          19,853           97.8%
swrc:series           14,963           97.3%
ical:location         4                2.6%
foaf:name             4                0%
foaf:msnChatID        3                0.4%
Table 5: Top five datatype-properties used with non-literal values

Analogously, there were 41.7k triples (0.3%) with object-properties given literal values (in 4,438 [8%] documents from 91 domains). Table 6 lists the top five; many such occurrences come from the affymetrix.com domain which commonly uses five different object-properties with literal values (in a total of 27.4k triples from our crawl). However, there were many other such properties with significant misuse including miscellaneous properties from the opencyc.org domain (6,161), foaf:page (3,160), foaf:based_near (1,078), ical:organizer (456), amongst others; again, the errors were spread over 92 different domains. In fact, the property foaf:myersBriggs (in the popularly used FOAF specification itself) was until recently incorrectly defined as an owl:ObjectProperty with rdfs:range rdfs:Literal and had 35 literal values in our dataset.

Obj. Prop.            # Literal    % Literal
affy:startsAt         6,234        100%
affy:stopsAt          6,234        100%
affy:cdsType          5,193        100%
affy:frame            4,882        100%
affy:commonToAll      4,814        100%
Table 6: Top five object-properties used with literal values

²¹ cf. http://dblp.l3s.de/d2r/data/publications/conf/aswc/HoganHP08

Publisher Recommendations: Where datatype- or object-property constraints are erroneously specified - e.g., swrc:journal, swrc:series, foaf:myersBriggs - they can simply be reversed by the ontology maintainers. However, in many cases such constraints are purposefully defined to ensure consistent usage of the term; in this case, the onus is on publishers to thereby abide.
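The per-property counts and percentages reported for Section 2.3.4 could be derived along the following lines. This is a minimal sketch under stated assumptions rather than the analysis code behind the tables: the Literal class is a hypothetical stand-in for an RDF literal (plain strings denote URIs), the function name property_misuse is invented here, the sets of datatype- and object-properties are assumed to have been extracted from the vocabularies beforehand, and the example triples are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal; plain strings denote URIs."""
    value: str


def property_misuse(triples, datatype_props, object_props):
    """Return {property: (misused, total, percent_misused)} for declared properties."""
    totals = defaultdict(int)
    misused = defaultdict(int)
    for _, p, o in triples:
        is_literal = isinstance(o, Literal)
        if p in datatype_props:
            totals[p] += 1
            misused[p] += int(not is_literal)  # datatype-property with a resource
        elif p in object_props:
            totals[p] += 1
            misused[p] += int(is_literal)      # object-property with a literal
    return {p: (misused[p], totals[p], 100.0 * misused[p] / totals[p]) for p in totals}


# Made-up example triples; the declarations below are assumptions for illustration.
example = [
    ("ex:paper1", "swrc:journal", "ex:journalA"),
    ("ex:paper2", "swrc:journal", "ex:journalB"),
    ("ex:paper3", "swrc:journal", Literal("Journal C")),
    ("ex:gene1", "affy:startsAt", Literal("1042")),
    ("ex:Alice", "foaf:knows", "ex:Bob"),
]

report = property_misuse(
    example,
    datatype_props={"swrc:journal"},
    object_props={"affy:startsAt", "foaf:knows"},
)
for prop, (bad, total, pct) in sorted(report.items()):
    print(f"{prop}: {bad}/{total} misused ({pct:.1f}%)")
```

Keeping both the absolute count and the percentage mirrors the layout of the tables above: a property with a handful of misuses among thousands of correct triples calls for a different response than one whose usage is almost entirely contrary to its declaration.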
Consumer Recommendations: Applications would typically use such constraints for form generation in the context of instance data creation. Liberal versions of such applications may decide to automatically reverse such constraints where (in examples such as the affy: properties above) all usage is contrary to the specified constraint. Indeed, some weighting scheme may be adopted for examples (such as the swrc: properties) where most usage is contrary to the vocabulary constraint. Again, such approaches would admittedly need evaluation in the setting of the given application.

2.3.5 Members of deprecated classes/properties
Category: incoherent

Briefly, the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty are used to indicate classes or properties that are no longer recommended for use: vocabulary publishers usually assert deprecation for classes or properties which have been considered to be obsoleted by more popular terms in local or remote vocabularies, or perhaps even where the original term is contrary to some naming scheme or considered outside of the scope of the vocabulary. In our dataset, we did not find any members of a deprecated class; however, we found 290 instances (in 115 documents) of four deprecated properties: wordmap:subCategory (260), sioc:has_group (15), sioc:content_encoded (10) and sioc:description (5).

Publisher Recommendations: Publishers of instance data should intermittently verify that no terms used have since been considered deprecated by the vocabulary maintainer, and should take appropriate action to use (possibly novel) recommended terms where possible.

Consumer Recommendations: Applications could consider specifying manual mappings from deprecated terms to compatible terms now recommended for use. Less liberal applications may consider omitting triples which use deprecated terms. Generally, however, usage of deprecated terms does not require special treatment.

2.3.6 Bogus owl:InverseFunctionalProperty values
Category: incoherent/hijack

Aside from URIs (which can be hard to agree upon), resources are also commonly identified by values for properties which uniquely identify a resource; such keys are pre-existing and easier to agree upon. These properties are termed "inverse-functional" and are identified in OWL with the term owl:InverseFunctionalProperty. If two resources share a common value for one of these properties, reasoning will view these resources as equivalent (referring to the same resource). For example, the FOAF ontology has defined a number of inverse-functional properties for identifying people; these include foaf:homepage, foaf:mbox (email), foaf:mbox_sha1sum (SHA1-hashed email, to prevent spamming), amongst others. Herein, FOAF holds the intuition that the values for such properties should be unique to an individual, and that the usage of such properties should reflect that (i.e., foaf:mbox should only be used for personal and unshared email-addresses).

However, FOAF exporters commonly do not respect the semantics of these inverse-functional properties and export 'void' values given partial user-input. The most widespread example is 08445a31a78661b5c746feff39a9db6e4e2cc5cf, which is the SHA1 hash of 'mailto:' and is commonly assigned by FOAF exporters, as the value for foaf:mbox_sha1sum, to users who don't specify an email in some input form.²² Now, all such users can be interpreted as equivalent (i.e., representing the same real-world person) according to the semantics of the foaf:mbox_sha1sum property. This problem is quite widespread: even in our diminutive crawl, 52 hosts contribute 1,169 different bogus values in 1,041 documents. For example, 194 errors come from the bleeper.de domain²³, 189 from identi.ca²⁴, 166 from uni-karlsruhe.de²⁵, 163 from twit.tv²⁶ and 92 from tweet.ie²⁷; Table 7 details the top five void values for inverse-functional properties which we found in our dataset.

According to the standard reflexive, symmetric and transitive semantics of equality (represented in RDF by the equality relation owl:sameAs), if we take for example the 986 entries with the same null sha1 value, 986 × 986 ≈ 972k owl:sameAs relations would be inferred. Further, assuming, for example, an average of eight triples mentioning each equivalent resource, 972k × 8 ≈ 7.8M statements would be inferred by substituting each equivalent identifier into each statement. In other words, such chains of equality cause a quadratic explosion of inferences; when one considers larger Web-crawls, the problem becomes quite critical.

Publisher Recommendations: For publishers, the issue is easily resolved by, for example, validating user input and checking the uniqueness and validity of inverse-functional values. Conversely, vocabulary maintainers should be careful to clearly state that a property is inverse-functional in the human-readable specification, and select labels for property URIs which give an indication of the inverse-functional nature of the property; for example, choose the label ex:personalMbox over ex:mbox.

Consumer Recommendations: A simple solution commonly used by reasoning agents is to blacklist void values. Although an exhaustive list of blacklist candidates may be difficult to derive, the above values would, in our experience, constitute most of the void values. Other heuristics may be employed to ensure correct equality reasoning; for example, use of a disambiguation step to quickly remove obviously incorrect equality inferences.

2.3.7 Malformed datatype literals
Category: incoherent

In RDF, a subset of well-defined XML datatypes are used to provide structure and semantics to literal (string) values. For example, string date values can be specified using the xsd:date datatype, which provides a lexical syntax for date strings and a mapping from date strings to date values interpretable by an application. From the content of the crawl, we found 3,666,840 literals of which 170,351 (4.6%) used a datatype. Of these, the top five most popular datatypes were xsd:string (53,879), xsd:nonNegativeInteger (38,501), xsd:integer (15,826), xsd:dateTime (15,824), and xsd:unsignedLong (12,318).

Unfortunately, incorrect use of datatypes is relatively common in the Web of Data. Firstly, datatype literals can be malformed: i.e., ill-typed literals which do not abide by the lexical syntax for their respective datatype. There were 4,650 malformed datatype literals (2.7% of all typed literals) in our crawl: Table 8 summarises the top five datatypes to be instantiated with malformed values.

The two most common errors for xsd:dateTime stem from

22 In fact, at the time of writing, a Google search for this SHA1 string will result in nearly two million hits, seemingly almost all of which are FOAF RDF documents.
23 cf. http://bleeper.de/powerboy/foaf
24 cf. http://identi.ca/whataboutbob/foaf
25 cf. http://www.aifb.uni-karlsruhe.de/Personen/viewPersonFOAF/foaf_1876.rdf
26 cf. http://army.twit.tv/takeit2/foaf
27 cf. http://tweet.ie/seank/foaf
Inverse-Functional Property   Void Value                                    Count
foaf:mbox_sha1sum             "08445a31a78661b5c746feff39a9db6e4e2cc5cf"    986
foaf:mbox_sha1sum             "da39a3ee5e6b4b0d3255bfef95601890afd80709"    167
foaf:homepage                                                               11
foaf:mbox_sha1sum             ""                                            5
foaf:isPrimaryTopicOf                                                       2
Table 7: Count of the five most common void inverse-functional property values
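The blacklisting approach and the inference count for void values can be sketched as follows. This is a minimal illustration, not the method used for the crawl analysis: the tuple-based triple representation, the function names and the contents of VOID_VALUES are assumptions made for this example, and the blacklist is deliberately not exhaustive.

```python
import hashlib
from collections import defaultdict

# Assumed, non-exhaustive blacklist: the SHA1 hash of a bare "mailto:" URI,
# the SHA1 hash of the empty string, and the empty literal itself.
VOID_VALUES = {
    hashlib.sha1(b"mailto:").hexdigest(),
    hashlib.sha1(b"").hexdigest(),
    "",
}


def group_by_ifp_value(triples, ifp, blacklist=VOID_VALUES):
    """Group subjects that share a value for the inverse-functional
    property `ifp`, ignoring blacklisted void values.

    `triples` is an iterable of (subject, predicate, object) strings.
    Only groups with at least two subjects are returned, since a single
    subject yields no new equality.
    """
    groups = defaultdict(set)
    for s, p, o in triples:
        if p == ifp and o not in blacklist:
            groups[o].add(s)
    return {value: subs for value, subs in groups.items() if len(subs) > 1}


def same_as_pairs(n):
    """Number of ordered owl:sameAs pairs (reflexive ones included)
    among n resources that are all inferred to be equal."""
    return n * n


if __name__ == "__main__":
    n = 986  # subjects sharing the most common void value
    pairs = same_as_pairs(n)
    print(pairs, pairs * 8)  # pairs, and statements at eight triples each
```

Filtering before grouping means that the void values never reach the equality reasoner, so the quadratic blow-up is avoided rather than repaired afterwards.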
Datatype                  # Malformed   % Malformed
xsd:dateTime              4,042         26.4%
xsd:int                   250           2.1%
xsd:nonNegativeInteger    232           0.6%
xsd:gYearMonth            67            100%
xsd:gYear                 27            1.4%
Table 8: Top five datatypes having malformed values and percentage of all values which are malformed

Datatype Property         # Clashes     % Clashes
sl:creationDate           9,212         100%
scot:ownAFrequency        529           100%
owl:cardinality           464           65.2%
ical:description          262           21.8%
wn20schema:tagCount       204           100%
Table 9: Top five properties with datatype-clashes and percentage of all values which cause clashes
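A check for the lexical syntax of xsd:dateTime, covering two common error patterns (a time-zone offset without the ':' separator, and a missing seconds field), might look like the sketch below. The function name and the diagnostic strings are hypothetical, and the patterns test syntax only: they do not reject out-of-range values such as month 13.

```python
import re

_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TZ = r"(Z|[+-]\d{2}:\d{2})"

# Well-formed lexical shape: date, 'T', hh:mm:ss, optional fraction, optional zone.
DATETIME = re.compile(_DATE + r"T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ + r"?")
# Same shape, but the zone offset is written without ':' (e.g. +0200).
NO_TZ_COLON = re.compile(_DATE + r"T\d{2}:\d{2}:\d{2}(\.\d+)?[+-]\d{4}")
# Same shape, but the mandatory seconds field is absent.
NO_SECONDS = re.compile(_DATE + r"T\d{2}:\d{2}" + _TZ + r"?")


def diagnose_datetime(lexical):
    """Classify a candidate xsd:dateTime lexical form (syntax only)."""
    if DATETIME.fullmatch(lexical):
        return "ok"
    if NO_TZ_COLON.fullmatch(lexical):
        return "time-zone lacks ':' separator"
    if NO_SECONDS.fullmatch(lexical):
        return "seconds field missing"
    return "malformed"


if __name__ == "__main__":
    for value in ["2009-05-17T10:00:00Z", "2004-08-01T12:30:00+0200",
                  "2006-03-01T09:15", "True"]:
        print(value, "->", diagnose_datetime(value))
```

A publisher could run such a check before export; a liberal consumer could use the diagnosis to decide whether a literal is repairable or should be treated as a plain literal.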
(i) the wasab.dk domain²⁸ whereby time-zones are missing the required ':' separator; and (ii) the soton.ac.uk domain²⁹ wherein the mandatory seconds-field is not specified. For xsd:int, almost all errors stem from the freebase.com domain whereby boolean values True and False are found³⁰. For xsd:nonNegativeInteger, all stem from the deri.ie domain³¹ where non-numeric strings are incorrectly used. Finally, for xsd:gYearMonth and xsd:gYear, all illegal usage comes from the dbpedia.org domain³² where full xsd:dateTime literals are used instead.

Publisher Recommendations: Clearly, malformed literals are quite common. In all examples, the errors can be resolved by simple syntactic fixes to the publishing framework, or by removing or changing the datatype on the literal; one can conclude, especially in the absence of a popular validator for datatype syntax, that publishers are simply not aware of such issues.

Consumer Recommendations: Although datatype-aware agents could incorporate heuristics to shoulder common mistakes (e.g., publishers commonly omit the mandatory seconds field from date-time literals), not all such mistakes can feasibly be accounted for. Again, and in cases where the issue next discussed does not apply, such literals can simply be interpreted as plain literals.

2.3.8 Literals incompatible with datatype range
Category: incoherent/inconsistent

Aside from explicitly typed literals, the range of properties may also be constrained to be a certain datatype, mandating respectively typed values for that property; e.g., one can say that the attribute property ex:bornOnDate has xsd:date values. A datatype clash can then occur if the property is given a value (i) that is malformed, or (ii) that is a member of an incompatible datatype. Table 9 provides counts of datatype clashes for the top five such properties.

The property sl:creationDate has the range xsd:date but all triples with sl:creationDate in the predicate position have plain-literal objects; all such usage originates from the semanlink.net tagging system³³. Please note that plain literals without language tags are considered as xsd:strings [10] and so are disjoint with xsd:date. The property scot:ownAFrequency is given range xsd:float but only ever used in the domain linkeddata.org³⁴ with xsd:integer objects; xsd:integer is a sub-type of xsd:decimal and is disjoint with xsd:float [4]. owl:cardinality is often used with plain-literal objects³⁵ contrary to the defined range xsd:nonNegativeInteger. The property ical:description, defined as having range xsd:string, is almost always instantiated with a plain-literal object (99.8%); however, only the 21.8% which use language tags constitute an inconsistency³⁶. Finally, wn20schema:tagCount has range xsd:nonNegativeInteger but is only used with plain literals in the w3.org domain³⁷.

Publisher Recommendations: In all such cases, the root problem could be resolved if the vocabulary publisher removes the range on the property; in many cases such an approach may even be suitable: properties such as ical:description which are intended to have prose values should remove xsd:string constraints (optionally setting the range as the more inclusive rdf:PlainLiteral datatype to encourage literal values) and thus allow use of language tags. However, the majority of such datatype domain constraints are validly used to restrict possible values for the property and the onus is on data-publishers to abide by them.

Consumer Recommendations: Again, liberal agents could consider changing the defined range of the property to reflect some notion of "common" usage. Also, although the usage of properties often does not reflect the defined datatype range, in our dataset we found that the literal strings were almost always within the lexical space of the range datatype and that they were just poorly typed. We only found two properties which were given objects malformed according to the range datatype (before, we were concerned with malformed literals given an explicit datatype): viz. exif:exposureTime with range xsd:decimal (given 49 plain literals with malformed decimal values in one document³⁸) and cfp:deadline with range xsd:dateTime (given 3 plain literals with malformed date-time values in 3 documents³⁹). Thus, in all but the latter cases, liberal software agents could ignore mismatches

28 cf. http://www.wasab.dk/morten/2004/08/photos/1/index.rdf
29 cf. http://rdf.ecs.soton.ac.uk/publication/10006
30 cf. http://rdf.freebase.com/rdf/aviation/aircraft_ownership_count
31 cf. http://www.deri.ie/fileadmin/scripts/foaf.php?id=320
32 cf. http://dbpedia.org/data/1994_San_Marino_Grand_Prix.xml
33 cf. http://www.semanlink.net/tag/rdf.rdf
34 cf. http://community.linkeddata.org/dataspace/kidehen2/subscriptions/Kingsley_Feed_Collection/tag/rdf
35 425 of 464 such examples stem from http://bioinfo.icapture.ubc.ca/subversion/Cartik/Object-OWLDL2.owl
36 cf. http://www.ivan-herman.net/professional/CV/W3CTalks.rdf
37 cf. http://www.w3.org/2006/03/wn/wn20/instances/wordsense-act-verb-3.rdf
38 http://kasei.us/pictures/2005/20050422-WCCS_Dinner/index.rdf
39 cf. http://sw.deri.org/2005/08/conf/ssws2006.rdf (an example of errors admittedly generated by an author of this paper).
Disjoint Classes                   # Instances
foaf:Agent ⊓ foaf:Document         502
foaf:Organization ⊓ foaf:Person    328
foaf:Document ⊓ foaf:Person        232
sioc:Container ⊓ sioc:Item         194
sioc:Item ⊓ sioc:User              35
Table 10: Top five instantiated pairs of disjoint classes

between an object's datatype and that specified by the property's range, parsing the literal string into the value space of the range datatype; however, caution is required when considering non-standard datatypes: consider if a property ex:temp has the datatype ex:celcius as range and is used with an ex:fahrenheit value; clearly the value should not be parsed as ex:celcius although it is in its lexical space.

2.3.9 OWL inconsistencies
Category: inconsistent

The Web Ontology Language (OWL) includes features (such as defining disjoint classes, inequality between resources, etc.) which can additionally be used to check if some data agrees with the underlying ontology; i.e., that the data is consistent.

To begin with, we quickly mention inconsistency checks which we performed, but which did not detect anything in the crawl. Firstly, the class owl:Nothing is intended to represent the empty class, and, as such, should not contain any members; in our dataset, we found no directly asserted members of owl:Nothing. Also, an inconsistency can occur when owl:sameAs and owl:differentFrom overlap; again, however, we found no such examples in our crawl; in fact, we found no usage of owl:differentFrom in the predicate position of a triple. Similarly, although we found two instances of owl:AllDifferent/owl:distinctMembers usage, none resulted in an inconsistency. Continuing, we also performed similar checks for instances of classes which were defined as complements of each other using owl:complementOf; however, again we found no owl:complementOf relations in our dataset. Briefly, we also performed simple checks for unsatisfiable concepts whereby, for example, one class is (possibly indirectly) both a subclass-of and disjoint-with another class: for each class found, we performed reasoning on an arbitrary membership of that class and checked whether any of the inferred memberships were of disjoint classes; however, we found no such concepts on the Web.

In fact, all inconsistencies we found in our crawl were related to memberships of disjoint classes. The OWL property owl:disjointWith is used to relate classes which cannot share members; disjoint classes are used in popular Web ontologies as an indicator of inconsistent information. For example, in FOAF the classes foaf:Person and foaf:Document are defined as being disjoint: something cannot be both. Resources can be asserted to be members of disjoint classes either directly by document owners, or inferred through reasoning. We only detected a small number of such direct assertions in our crawl; generally, a resource is asserted to be a member of one class in one document and a disjoint class in a remote document.⁴⁰ However, after reasoning on our dataset, there were 1,329 occurrences of inconsistencies caused by disjoint classes; Table 10 enumerates the top five.

The most prominent cause of such problems stems from two incompatible FOAF exporters for LastFM data: the same resources are simultaneously defined as being of type foaf:Person in the opiumfield.com domain⁴¹ and inferred to be members of foaf:Document in the dbtune.org domain⁴². Again, there are many other exporters and domains which contribute; for example, an exporter of Wikipedia data in the sioc-project.org domain⁴³ uses the same URI to identify users and the users' Wikipedia profile page.

Publisher Recommendations: Such problems with inconsistent data, especially those arising from multiple sources, may be quite difficult to solve. The obvious and lazy solution is to remove the disjointness constraints from the relevant ontologies; however, these constraints are intended to flag nonsensical or conflicting information and removing them clearly does not solve the root cause. Currently, the main observed cause for such inconsistencies is the use of incompatible naming schemes (using URIs to identify two completely different things), most often across different domains; agreement must be reached on what is an appropriate identifier for the contentious resource.

Consumer Recommendations: There are two standard approaches for handling inconsistencies in agents incorporating reasoning: resolve or overlook; the former approach (which requires 'defeating' the 'marginal view') may not be so in tune with the open philosophy of the Web, where contradiction could be considered a 'healthy' symptom of differing opinions. Rule-based reasoning approaches have the luxury of optionally overlooking inconsistencies, where inconsistent data can simply be flagged (e.g., see OWL 2 RL rules in [7] with false consequences). However, tableaux algorithms are less resistant to inconsistencies and are tied by the principle of explosion: ex contradictione quodlibet (from contradiction follows anything); some works focus on paraconsistent reasoning (tableaux reasoning tolerant to inconsistency), although such approaches are expensive in practice (cf. [14]). In any case, in either rule- or tableaux-based approaches, and depending on the application scenario, inconsistent data may be pre-processed with those triples causing inconsistencies dropped according to some heuristic measures.

2.4 Non-authoritative contributions

2.4.1 Ontology-hijacking
Category: incoherent/hijack

In previous work, we encountered a behaviour which we termed "ontology hijacking" [12]: the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected. Herein, and loosely, we define the notion of an authoritative document for a term as the document resolved by dereferencing the term, and consider all other (non-authoritative) documents as third-party documents (please see [12] for a more exhaustive discussion). Web ontologies/vocabularies published according to best-practices are thereby the only documents authoritative for the terms in their namespace.

In our dataset, we found that 5,211 documents engaged in some form of ontology hijacking; most such occurrences were due to third party sources 'echoing' the authoritative definition of a class or property in their local ontology. However, we also encountered examples of third-parties redefining classes/properties. As an example, we found one document which redefines the core property rdf:type (defining nine of its properties as being the domain of rdf:type), effectively

40 http://apassant.net/blog/2009/05/17/inconsistencies-lod-cloud
41 cf. http://rdf.opiumfield.com/lastfm/profile/danbri
42 cf. http://dbtune.org/last-fm/danbri.rdf
43 cf. http://ws.sioc-project.org/mediawiki/mediawiki.php?wiki=http://en.wikipedia.org/wiki/User:Andy_Dingley
leading to every entity described on the Web being inferred as a member of those nine properties.⁴⁴ Again, for example, we found 219 statements declaring foaf:Image (authoritatively defined as a class) to be a property; these were from the sembase.at domain (again see Footnote 15).

Publisher Recommendations: This particular issue focuses on how vocabulary publishers re-use existing vocabularies: we would thus particularly encourage vocabularies to extend external terms, and not redefine them. Such usage is more generally related to the principle of modularity, encouraging the modular design of Web vocabularies and avoiding the mess implied by the cross-definition of terms over the Web.

Consumer Recommendations: Clearly, on the Web, people should not be constrained in what they express and where they express it; however, to do useful reasoning, developers must take contextual information into account and provide some means of insulating ontologies from wayward external contributions. Again, in previous work we have described our system for performing reasoning over RDF Web data called SAOR [12], and found it essential to introduce our notion of authority when doing reasoning: in particular, we define our notion of an "authoritative rule application" which will not produce inferences from non-authoritative triples which redefine external terms. An orthogonal approach to the same problem is that of "quarantined reasoning" described in [5], which loosely constitutes "per-document" reasoning, and scopes inferences based on a closed notion of context derived from the implicit and explicit imports of each input document, thus excluding third-party contributions (please see [12] for a more in-depth comparison).

3. RELATED WORK

Earlier papers analysing problems in RDF Web data and the uptake of standards mainly focus on the categorisation and validation of documents with respect to the various OWL species. In [1], the authors performed validation, based on OWL-DL constraints, for a sample group of 201 OWL ontologies which were all found to be OWL Full for mainly trivial reasons; the authors then suggested means of patching the ontologies to be OWL-DL conformant. A similar but more extensive survey was conducted in [19] over 1,275 ontologies; the authors provided categorisation of the expressivity and species and discussion related to patching of the ontologies. At the moment, we do not offer species validation for RDFS/OWL and our scope is much broader with respect to validation.

In [15], the authors describe common user errors in modeling OWL-DL ontologies. In [17], the authors describe some error checking for OWL ontologies using integrity constraints involving the Unique Name Assumption (UNA) and also the Closed World Assumption (CWA). Similarly, in [18], various errors and constraints are introduced for error checking; the primary contribution is the introduction of five 'incongruencies' (e.g., an individual not satisfying a cardinality constraint according to UNA/CWA) with cases, causes and methods of detection. However, all of these papers have a decidedly more OWL-centric focus than our work and provide no analysis or discussion of Web data.

In [6], the authors provided an in-depth analysis of the landscape of RDF Web data in a crawl of 300M triples. They also identified some statistics about classes and properties (SWTs) in RDF data; e.g., they found that 2.2% of classes and properties had no definition and that 0.08% of terms had both class and property meta-usage. However, again our focus is much broader in characterising errors in RDF Web data.

4. WHAT ABOUT ALICE?

We can now see that although our protagonist Alice is purely hypothetical, her adventures in Linked Data wonderland are disappointingly less so; in our analysis, we have shown the types of issues in RDF data on the Web that have made her journey so disconcerting. We have presented, provided statistics and examples for, and discussed a plethora of different types of errors, hopefully raising awareness of such issues amongst data publishers and developers of agents who wish to access and interpret such data. As typified by Alice, such issues can dramatically lower the quality of some applications, and consequently their end-user appeal; the errors do not come from the engine, but from the underlying data, and thus reasonable efforts to resolve data issues are as important as developing tolerant applications.

So, how can we help Alice?

We have already determined that many such issues are easily resolvable by the publisher and therefore concluded that publishers are unaware of the problems resident in their data. One solution would be to provide a system for validating RDF data being published to the Web: several systems exist but do not cover the broad range of issues discussed in this paper. From a syntactic point of view, the first validator available was the W3C RDF Validator⁴⁵, being able to check the syntax of any RDF/XML document (however, not datatype syntax). The DAML validator⁴⁶ provides checking of a large number of issues; however, the validator is out of date (does not support OWL), and, at the time of writing, does not work. With regards to the protocol issues, the online Vapour validator⁴⁷ [3] aims at validating the compliance of published RDF data (either vocabularies or instance data) according to Linked Data principles [2]. The online Pellet [16] validator⁴⁸ enables species validation as well as other criteria we identified such as checking ontology consistency and finding unsatisfiable concepts.

There are also a number of command-line validators. The Validating RDF Parser (VRP)⁴⁹ operates on specified RDF Schema constraints, with some support for datatypes. The Eyeball⁵⁰ project provides command-line validation of RDF data for common problems including use of undefined properties and classes, poorly formed namespaces, problematic prefixes, literal syntax validation and other optional heuristics.

However, none of the above validators cover the plethora of issues we have encountered; thus, we have developed and now provide RDF:Alerts: http://swse.deri.org/RDFAlerts/. Given a URI, the system provides validation for many of the issues enumerated in this paper; Figure 2 shows a screenshot of feedback for an erroneous document. We further intend to extend the tool (to include all of the presented issues and suggestions from the community) and to improve usability; we may also consider extending such a tool to provide intermittent automatic reporting to publishers who opt in, depending on the perceived demand for such a service.

Still, other issues (particularly relating to inter-dataset incompatibility, naming, and inconsistent use of vocabulary terms) may be more difficult to resolve. Indeed, we have

44 http://www.eiao.net/rdf/1.0
45 http://www.w3.org/RDF/Validator/
46 http://www.daml.org/validator/
47 http://validator.linkeddata.org
48 http://www.mindswap.org/2003/pellet/demo.shtml
49 http://139.91.183.30:9090/RDF/
50 http://jena.sourceforge.net/Eyeball/
Figure 2: Screenshot of validation results from RDF:Alerts system.

also not properly discussed issues introduced by versioning, where, for example, a vocabulary maintainer makes changes to the definition of a term breaking backwards-compatibility with legacy usage of that term; indeed, we recognise that casual versioning may explain some of the discrepancies we have encountered in this paper, though systematic detection of such errors is difficult given our static snapshot dataset.

The resolution of such errors may sometimes require compromise between maintainers of ontologies and maintainers of exporters which populate the ontologies' terms, reflecting the current social and community driven nature of Web publishing. Reflecting such community driven efforts, consideration is being given to more open ontology editing and creation. In VoCamp events⁵¹, people from different backgrounds and with different perspectives meet to work on modelling lightweight ontologies for immediate use. In order to allow ontologies to evolve according to user needs, initiatives such as semantic wikis for ontology management [13] and services such as OpenVocab⁵² allow users to more freely interact with the ontology terms they wish to use and share. Although such approaches may again suffer from human error and disagreement, and have many open issues such as versioning and editing privileges, such community-driven efforts could lead to a more extensive vocabulary of terms for use on the Web.

We have also initiated a community driven effort which we call "The Pedantic Web Group"⁵³, which aims to engage with publishers and help them improve the quality of their data. Firstly, we have provided some pragmatic educational material for publishers, including a list of validation tools and of frequently observed problems in RDF publishing. Secondly, we have created a mailing list for actively contacting publishers about their mistakes and for various discussions on the quality of the Web of Data; subscription is open to the community. Indeed, such efforts may be the only means to resolve issues which require the co-ordination of multiple publishers. As such, we see the Pedantic Web Group as a go-to point for tackling publishing-related issues on the Web of Data, and as a community-driven means of promoting better quality publishing for the Web of Data.

To finally conclude, we would like to replace the present hypothetical Alice with a possible future Alice who is again browsing the Web of Data, this time using an application which has been tempered for noisy data, where the documents have been validated, consistent identifiers used, and resources described using a rich vocabulary of community-endorsed terms. We hope that such an Alice might be amazed, this time for the right reasons.

51 http://vocamp.org/wiki/Main_Page
52 http://open.vocab.org/
53 http://pedantic-web.org/

5. REFERENCES

[1] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 668-682. Springer, November 2004.
[2] T. Berners-Lee. Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. http://www.w3.org/DesignIssues/LinkedData.html.
[3] D. Berrueta, S. Fernández, and I. Frade. Cooking HTTP content negotiation with Vapour. In Proceedings of 4th Workshop on Scripting for the Semantic Web (SFSW2008), June 2008.
[4] P. V. Biron and A. Malhotra. XML Schema part 2: Datatypes second edition. W3C Recommendation, Oct. 2004. http://www.w3.org/TR/xmlschema-2/.
[5] R. Delbru, A. Polleres, G. Tummarello, and S. Decker. Context dependent reasoning for semantic documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), Karlsruhe, Germany, Oct. 2008.
[6] L. Ding and T. Finin. Characterizing the Semantic Web on the Web. In Proceedings of the 5th International Semantic Web Conference, November 2006.
[7] B. C. Grau, B. Motik, Z. Wu, A. Fokoue, and C. Lutz. OWL 2 Web Ontology Language: Profiles. W3C Working Draft, Apr. 2008. http://www.w3.org/TR/owl2-profiles/.
[8] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In 5th International Semantic Web Conference, pages 258-271, 2006.
[9] M. Hausenblas. Exploiting linked data to build applications. IEEE Internet Computing, 13(4):68-73, 2009.
[10] P. Hayes. RDF semantics. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/rdf-mt/.
[11] T. Heath. How will we interact with the web of data? IEEE Internet Computing, 12(5):88-91, 2008.
[12] A. Hogan, A. Harth, and A. Polleres. Scalable Authoritative OWL Reasoning for the Web. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
[13] M. Krötzsch, S. Schaffert, and D. Vrandečić. Reasoning in semantic wikis. In Reasoning Web, pages 310-329, 2007.
[14] Y. Ma, P. Hitzler, and Z. Lin. Algorithms for Paraconsistent Reasoning with OWL. In ESWC, pages 399-413, 2007.
[15] A. L. Rector, N. Drummond, M. Horridge, J. Rogers, H. Knublauch, R. Stevens, H. Wang, and C. Wroe. OWL pizzas: Practical experience of teaching OWL-DL: Common errors & common patterns. In EKAW, pages 63-81, 2004.
[16] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5(2):51-53, 2007.
[17] E. Sirin, M. Smith, and E. Wallace. Opening, closing worlds - on integrity constraints. In OWLED, 2008.
[18] J. Tao, L. Ding, and D. L. McGuinness. Instance data evaluation for semantic web-based knowledge management systems. In HICSS, pages 1-10, 2009.
[19] T. D. Wang, B. Parsia, and J. A. Hendler. A survey of the web ontology landscape. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), pages 682-694, Athens, GA, USA, Nov. 2006.