Weaving the Pedantic Web

Aidan Hogan†, Andreas Harth†‡, Alexandre Passant†, Stefan Decker†, Axel Polleres†
† Digital Enterprise Research Institute, National University of Ireland, Galway
‡ AIFB, Karlsruhe Institute of Technology, Germany
†{firstname.lastname}@deri.org, ‡harth@kit.edu

ABSTRACT

Over a decade after RDF has been published as a W3C recommendation, publishing open and machine-readable content on the Web has recently received a lot more attention, including from corporate and governmental bodies; notably thanks to the Linked Open Data community, there now exists a rich vein of heterogeneous RDF data published on the Web (the so-called "Web of Data") accessible to all. However, RDF publishers are prone to making errors which compromise the effectiveness of applications leveraging the resulting data. In this paper, we discuss common errors in RDF publishing, their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.

1. INTRODUCTION

Based on the simple principle of using URIs to name and link things - not just documents - the Resource Description Framework (RDF) offers a standardised means of representing information on the Web such that: (i) structured data is available to all over the Web; (ii) data can be handled through standard APIs and applications; (iii) the meaning of the data is well-defined using lightweight ontologies (or vocabularies); and (iv) data is interoperable with other RDF on the Web and can be re-used and extended by other publishers and application developers.

Over the past few years, many Web publishers have turned to RDF as a means of disseminating information in an open and machine-interpretable way, resulting in a "Web of Data" which now includes interlinked content exported from corporate bodies (e.g., BBC, New York Times, Freebase), community efforts (e.g., Wikipedia, GeoNames), biomedical datasets (e.g., DrugBank, Linked Clinical Trials) - even UK governmental entities, where public sector organisations must now additionally disclose their consultations in RDF¹. Applications and search engines are now starting to exploit this rich vein of structured and linked data [9].

For example, Figure 1 shows the results returned by the VisiNav (http://visinav.deri.org/) system for the query "show me all American female models who have also won an Academy Award for Best Supporting Actress", constructed using facets in the user-interface; the results are automatically aggregated from thirteen distinct sources.

Figure 1: Results from VisiNav showing American models who have won Best Supporting Actress in the Academy Awards.

However, all has not been plain sailing: this new paradigm in Web publishing and interaction [11] has inevitably led to many teething problems. As we will discuss in this paper, there exists a lot of noise within the Web of Data which inhibits applications from effectively exploiting this rich lode of open, well-defined and structured information.

To illustrate, we introduce Alice: a hypothetical end-user of an application for searching and browsing the Web of Data. Alice loads some interesting data about herself and is immediately impressed by the integrated view of data from publication, blog, social network and workplace exporters; however, for every second resource she explores, the application cannot locate or parse any relevant data. She tries to load her publications into a calendar view, but one quarter of them are missing as the dates/times contain illegal values. She wants more information relating to properties and classes used to describe herself, but some do not exist; discouraged, she clicks on a friend of hers but finds that he has 1,169 names and email addresses (she knew him as "Bob"). She begins to notice that all resources she explores are instances of nine strange properties - and then the final straw: she now finds out that her professor is actually a document.

We will provide evidence in this paper as to how Alice could have had such an experience browsing the Web of Data. In so doing, we will take stock of some of the difficulties currently apparent in RDF publishing, and discuss how we - and the now decade old Semantic Web community at large - can help to improve the current and future quality of RDF data published on the Web.

Acknowledgements: We would like to acknowledge and thank Richard Cyganiak, Michael Hausenblas, Stéphane Corlosquet and Antoine Zimmermann with whom we co-founded the Pedantic Web Group. We would also like to thank anonymous reviewers of various incarnations of this paper for their valued feedback. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET Postgraduate scholarship.

¹ http://coi.gov.uk/guidance.php?page=315

Copyright is held by the author/owner(s). LDOW2010, April 27, 2010, Raleigh, USA.

2. WEB OF DATA ANALYSIS

We herein present some analysis based on an RDF dataset retrieved from the Web in April 2009 using MultiCrawler [8]. We performed a seven-hop breadth-first crawl for RDF/XML documents where we enforced a maximum of 5,000 crawled documents per pay-level-domain (or PLD, viz.: a domain that requires payment, such as deri.ie or data.gov.uk) so as to ensure a diverse Web dataset covering a wide spectrum of publishers.

Finally, we will also endeavour to provide discussion for each issue, both from the perspective of publishers and from the perspective of data consumers.

We begin with issues relating to how data is found and accessed; then discuss parsing and syntax issues; look at reasoning issues, including inconsistent data; and finally, introduce and discuss ontology hijacking.
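As a rough illustration of the crawl policy described in this section (a hop-limited breadth-first traversal with a cap on the number of documents per pay-level domain), the following Python sketch may help. It is not the MultiCrawler implementation: the `fetch_links` callback, the function names and the naive two-label PLD guess are all assumptions made for illustration only.

```python
from collections import deque
from urllib.parse import urlparse


def pay_level_domain(uri):
    """Naive PLD guess (assumption): keep the last two labels of the host name.

    A real crawler would consult a public-suffix list; this shortcut
    mis-handles multi-part suffixes such as "gov.uk".
    """
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def crawl(seeds, fetch_links, max_hops=7, max_per_pld=5000):
    """Hop-limited breadth-first crawl with a per-PLD document cap.

    fetch_links(uri) is a hypothetical callback returning the URIs
    mentioned in the document retrieved from uri.
    """
    seen = set(seeds)
    docs_per_pld = {}
    visited = []
    queue = deque((uri, 0) for uri in seeds)
    while queue:
        uri, hops = queue.popleft()
        pld = pay_level_domain(uri)
        if docs_per_pld.get(pld, 0) >= max_per_pld:
            continue  # cap reached for this domain: skip the document
        docs_per_pld[pld] = docs_per_pld.get(pld, 0) + 1
        visited.append(uri)
        if hops >= max_hops:
            continue  # do not follow links beyond the hop limit
        for link in fetch_links(uri):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return visited
```

The cap is checked when a document is taken from the queue rather than when its URI is discovered, so a large site can fill the queue without crowding smaller publishers out of the result. Passing a dictionary lookup as `fetch_links` is enough to try the function on a toy link graph without any network access.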
Indeed, we only crawled for RDF/XML and not for other formats such as RDFa; RDF/XML is currently by far the most popular format with RDFa growing in popularity. Still, one could expect a small percentage of documents to contain - e.g., RDFa metadata - which we admittedly overlook in our illustrative statistics.

The crawl accessed 149,057 URIs (including 39,439 redirects), 54,836 (36.8%) of which resulted in valid RDF/XML documents (almost precisely 50% excluding redirects). The final dataset contains 12,534,481 RDF statements mentioning 1,598,521 URIs - including 5,850 classes and 9,507 properties.

Based on this dataset, we present selected issues in RDF data published on the Web. We focus on errors that we can systematically detect, and thus one should not consider the following an exhaustive list; similarly, it is important to note - given the diminutive scale and perhaps even age of our dataset - that the statistics presented herein are intended to be illustrative, not exhaustive. That said, we still claim that the analysis of our dataset offers a valuable insight into current issues relating to RDF Web publishing: although extrapolating the exact prevalence of such problems to the entire Web of Data may not be sensible, our statistics should offer an indication as to the relative and approximate prevalence of such problems.

Throughout the paper, we endeavour to present examples of the various publishing errors by giving links to RDF Web documents exhibiting such. Note that the purpose of providing these examples is to: (i) give concrete and tangible examples of the errors, giving indications as to how they might have occurred, how they might be presently solved, and how they could be avoided in future; (ii) show that noise is present in a diverse range of sources, describing a diverse range of domains; and (iii) show that errors in RDF publishing are not only the result of inexperience - we show examples of errors in academic publishing, community-based publishing, popular vocabularies, and even documents published by the authors of this paper. The purpose of the examples is thus not to "point the finger", but to give an honest appraisal of such issues so as to identify possible directions forward.

For posterity, we provide snapshots of documents and enumerate the namespace prefixes referenced in this paper at http://aidanhogan.com/pedantic/.

In order to structure the highlighted issues, we identify four categories of symptoms:

• incomplete: equatable to a dead-link in the current HTML web - a software agent will not be able to retrieve data relevant to a particular task;

• incoherent: a software agent will not be able to correctly interpret some local piece of data as either the publisher or vocabulary maintainer would expect;

• hijack: a software agent will not be able to correctly interpret some remote piece of data as would be expected;

• inconsistent: a software agent will interpret a contradiction in the data.

2.1 URI/HTTP: accessibility and dereferencability

As previously alluded to, the Linked Open Data movement has been integral to RDF publishing on the Web, emphasising four basic principles [2]: (i) use URIs as names for things; (ii) use HTTP URIs so that those names can be looked up; (iii) provide useful information when a look-up on that URI is made; and (iv) include links using external URIs.

With regards to providing information about a resource upon a HTTP lookup of its URI - called dereferencing - emphasis is placed on providing information in RDF and disambiguating identification of information resources (document URIs) from non-information resources (entities described in those documents). Now, using statistics of our crawl which consisted of lookups on URIs in the data, we can draw some initial conclusions relating to Linked Data practices on the Web of Data.

2.1.1 Dereferencability issues

Category: incomplete

In accordance with "use HTTP URIs so that those names can be looked up", dereferencing a URI consists of retrieving content as defined by RFC3986².

Firstly, 5.3% of URIs returned an error (4xx client error/5xx server error) response code, in conflict with the third Linked Data principle above: "provide useful information when a look-up on that URI is made". In most such - admittedly relatively rare - cases simply nothing exists at that location and a 404 Not Found code is returned (4.3% overall, 81% of error codes).

Secondly, 26.5% of URI lookups resulted in a redirect (30x code). In fact, Linked Data principles encourage the use of redirects, particularly for identifying non-information resources (i.e. URIs which denote things rather than files): specifically, the 303 redirect is recommended. Of the redirection URIs, 55.1% (14.6% of total) offered a 303 redirect to another location as recommended; however, 30.2% (8% of total) used a 302 redirect and the remaining 14.7% (3.9% of total) used a 301 redirect.

In the machine-oriented world of Linked Data, publishers should be even more careful to avoid broken links and to make URIs dereferencable, thus enabling automatic data-access for Semantic Web applications and providing them - and ultimately end-users - a complete, coherent picture.

Publisher Recommendations: Publishers should carefully follow Linked Data best practices when "minting" URIs.

Consumer Recommendations: Applications should not expect high recall when dereferencing URIs found in RDF data: for high recall, applications may have to consider pre-fetching/data-warehousing approaches.

² http://labs.apache.org/webarch/uri/rfc/rfc3986.html

2.1.2 No structured data available

Category: incomplete

Excluding redirects, 92.8% of URIs return a 200 OK response code along with content; but what do these documents contain? Linked Data principles require that useful data be returned upon lookup of a URI from the Web of Data, with particular emphasis on returning RDF. Thus, from our crawl requesting application/rdf+xml content, we would reasonably expect a high percentage of documents returning RDF/XML³.

Of the 101,709 URIs which returned content with response code 200 OK, we observed that only 45.4% of URIs report a content-type application/rdf+xml, with a further 34.8% reporting text/html. Commonly in RDF data, information resource URIs are used to identify themselves (or more problematically to identify related resources); for example, in RDF, HTML documents are naturally identified using their native URI. In almost all instances of a non-RDF content-type, the URI is simply a document without any supporting RDF metadata. Hence, as before, Semantic Web agents will not be able to properly exploit the content as expected by end-users.

Publisher Recommendations: HTML pages - especially those whose URIs are mentioned in RDF documents - could be embedded with RDFa.

Consumer Recommendations: A possible - and admittedly quick and dirty - solution to avoid dead-links would be to convert the header information of HTTP URIs into RDF using the terms from the W3C published "HTTP Vocabulary in RDF 1.0"⁴. More ambitiously, a system may consider extracting RDF from non-RDF content, such as the title of a HTML page or metadata for images. Such measures would ensure that at the very least, some structured information can be retrieved for a wider variety of URIs, thus avoiding 'dead-links'.

2.1.3 Misreported content-types

Category: incomplete

A HTTP response contains an optional header field stating the content type of the returned file. A consumer application can then decide from the header whether the content is suitable for consumption, and whether the content should be accessed. However, we observed that RDF/XML content is commonly returned with a reported content-type other than application/rdf+xml: from our crawl, 16.9% of valid RDF/XML documents were returned with an incompatible or more generic content type; e.g.: text/xml (9.5%), application/xml (5.9%), text/plain (1%) and text/html (0.4%).

Publisher Recommendations: Publishers should ensure that the most specific available MIME-type is reported for their content.

Consumer Recommendations: Herein, a trade-off exists for consumer agents: an agent with emphasis on performance may still use the reported content-type to filter non-supported content formats, whereas an agent with more emphasis on recall should relax - or possibly ignore - filtering based on reported content-type.

application/rdf+xml, only 571 (1.2%) were invalid RDF/XML documents; usually caused by simple errors such as unescaped special characters, misuse of RDF/XML shortcuts, and omission of namespace. Again, such issues are relatively rare, presumably due to use of mature RDF/XML APIs for producing data and the popularity of the W3C RDF/XML validation service⁵.

Publisher Recommendations: Publishers should use an appropriate syntactic validator for their content, or only use trusted APIs to produce content.

Consumer Recommendations: Applications could possibly investigate the use of tools for fixing syntax errors: e.g., use standard XML syntax cleaning tools for XML-based RDF syntax. We have no experience in using such tools, and they would have to be evaluated in the given application scenario: again in any case, syntax errors are admittedly rare and such concerns would only apply to applications with a large emphasis on high recall.

2.3 Reasoning: noise and inconsistency

Thus far, we have seen that about half of the URIs used to identify resources in the Web of Data resolve to some valid RDF/XML data. We now look at issues relating to the data contained within those documents: i.e., what they say and how the machine interprets the data.

Layered on top of RDF are the core RDF Schema (RDFS) and Web Ontology Language (OWL) standards, which allow for defining the semantics or meaning of RDF data through definitions of classes and properties in schemas/ontologies. For example, the Friend Of A Friend (FOAF)⁶ project publishes OWL definitions of a set of classes and properties which forms a structured and popular vocabulary for describing people in RDF.

Classes represent a grouping of resources: e.g., FOAF defines the class foaf:Person and one can assign ex:Alice and ex:Bob as members of this class. Using RDFS and OWL, a publisher can then define characteristics of such classes (and, thus, of all of its members); e.g., by defining foaf:Person as a subclass of foaf:Agent, FOAF implies that ex:Alice, ex:Bob and all other foaf:Persons are also members of the class foaf:Agent.

Properties represent the definable attributes of resources, and also relationships that are possible between resources; e.g., FOAF defines foaf:knows as a relationship that can exist from one member of foaf:Person to another, or that members of foaf:Person can have the attribute foaf:surname which has a string value. Other publishers across the Web can then re-use and extend definitions of classes and properties - such as the ones from FOAF.

Thereafter, reasoning can use the semantics of these classes and properties to interpret the data, and to infer new knowledge (e.g., that ex:Alice is also a foaf:Agent, or that if ex:Alice foaf:knows ex:Bob, then ex:Alice and ex:Bob are foaf:Persons).
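The inferences just described can be reproduced with a few lines of Python. The sketch below is only an illustration under stated assumptions: triples are plain tuples of prefixed names, only three RDFS rules (domain, range and subclass membership) are applied, and the domain/range axioms given for foaf:knows follow the informal description above rather than being read from the FOAF ontology file. It is not the reasoner used for the analysis in this paper.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply three RDFS rules until no new triple appears (naive fixpoint).

    Rules covered (an illustrative subset, not full RDFS entailment):
      (p rdfs:domain C), (s p o)           =>  (s rdf:type C)
      (p rdfs:range C),  (s p o)           =>  (o rdf:type C)
      (C rdfs:subClassOf D), (s rdf:type C) =>  (s rdf:type D)
    """
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 == DOMAIN and s2 == p:
                    new.add((s, RDF_TYPE, o2))
                if p2 == RANGE and s2 == p:
                    new.add((o, RDF_TYPE, o2))
                if p == RDF_TYPE and p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, RDF_TYPE, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Schema triples assumed for illustration, mirroring the prose above.
schema = {
    ("foaf:Person", SUBCLASS_OF, "foaf:Agent"),
    ("foaf:knows", DOMAIN, "foaf:Person"),
    ("foaf:knows", RANGE, "foaf:Person"),
}
data = {("ex:Alice", "foaf:knows", "ex:Bob")}
closure = rdfs_closure(schema | data)
```

Starting from the single statement that ex:Alice foaf:knows ex:Bob, the closure contains rdf:type statements placing both ex:Alice and ex:Bob in foaf:Person and in foaf:Agent. The nested loop is quadratic in the number of triples, which is acceptable for a toy example but not for Web-scale data.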
Some errors in RDF only reveal themselves after reasoning - e.g., some unforeseen incorrect inferences occur - and as such, can stay hidden from the publisher. In this section, we will look at issues relating to the interpretation of RDF data on the Web - in particular focussing on reasoning issues; in order to shed light on such issues, we applied reasoning over our crawl using the Scalable Authoritative OWL Reasoner (SAOR) [12], which we will discuss as pertinent.

2.2 Syntax errors

2.2.1 RDF/XML Syntax Errors

Category: incomplete

At the outset of the Semantic Web movement, publishers opted to employ the existing XML standard to encode RDF; RDF/XML is still the most popular means of publishing RDF today. Although its syntax is quite complex, we encountered relatively few syntax errors in RDF/XML documents accessed during our crawl. Of the 46,136 documents which return response code 200 and content-type

³
⁴ http://www.w3.org/TR/HTTP-in-RDF10/
⁵ http://www.w3.org/RDF/Validator/
⁶ http://foaf-project.org/

2.3.1 Atypical use of collections, containers and reification

Category: incoherent

There is a set of URI names which are reserved by the RDF specification for special interpretation in a set of triples; although the RDF specification does not formally restrict usage of these reserved names, misuse is often inadvertent.

We firstly discuss the RDF collection vocabulary, which consists of four constructs: {List, first, rest, nil}. Indeed, few examples of atypical collection usage exist on the Web, probably attributable to widespread usage of the RDF/XML shortcut rdf:parseType="Collection" for specifying collections; this shortcut shields users from the underlying complexity of collections on the triple level and generally ensures typical collection use. The only atypical collection usage we found in our Web-crawl was one document which specified resources of type List without first or rest properties attached⁷.

A related issue is that of atypical container usage, which is concerned with the following constructs: Alt, Bag, Seq, _1 ... _n and the syntactic keyword li. Again, atypical container usage is uncommon on the Web: we found one domain (viz. semanticweb.org⁸) which, in 229 documents, exports RDF containers without choosing a type of Alt, Bag or Seq.

Finally, there may exist atypical usage of the reification constructs: Statement, subject, predicate, object. However, in our dataset we only found one such example⁹ wherein predicate is assigned a blank node value and used alone without subject or object.

Publisher Recommendations: Where possible, publishers should abide by the standard usage of such RDF terms to enable interoperability.

Consumer Recommendations: Although we found that atypical usage of the core RDF terms is relatively uncommon, consumer applications should be tolerant of such atypical usage; for example, developers of reasoning engines which operate over Web data and consider RDF collections as part of complex OWL class descriptions - and even though we did not find such usage in our dataset - should implement simple checks to ensure that the respective engine is tolerant to cyclic, non-terminating and branching collection descriptions.

⁷ http://scripts.mit.edu/~kennylu/myself.rdf
⁸ cf. http://iswc2006.semanticweb.org/submissions/Harth2006dq_Harth_Andreas
⁹ http://web.mit.edu/dsheets/www/foaf.rdf

2.3.2 Use of undefined classes and properties

Category: incoherent

Oftentimes on the Web of Data, properties and classes are used without any formal definition. For example, publishers might say that ex:Alice ex:colleague ex:Bob even though ex:colleague is not defined as a property. Again, although such practice is not prohibited, by using ad-hoc undefined classes and properties publishers make automatic integration of data less effective and forego the possibility of making inferences through reasoning.

From our crawl, 1.78M triples (14.3% of all triples) use undefined properties, appearing in 39.7k documents (72.4% of valid RDF/XML documents): Table 1 enumerates the top five.¹⁰

Undefined Property      Triples Used
foaf:member_name        148,251
foaf:tagLine            148,250
foaf:image              140,791
cycann:label            123,058
qdoslf:neighbour        100,339

Table 1: Count of the top five properties used without a definition

Undefined Class         Triples Used
sioc:UserGroup          21,395
rss:item                19,259
linkedct:link           17,356
politico:Term           14,490
bibtex:inproceedings    11,975

Table 2: Count of the top five classes used without a definition

For example, from our crawl, the livejournal.com domain uses the properties foaf:member_name and foaf:tagLine in a total of almost 300k triples¹¹ - the FOAF vocabulary does not contain these properties and they are not defined elsewhere; such a practice of deliberately inventing undefined properties within a related namespace is common on the Web. Sometimes publishers make simple spelling mistakes: again, the property foaf:image is incorrectly used instead of foaf:img in the livejournal.com domain; to take another example, the term qdoslf:neighbour is commonly used - in 100k triples - instead of the property qdoslf:neighbours defined in the namespace.¹²

Similarly, there were 1.01M triples (8.1%) mentioning undefined classes in 21.3k documents (38.8%); the top five such instantiated classes are enumerated in Table 2. None of the first three classes, nor the last class, is defined in the dereferenced documents; for example, all of the sioc:UserGroup instances come from the apassant.net domain¹³. To take another example, the class politico:Term is generically described in the dereferenced document, but is neither implicitly nor explicitly typed as a class.

Publisher Recommendations: Many such errors are indeliberate and due to spelling or syntactic mistakes resolvable through minor fixes to the respective ontologies or exporters. Where terms have been knowingly invented, we suggest that the term be recommended as an addition to the respective ontology - or defined in a separate namespace - to enable re-use.

Consumer Recommendations: Liberal consumer applications could, for example, use fuzzy string matching techniques - e.g., Levenshtein distance measures - between undefined classes and properties encountered in the data, and classes and properties defined in the vocabularies. Generally however, consumer applications can usually overlook such mistakes and simply accept the consequence of incomplete reasoning for triples using such undefined terms.

¹⁰ It is important to note that herein, when we mention "undefined" classes or properties, we loosely refer to classes or properties "not defined in our crawl". In any case, our crawl would contain any property- or class-descriptions published according to best practices (i.e., using dereferencable terms).
¹¹ cf. http://danbri.livejournal.com/data/foaf
¹² cf. http://foafbuilder.qdos.com/people/danbri.org/foaf.rdf
¹³ cf. http://apassant.net/home/2007/12/flickrdf/data/people/36887937@N00 - indeed the authors herein are also prone to making simple errors in their publishing.

2.3.3 Misplaced classes/properties

Category: incoherent

Sometimes, a URI defined as a class is used as a property (appears in the predicate position of a triple) or, conversely, a URI defined as a property is used as a class (appears in the object position of an rdf:type triple); although not prohibited, such usage is usually inadvertent and can ruin the machine-interpretation of the associated data.

Class                   # Misplaced
rdfs:range              8,012
foaf:Image              639
rdfs:Class              94
wot:PubKey              18
foaf:OnlineAccount      15

Table 3: Top five "classes" used in the predicate position of a triple

Property                # Misplaced
foaf:knows              4
foaf:name               4
foaf:sha1               2
swrc:author             1
foaf:based_near         1

Table 4: Top five properties found in the object position of an rdf:type triple

Table 3 shows the top five classes used as a property in our crawl. In fact, rdfs:range is a core RDFS property, but is defined in one document¹⁴ as a class; hence the 8,012 occurrences are valid use of the property and the single declaration of rdfs:range as a class is at fault (this is also an instance of ontology hijacking, which we will discuss in Section 2.4). Most occurrences of the foaf:Image class used as a property stem from the sembase.at domain¹⁵; here the foaf:depiction property would be more suitable. Use of rdfs:Class as a property comes from the ajft.org and rdfweb.org domains¹⁶ where rdfs:Class is seemingly mistaken as rdf:type. The class wot:PubKey is mistakenly used instead of wot:hasKey¹⁷. Misuse of foaf:OnlineAccount stems from one document¹⁸ wherein the RDF/XML shortcut rdf:parseType="Resource" is used inappropriately, causing parsing of foaf:OnlineAccount elements as predicates.

After reasoning, more such errors were discovered, particularly in the affymetrix.com domain¹⁹ which describes genes and mistakenly uses rdfs:subClassOf to assert subsumption relations between properties (amongst many other issues); this resulted in properties - which, combined, were used in 37,454 triples - being typed as classes.

Conversely, the usage of properties in the class position - viz. the object position of an rdf:type triple - is much less common; Table 4 lists the results, with most errors stemming from one document²⁰.

Publisher Recommendations: Again, all such errors could easily be fixed by the publishers once they are made aware. Many of the above encountered errors were as a result of misuse of RDF syntactic terms, such as rdf:parseType="Resource", or more generally as syntactic mistakes in their documents: thus, publishers should not only ensure that their documents are syntactically valid, but also that they parse into the triples expected.

Consumer Recommendations: Applications which incorporate reasoning should consider foregoing standard inferences which rely on the position of a term in a triple to infer that that term is a class or property - for example, rule rdf1 in RDFS [10]. Aside from this, consumer applications will probably have to accept incomplete inferencing over such erroneous triples.

¹⁴ http://www.w3.org/2000/10/swap/infoset/infoset-diagram.rdf
¹⁵ cf. http://wiki.sembase.at/index.php/Special:ExportRDF/Dieter_Fensel
¹⁶ cf. http://swordfish.rdfweb.org/discovery/2004/01/www2004/files/1101776794087.rdf
¹⁷ cf. http://www.snell-pym.org.uk/alaric/alaric-foaf.rdf
¹⁸ cf. http://tommorris.org/foaf
¹⁹ cf. http://affymetrix.com/community/publications/affymetrix/tmsplice/all_genes.1.rdf
²⁰ http://www.marconeumann.org/foaf.rdf

2.3.4 Misuse of owl:DatatypeProperty/owl:ObjectProperty

Category: incoherent

The built-in term owl:DatatypeProperty describes properties which relate some resource to a literal value, i.e., an "attribute" property (in terms of Object-Oriented Programming); similarly, the OWL term owl:ObjectProperty describes properties which relate one resource to another (i.e., a "relation" property). Oftentimes, attribute properties are used between two resources, and relation properties are used with literal values.

D.type Prop.            # Non-literal    % Non-literal
swrc:journal            19,853           97.8%
swrc:series             14,963           97.3%
ical:location           4                2.6%
foaf:name               4                0%
foaf:msnChatID          3                0.4%

Table 5: Top five datatype-properties used with non-literal values

Obj. Prop.              # Literal        % Literal
affy:startsAt           6,234            100%
affy:stopsAt            6,234            100%
affy:cdsType            5,193            100%
affy:frame              4,882            100%
affy:commonToAll        4,814            100%

Table 6: Top five object-properties used with literal values

From our crawl, we found a total of 34.8k triples (0.3%) with datatype-properties given non-literal objects (in 1,194 [2.2%] documents across 9 domains). Table 5 lists the top five; the only significant errors stem from l3s.de²¹ which exports RDF from the Digital Bibliography & Library Project (DBLP) - they define two datatype-properties in the swrc: namespace but only use the properties with non-literal objects.

Analogously, there were 41.7k triples (0.3%) with object-properties given literal values (in 4,438 [8%] documents from 91 domains). Table 6 lists the top five; many such occurrences come from the affymetrix.com domain which commonly uses five different object-properties with literal values (in a total of 27.4k triples from our crawl). However, there were many other such properties with significant misuse including miscellaneous properties from the opencyc.org domain (6,161), foaf:page (3,160), foaf:based_near (1,078), ical:organizer (456), amongst others; again, the errors were spread over 92 different domains. In fact, the property foaf:myersBriggs (in the popularly used FOAF specification itself) was until recently incorrectly defined as an owl:ObjectProperty with rdfs:range rdfs:Literal and had 35 literal values in our dataset.

Publisher Recommendations: Where datatype- or object-property constraints are erroneously specified - e.g., swrc:journal, swrc:series, foaf:myersBriggs - they can simply be reversed by the ontology maintainers. However, in many cases such constraints are purposefully defined to ensure consistent usage of the term; in this case, the onus is on publishers to thereby abide.

Consumer Recommendations: Applications would typically use such constraints for form generation in the context of instance data creation. Liberal versions of such applications may decide to automatically reverse such constraints, where - in examples such as the affy: properties above - all usage is contrary to the specified constraint. Indeed, some weighting scheme may be adopted for examples - such as the swrc: properties - where most usage is contrary to the vocabulary constraint. Again, such approaches would admittedly need evaluation in the setting of the given application.

²¹ cf. http://dblp.l3s.de/d2r/data/publications/conf/aswc/HoganHP08

2.3.5 Members of deprecated classes/properties

Category: incoherent

Briefly, the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty are used to indicate classes or properties that are no longer recommended for use: vocabulary publishers usually assert deprecation for classes or properties which have been considered to be obsoleted by more popular terms in local or remote vocabularies, or perhaps even where the original term is contrary to some naming scheme or considered outside of the scope of the vocabulary. In our dataset, we did not find any members of a deprecated class; however, we found 290 instances (in 115 documents) of four deprecated properties: wordmap:subCategory (260), sioc:has_group (15), sioc:content_encoded (10) and sioc:description (5).

Publisher Recommendations: Publishers of instance data should intermittently verify that no terms used have since been considered deprecated by the vocabulary maintainer, and should take appropriate action to use - possibly novel - recommended terms where possible.

Consumer Recommendations: Applications could consider specifying manual mappings from deprecated terms to compatible terms now recommended for use. Less liberal applications may consider omitting triples which use deprecated terms. Generally, however, usage of deprecated terms does not require special treatment.

2.3.6 Bogus owl:InverseFunctionalProperty values

Category: incoherent/hijack

Aside from URIs - which can be hard to agree upon - resources are also commonly identified by values for properties which uniquely identify a resource; such keys are pre-existing and easier to agree upon. These properties are termed "inverse-functional" and are identified in OWL with the term owl:InverseFunctionalProperty. If two resources share a common value for one of these properties, reasoning will view these resources as equivalent (referring to the same resource). For example, the FOAF ontology has defined a number of inverse-functional properties for identifying people; these include foaf:homepage, foaf:mbox (email), foaf:mbox_sha1sum (sha1 encoded email to prevent spamming), amongst others. Herein, FOAF holds the intuition that the values for such properties should be unique to an individual, and that the usage of such properties should reflect that (i.e., foaf:mbox should only be used for personal and unshared email-addresses).

However, FOAF exporters commonly do not respect the semantics of these inverse-functional properties and export 'void' values given partial user-input. The most widespread example is 08445a31a78661b5c746feff39a9db6e4e2cc5cf, which is the SHA1 hash of 'mailto:' and is commonly assigned by FOAF exporters - as values for foaf:mbox_sha1sum - to users who don't specify an email in some input form.²²

Now, all such users can be interpreted as equivalent - i.e., representing the same real-world person - according to the semantics of the foaf:mbox_sha1sum property. This problem is quite widespread: even in our diminutive crawl, 52 hosts contribute 1,169 different bogus values in 1,041 documents. For example, 194 errors come from the bleeper.de domain²³, 189 from identi.ca²⁴, 166 from uni-karlsruhe.de²⁵, 163 from twit.tv²⁶ and 92 from tweet.ie²⁷; Table 7 details the top five void values for inverse-functional properties which we found in our dataset.

According to the standard reflexive, symmetric and transitive semantics of equality (represented in RDF by the equality relation owl:sameAs), if we take for example the 986 entries with the same null sha1 value, 986² ≈ 972k owl:sameAs relations would be inferred. Further, assuming, for example, an average of eight triples mentioning each equivalent resource, 972k × 8 ≈ 7.8M statements would be inferred by substituting each equivalent identifier into each statement. In other words, such chains of equality cause a quadratic explosion of inferences; when one considers larger Web-crawls, the problem becomes quite critical.

Publisher Recommendations: For publishers, the issue is easily resolved by, for example, validating user input and checking the uniqueness and validity of inverse-functional values. Conversely, vocabulary maintainers should be careful to clearly state that a property is inverse-functional in the human-readable specification, and select labels for property URIs which give an indication of the inverse-functional nature of the property - for example, choose the label ex:personalMbox over ex:mbox.

Consumer Recommendations: A simple solution commonly used by reasoning agents is to simply blacklist void values. Although an exhaustive list of blacklist candidates may be difficult to derive, the above values would - in our experience - constitute most of the void values. Other heuristics may be employed to ensure correct equality reasoning - for example, use of a disambiguation step to quickly remove obviously incorrect equality inferences.

²² In fact, at the time of writing, a Google search for this SHA1 string will result in nearly two million hits - seemingly almost all of which are FOAF RDF documents.
²³ cf. http://bleeper.de/powerboy/foaf
²⁴ cf. http://identi.ca/whataboutbob/foaf
²⁵ cf. http://www.aifb.uni-karlsruhe.de/Personen/viewPersonFOAF/foaf_1876.rdf
²⁶ cf. http://army.twit.tv/takeit2/foaf

2.3.7 Malformed datatype literals

Category: incoherent

In RDF, a subset of well-defined XML datatypes are used to provide structure and semantics to literal (string) values. For example, string date values can be specified using the xsd:date datatype, which provides a lexical syntax for date strings and a mapping from date strings to date values interpretable by an application. From the content of the crawl, we found 3,666,840 literals of which 170,351 (4.6%) used a datatype. Of these, the top five most popular datatypes were xsd:string (53,879), xsd:nonNegativeInteger (38,501), xsd:integer (15,826), xsd:dateTime (15,824), and xsd:unsignedLong (12,318).

Unfortunately, incorrect use of datatypes is relatively common in the Web of Data. Firstly, datatype literals can be malformed: i.e., ill-typed literals which do not abide by the lexical syntax for their respective datatype. There were 4,650 malformed datatype literals (2.7% of all typed literals) in our crawl: Table 8 summarises the top five datatypes to be instantiated with malformed values.

The two most common errors for xsd:dateTime stem from

²⁷ cf. 
http://tweet.ie/seank/foaf

(i) the wasab.dk domain²⁸ whereby time-zones are missing the required ':' separator; and (ii) the soton.ac.uk domain²⁹ wherein the mandatory seconds-field is not specified. For xsd:int, almost all errors stem from the freebase.com domain whereby boolean values True and False are found³⁰. For xsd:nonNegativeInteger, all stem from the deri.ie domain³¹ where non-numeric strings are incorrectly used. Finally, for xsd:gYearMonth and xsd:gYear, all illegal usage comes from the dbpedia.org domain³² where full xsd:dateTime literals are used instead.

Table 7: Count of the five most common void inverse-functional property values

Inverse-Functional Property   Void Value                                    Count
foaf:mbox_sha1sum             "08445a31a78661b5c746feff39a9db6e4e2cc5cf"    986
foaf:mbox_sha1sum             "da39a3ee5e6b4b0d3255bfef95601890afd80709"    167
foaf:homepage                                                               11
foaf:mbox_sha1sum             ""                                            5
foaf:isPrimaryTopicOf                                                       2

Table 8: Top five datatypes having malformed values and percentage of all values which are malformed

Datatype                 # Malformed   % Malformed
xsd:dateTime             4,042         26.4%
xsd:int                  250           2.1%
xsd:nonNegativeInteger   232           0.6%
xsd:gYearMonth           67            100%
xsd:gYear                27            1.4%

Publisher Recommendations: Clearly, malformed literals are quite common. In all examples, the errors can be resolved by simple syntactic fixes to the publishing framework, or removing or changing the datatype on the literal; one can conclude – especially in the absence of a popular validator for datatype syntax – that publishers are simply not aware of such issues.

Consumer Recommendations: Although datatype-aware agents could incorporate heuristics to shoulder common mistakes – e.g., publishers commonly omit the mandatory seconds field from date-time literals – not all such mistakes can feasibly be accounted for. Again – and in cases where the issue next discussed does not apply – such literals can simply be interpreted as plain literals.

2.3.8 Literals incompatible with datatype range

Category: incoherent/inconsistent

Aside from explicitly typed literals, the range of properties may also be constrained to be a certain datatype, mandating respectively typed values for that property; e.g., one can say that the attribute property ex:bornOnDate has xsd:date values. A datatype clash can then occur if the property is given a value (i) that is malformed, or (ii) that is a member of an incompatible datatype. Table 9 provides counts of datatype clashes for the top five such properties.

Table 9: Top five properties with datatype-clashes and percentage of all values which cause clashes

Datatype Property     # Clashes   % Clashes
sl:creationDate       9,212       100%
scot:ownAFrequency    529         100%
owl:cardinality       464         65.2%
ical:description      262         21.8%
wn20schema:tagCount   204         100%

The property sl:creationDate has the range xsd:date but all triples with sl:creationDate in the predicate position have plain-literal objects – all such usage originates from the semanlink.net tagging system³³; please note that plain literals without language tags are considered as xsd:strings [10] and so are disjoint with xsd:date. The property scot:ownAFrequency is given range xsd:float but only ever used in the domain linkeddata.org³⁴ with xsd:integer objects; xsd:integer is a sub-type of xsd:decimal and is disjoint with xsd:float [4]. owl:cardinality is often used with plain-literal objects³⁵ contrary to the defined range xsd:nonNegativeInteger. The property ical:description – defined as having range xsd:string – is almost always instantiated with a plain-literal object (99.8%); however, only the 21.8% which use language tags constitute an inconsistency³⁶. Finally, wn20schema:tagCount has range xsd:nonNegativeInteger but is only used with plain literals in the w3.org domain³⁷.

Publisher Recommendations: In all such cases, the root problem could be resolved if the vocabulary publisher removes the range on the property; in many cases such an approach may even be suitable: properties such as ical:description which are intended to have prose values should remove xsd:string constraints – optionally setting the range as the more inclusive rdf:PlainLiteral datatype to encourage literal values – and thus allow use of language tags. However, the majority of such datatype domain constraints are validly used to restrict possible values for the property and the onus is on data-publishers to thereby abide.

Consumer Recommendations: Again, liberal agents could consider changing the defined range of the property to reflect some notion of "common" usage. Also, although the usage of properties often does not reflect the defined datatype range, in our dataset we found that the literal strings were almost always within the lexical space of the range datatype and that they were just poorly typed. We only found two properties which were given objects malformed according to the range datatype (before, we were concerned with malformed literals given an explicit datatype): viz. exif:exposureTime with range xsd:decimal (given 49 plain literals with malformed decimal values in one document³⁸) and cfp:deadline with range xsd:dateTime (given 3 plain literals with malformed date-time values in 3 documents³⁹). Thus, in all but the latter cases, liberal software agents could ignore mismatches

34 cf. http://community.linkeddata.org/dataspace/kidehen2/subscriptions/Kingsley_Feed_Collection/tag/rdf
35 425 of 464 such examples stem from http://bioinfo.icapture.ubc.ca/subversion/Cartik/Object-OWLDL2.owl
36 cf. http://www.ivan-herman.net/professional/CV/W3CTalks.rdf
28 cf. http://www.wasab.dk/morten/2004/08/photos/1/index.rdf
29 cf. http://rdf.ecs.soton.ac.uk/publication/10006
30 cf.
http://rdf.freebase.com/rdf/aviation/aircraft_ownership_count
31 cf. http://www.deri.ie/fileadmin/scripts/foaf.php?id=320
32 cf. http://dbpedia.org/data/1994_San_Marino_Grand_Prix.xml
33 cf. http://www.semanlink.net/tag/rdf.rdf
37 cf. http://www.w3.org/2006/03/wn/wn20/instances/wordsense-act-verb-3.rdf
38 http://kasei.us/pictures/2005/20050422-WCCS_Dinner/index.rdf
39 cf. http://sw.deri.org/2005/08/conf/ssws2006.rdf – an example of errors admittedly generated by an author of this paper.

between an object's datatype and that specified by the property's range, parsing the literal string into the value space of the range datatype; however, caution is required when considering non-standard datatypes: consider if a property ex:temp has the datatype ex:celcius as range and is used with an ex:fahrenheit value – clearly the value should not be parsed as ex:celcius although it is in its lexical space.

2.3.9 OWL inconsistencies

Category: inconsistent

The Web Ontology Language (OWL) includes features – such as defining disjoint classes, inequality between resources, etc. – which can additionally be used to check if some data agrees with the underlying ontology; i.e., that the data is consistent.

To begin with, we quickly mention inconsistency checks which we performed, but which did not detect anything in the crawl. Firstly, the class owl:Nothing is intended to represent the empty class, and, as such, should not contain any members; in our dataset, we found no directly asserted members of owl:Nothing. Also, an inconsistency can occur when owl:sameAs and owl:differentFrom overlap; again, however, we found no such examples in our crawl – in fact, we found no usage of owl:differentFrom in the predicate position of a triple. Similarly, although we found two instances of owl:AllDifferent/owl:distinctMembers usage, none resulted in an inconsistency. Continuing, we also performed similar checks for instances of classes which were defined as complements of each other using owl:complementOf; however, again we found no owl:complementOf relations in our dataset. Briefly, we also performed simple checks for unsatisfiable concepts whereby, for example, one class is (possibly indirectly) both a subclass-of and disjoint-with another class: for each class found, we performed reasoning on an arbitrary membership of that class and checked whether any of the inferred memberships were of disjoint classes; however, we found no such concepts on the Web.

In fact, all inconsistencies we found in our crawl were related to memberships of disjoint classes. The OWL property owl:disjointWith is used to relate classes which cannot share members; disjoint classes are used in popular Web ontologies as an indicator of inconsistent information. For example, in FOAF the classes foaf:Person and foaf:Document are defined as being disjoint: something cannot be both. Resources can be asserted to be members of disjoint classes either directly by document owners, or inferred through reasoning. We only detected a small number of such direct assertions in our crawl – generally, a resource is asserted to be a member of one class in one document and a disjoint class in a remote document.⁴⁰ However, after reasoning on our dataset, there were 1,329 occurrences of inconsistencies caused by disjoint classes; Table 10 enumerates the top five.

Table 10: Top five instantiated pairs of disjoint classes

Disjoint Classes                    # Instances
foaf:Agent ⊓ foaf:Document          502
foaf:Organization ⊓ foaf:Person     328
foaf:Document ⊓ foaf:Person         232
sioc:Container ⊓ sioc:Item          194
sioc:Item ⊓ sioc:User               35

The most prominent cause of such problems stems from two incompatible FOAF exporters for LastFM data: the same resources are simultaneously defined as being of type foaf:Person in the opiumfield.com domain⁴¹ and inferred to be members of foaf:Document in the dbtune.org domain⁴². Again, there are many other exporters and domains which contribute; for example, an exporter of Wikipedia data in the sioc-project.org domain⁴³ uses the same URI to identify users and the users' Wikipedia profile page.

Publisher Recommendations: Such problems with inconsistent data – especially those arising from multiple sources – may be quite difficult to solve. The obvious and lazy solution is to remove the disjointness constraints from the relevant ontologies; however, these constraints are intended to flag nonsensical or conflicting information and removing them clearly does not solve the root cause. Currently, the main observed cause for such inconsistencies is the use of incompatible naming schemes – using URIs to identify two completely different things – most often across different domains; agreement must be reached on what is an appropriate identifier for the contentious resource.

Consumer Recommendations: There are two standard approaches for handling inconsistencies in agents incorporating reasoning: resolve or overlook; the former approach – which requires 'defeating' the 'marginal view' – may not be so in tune with the open philosophy of the Web, where contradiction could be considered a 'healthy' symptom of differing opinions. Rule-based reasoning approaches have the luxury of optionally overlooking inconsistencies, where inconsistent data can simply be flagged (e.g., see OWL 2 RL rules in [7] with false consequences). However, tableaux algorithms are less resistant to inconsistencies and are tied by the principle of explosion: ex contradictione quodlibet (from contradiction follows anything); some works focus on paraconsistent reasoning – tableaux reasoning tolerant to inconsistency – although such approaches are expensive in practice (cf. [14]). In any case, in either rule- or tableaux-based approaches – and depending on the application scenario – inconsistent data may be pre-processed with those triples causing inconsistencies dropped according to some heuristic measures.

40 http://apassant.net/blog/2009/05/17/inconsistencies-lod-cloud
41 cf. http://rdf.opiumfield.com/lastfm/profile/danbri
42 cf. http://dbtune.org/last-fm/danbri.rdf
43 cf. http://ws.sioc-project.org/mediawiki/mediawiki.php?wiki=http://en.wikipedia.org/wiki/User:Andy_Dingley

2.4 Non-authoritative contributions

2.4.1 Ontology-hijacking

Category: incoherent/hijack

In previous work, we encountered a behaviour which we termed "ontology hijacking" [12]: the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected: herein – and loosely – we define the notion of an authoritative document for a term as the document resolved by dereferencing the term, and consider all other (non-authoritative) documents as third-party documents (please see [12] for a more exhaustive discussion). Web ontologies/vocabularies published according to best-practices are thereby the only document authoritative for the terms in their namespace.

In our dataset, we found that 5,211 documents engaged in some form of ontology hijacking – most such occurrences were due to third party sources 'echoing' the authoritative definition of a class or property in their local ontology. However, we also encountered examples of third-parties redefining classes/properties. As an example, we found one document which redefines the core property rdf:type – defining nine of its properties as being the domain of rdf:type – effectively leading to every entity described on the Web being inferred as a member of those nine properties.⁴⁴ Again, for example, we found 219 statements declaring foaf:Image – authoritatively defined as a class – to be a property; these were from the sembase.at domain (again see Footnote 15).

Publisher Recommendations: This particular issue focuses on how vocabulary publishers re-use existing vocabularies: we would thus particularly encourage vocabularies to extend external terms, and not redefine them. Such usage is more generally related to the principle of modularity, encouraging the modular design of Web vocabularies and avoiding the mess implied by the cross-definition of terms over the Web.

Consumer Recommendations: Clearly, on the Web, people should not be constrained in what they express and where they express it; however, to do useful reasoning, developers must take contextual information into account and provide some means of insulating ontologies from wayward external contributions. Again, in previous work we have described our system for performing reasoning over RDF Web data called SAOR [12], and found it essential to introduce our notion of authority when doing reasoning: in particular, we define our notion of an "authoritative rule application" which will not produce inferences from non-authoritative triples which redefine external terms. An orthogonal approach to the same problem is that of "quarantined reasoning" described in [5], which loosely constitutes "per-document" reasoning, and scopes inferences based on a closed notion of context derived from the implicit and explicit imports of each input document, thus excluding third-party contributions (please see [12] for a more in-depth comparison).

44 http://www.eiao.net/rdf/1.0

3. RELATED WORK

Earlier papers analysing problems in RDF Web data and the uptake of standards mainly focus on the categorisation and validation of documents with respect to the various OWL species. In [1], the authors performed validation – based on OWL-DL constraints – for a sample group of 201 OWL ontologies which were all found to be OWL Full for mainly trivial reasons; the authors then suggested means of patching the ontologies to be OWL-DL conformant. A similar but more extensive survey was conducted in [19] over 1,275 ontologies; the authors provided categorisation of the expressivity and species and discussion related to patching of the ontologies. At the moment, we do not offer species validation for RDFS/OWL and our scope is much broader with respect to validation.

In [15], the authors describe common user errors in modeling OWL-DL ontologies. In [17], the authors describe some error checking for OWL ontologies using integrity constraints involving the Unique Name Assumption (UNA) and also the Closed World Assumption (CWA). Similarly, in [18], various errors and constraints are introduced for error checking; the primary contribution is the introduction of five 'incongruencies' (e.g., an individual not satisfying a cardinality constraint according to UNA/CWA) with cases, causes and methods of detection. However, all of these papers have a decidedly more OWL-centric focus than our work and provide no analysis or discussion of Web data.

In [6], the authors provided an in-depth analysis of the landscape of RDF Web data in a crawl of 300M triples. They also identified some statistics about classes and properties (SWTs) in RDF data; e.g., they found that 2.2% of classes and properties had no definition and that 0.08% of terms had both class and property meta-usage. However, again our focus is much broader in characterising errors in RDF Web data.

4. WHAT ABOUT ALICE?

We can now see that although our protagonist Alice is purely hypothetical, her adventures in Linked Data wonderland are disappointingly less so; in our analysis, we have shown the types of issues in RDF data on the Web that have made her journey so disconcerting. We have presented, provided statistics and examples for, and discussed a plethora of different types of errors, hopefully raising awareness of such issues amongst data publishers and developers of agents who wish to access and interpret such data. As typified by Alice, such issues can dramatically lower the quality of some applications, and consequently their end-user appeal; the errors do not come from the engine, but from the underlying data and thus, reasonable efforts to resolve data issues are as important as developing tolerant applications.

So, how can we help Alice?

We have already determined that many such issues are easily resolvable by the publisher and therefore concluded that publishers are unaware of the problems resident in their data. One solution would be to provide a system for validating RDF data being published to the Web: several systems exist but do not cover the broad range of issues discussed in this paper. From a syntactic point of view, the first validator available was the W3C RDF Validator⁴⁵, being able to check the syntax of any RDF/XML document (however, not datatype syntax). The DAML validator⁴⁶ provides checking of a large number of issues; however the validator is out of date (does not support OWL), and, at the time of writing, does not work. With regards to the protocol issues, the online Vapour validator⁴⁷ [3] aims at validating the compliance of published RDF data (either vocabularies or instance data) according to Linked Data principles [2]. The online Pellet [16] validator⁴⁸ enables species validation as well as other criteria we identified such as checking ontology consistency and finding unsatisfiable concepts.

There are also a number of command-line validators. The Validating RDF Parser (VRP)⁴⁹ operates on specified RDF Schema constraints, with some support for datatypes. The Eyeball⁵⁰ project provides command-line validation of RDF data for common problems including use of undefined properties and classes, poorly formed namespaces, problematic prefixes, literal syntax validation and other optional heuristics.

However, none of the above validators cover the plethora of issues we have encountered; thus, we have developed and now provide RDF:Alerts: http://swse.deri.org/RDFAlerts/. Given a URI, the system provides validation for many of the issues enumerated in this paper; Figure 2 shows a screenshot of feedback for an erroneous document. We further intend to extend the tool – to include all of the presented issues and suggestions from the community – and to improve usability; we may also consider extending such a tool to provide intermittent automatic reporting to publishers who opt in, depending on the perceived demand of such a service.

Figure 2: Screenshot of validation results from RDF:Alerts system.

Still, other issues – particularly relating to inter-dataset incompatibility, naming, and inconsistent use of vocabulary terms – may be more difficult to resolve. Indeed, we have also not properly discussed issues introduced by versioning, where, for example, a vocabulary maintainer makes changes to the definition of a term breaking backwards-compatibility with legacy usage of that term – indeed, we recognise that casual versioning may explain some of the discrepancies we have encountered in this paper, though systematic detection of such errors is difficult given our static snapshot dataset. The resolution of such errors may sometimes require compromise between maintainers of ontologies and maintainers of exporters which populate the ontologies' terms, reflecting the current social and community driven nature of Web publishing. Reflecting such community driven efforts, consideration is being given to more open ontology editing and creation. In VoCamp events⁵¹, people from different backgrounds and with different perspectives meet to work on modelling lightweight ontologies for immediate use. In order to allow ontologies to evolve according to user needs, initiatives such as semantic wikis for ontology management [13] and services such as OpenVocab⁵² allow users to more freely interact with the ontology terms they wish to use and share. Although such approaches may again suffer from human error and disagreement – and have many open issues such as versioning and editing privileges – such community-driven efforts could lead to a more extensive vocabulary of terms for use on the Web.

We have also initiated a community driven effort which we call "The Pedantic Web Group"⁵³, which aims to engage with publishers and help them improve the quality of their data. Firstly, we have provided some pragmatic educational material for publishers, including a list of validation tools and of frequently observed problems in RDF publishing. Secondly, we have created a mailing list for actively contacting publishers about their mistakes and for various discussions on the quality of the Web of Data – subscription to which is open to the community. Indeed, such efforts may be the only means to resolve issues which require the co-ordination of multiple publishers. As such, we see the Pedantic Web Group as a go-to point for tackling publishing-related issues on the Web of Data, and as a community-driven means of promoting better quality publishing for the Web of Data.

To finally conclude, we would like to replace the present hypothetical Alice with a possible future Alice who is again browsing the Web of Data – however this time using an application which has been tempered for noisy data, where the documents have been validated, consistent identifiers used, and resources described using a rich vocabulary of community-endorsed terms. We hope that such an Alice might be amazed – this time for the right reasons.

45 http://www.w3.org/RDF/Validator/
46 http://www.daml.org/validator/
47 http://validator.linkeddata.org
48 http://www.mindswap.org/2003/pellet/demo.shtml
49 http://139.91.183.30:9090/RDF/
50 http://jena.sourceforge.net/Eyeball/
51 http://vocamp.org/wiki/Main_Page
52 http://open.vocab.org/
53 http://pedantic-web.org/

5. REFERENCES

[1] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 668–682. Springer, November 2004.
[2] T. Berners-Lee. Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. http://www.w3.org/DesignIssues/LinkedData.html.
[3] D. Berrueta, S. Fernández, and I. Frade. Cooking HTTP content negotiation with Vapour. In Proceedings of 4th Workshop on Scripting for the Semantic Web (SFSW2008), June 2008.
[4] P. V. Biron and A. Malhotra. XML Schema part 2: Datatypes second edition. W3C Recommendation, Oct. 2004. http://www.w3.org/TR/xmlschema-2/.
[5] R. Delbru, A. Polleres, G. Tummarello, and S. Decker. Context dependent reasoning for semantic documents in sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), Karlsruhe, Germany, Oct. 2008.
[6] L. Ding and T. Finin. Characterizing the Semantic Web on the Web. In Proceedings of the 5th International Semantic Web Conference, November 2006.
[7] B. C. Grau, B. Motik, Z. Wu, A. Fokoue, and C. Lutz. OWL 2 Web Ontology Language: Profiles. W3C Working Draft, Apr. 2008. http://www.w3.org/TR/owl2-profiles/.
[8] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In 5th International Semantic Web Conference, pages 258–271, 2006.
[9] M. Hausenblas. Exploiting linked data to build applications. IEEE Internet Computing, 13(4):68–73, 2009.
[10] P. Hayes. RDF semantics. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/rdf-mt/.
[11] T. Heath. How will we interact with the web of data? IEEE Internet Computing, 12(5):88–91, 2008.
[12] A. Hogan, A. Harth, and A. Polleres. Scalable Authoritative OWL Reasoning for the Web. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
[13] M. Krötzsch, S. Schaffert, and D. Vrandečić. Reasoning in semantic wikis. In Reasoning Web, pages 310–329, 2007.
[14] Y. Ma, P. Hitzler, and Z. Lin. Algorithms for Paraconsistent Reasoning with OWL. In ESWC, pages 399–413, 2007.
[15] A. L. Rector, N. Drummond, M. Horridge, J. Rogers, H. Knublauch, R. Stevens, H. Wang, and C. Wroe. Owl pizzas: Practical experience of teaching owl-dl: Common errors & common patterns. In EKAW, pages 63–81, 2004.
[16] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5(2):51–53, 2007.
[17] E. Sirin, M. Smith, and E. Wallace. Opening, closing worlds - on integrity constraints. In OWLED, 2008.
[18] J. Tao, L. Ding, and D. L. McGuinness. Instance data evaluation for semantic web-based knowledge management systems. In HICSS, pages 1–10, 2009.
[19] T. D. Wang, B. Parsia, and J. A. Hendler. A survey of the web ontology landscape. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), pages 682–694, Athens, GA, USA, Nov. 2006.
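The arithmetic and the blacklisting heuristic discussed in Section 2.3.6 can be illustrated with a short sketch. The Python below is not the code used for the analysis in this paper; it is a minimal example under stated assumptions. The function names (inferred_same_as, equivalence_groups) and the input format (an iterable of (subject, value) pairs for a single inverse-functional property) are hypothetical, and the blacklist contains only the three foaf:mbox_sha1sum void values listed in Table 7.

```python
import hashlib
from collections import defaultdict

# Void foaf:mbox_sha1sum values as listed in Table 7.
VOID_VALUES = {
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",  # described in the text as the SHA1 of 'mailto:'
    "da39a3ee5e6b4b0d3255bfef95601890afd80709",  # SHA1 of the empty string
    "",                                          # empty literal
}


def inferred_same_as(n):
    """Number of owl:sameAs statements among n mutually equal resources.

    Counts ordered pairs including reflexive ones (n * n), which is the
    figure used in Section 2.3.6.
    """
    return n * n


def equivalence_groups(pairs, blacklist=VOID_VALUES):
    """Group subjects sharing an inverse-functional value, skipping void values.

    `pairs` is an assumed input format: an iterable of (subject, value)
    tuples for one inverse-functional property.
    """
    groups = defaultdict(set)
    for subject, value in pairs:
        if value in blacklist:
            continue  # a void value identifies nobody, so no equality is drawn
        groups[value].add(subject)
    # Only values shared by two or more subjects lead to owl:sameAs inferences.
    return {value: subjects for value, subjects in groups.items() if len(subjects) > 1}


if __name__ == "__main__":
    print(inferred_same_as(986))      # 972196, the "972k" of Section 2.3.6
    print(inferred_same_as(986) * 8)  # 7777568, roughly 7.8M statements
    print(hashlib.sha1(b"").hexdigest() in VOID_VALUES)  # True
```

With the blacklist in place, the 986 resources sharing the first void value in Table 7 contribute no equalities at all, whereas without it they alone would account for the quadratic figure printed above. A production agent would likely need a longer blacklist and the kind of disambiguation step mentioned in the consumer recommendations.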