Revyu.com: a Reviewing and Rating Site for the Web of Data

Tom Heath and Enrico Motta
Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom
{t.heath, e.motta}@open.ac.uk

Abstract. Revyu.com is a live, publicly accessible reviewing and rating Web site, designed to be usable by humans whilst transparently generating machine-readable RDF metadata for the Semantic Web based on their input. The site uses Semantic Web specifications such as RDF and SPARQL, and the latest Linked Data best practices, to create a major node in a potentially Web-wide ecosystem of reviews and related data. Throughout the implementation of Revyu, design decisions have been made that aim to minimize the burden on users, by maximizing the reuse of external data sources and allowing less structured human input (in the form of Web2.0-style tagging) from which stronger semantics can later be derived. Links to external sources such as DBpedia are exploited to create human-oriented mashups at the HTML level, whilst links are also made in RDF to ensure that Revyu plays a first-class role in the blossoming Web of Data. The site is available at http://revyu.com.

1 Introduction

Revyu.com is a live, publicly usable (and used!) reviewing and rating Web site developed using Semantic Web technologies and standards, and according to Linked Data principles [1] and best practices [2]. Reviews and ratings are widely available on the Web and are one major form of Web2.0-inspired 'user-generated content'. However, despite the availability of reviews through APIs such as Amazon Web Services, this data remains largely confined to isolated 'silos', described in formats that hinder its integration and interlinking with data from other sources. This presents considerable barriers to aggregating all reviews of a particular item from across the Web. As has been recognised by previous authors [3, 4], the Semantic Web, or Web of Data, provides a technological platform with which to overcome this problem. Revyu takes a significant and concrete step in this direction by exposing reviews using standards such as RDF and SPARQL. In doing so it helps to seed an ecosystem of interlinked reviews, and to bootstrap the Semantic Web as a whole.

2 Revyu Overview

Revyu allows people to review and rate things simply by filling in a Web form. This style of interaction with the site will be familiar to those who have written reviews on sites such as Epinions1 or Amazon2. Whilst this functionality is not especially novel, as a reviewing application Revyu improves significantly over other work in the area in the following ways: it goes well beyond the closed-world 'silos' of sites such as Epinions and TripAdvisor by exposing reviews in a reusable, machine-readable format; it improves upon the APIs of sites such as Amazon by using a more flexible data format (RDF), allowing more versatile queries via SPARQL, and linking to external data sources; lastly, the site takes an open-world view of the reviewing process by not constraining users to reviewing items from a fixed database. Anything a user can name can be reviewed, whilst links supplied with the review can disambiguate items thanks to inverse functional properties such as foaf:homepage. Consequently reviewers are not restricted to reviews and ratings in one domain, as is the case with Golbeck's FilmTrust [4]. As of August 2007 Revyu has been live for 10 months, attracting 412 reviews from 112 reviewers.

1 http://www.epinions.com/
2 http://www.amazon.com/
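As an illustration of the disambiguation step mentioned above, the following sketch (in Python with rdflib, not the PHP used by Revyu) merges two hypothetical descriptions of an item because they share a value for the inverse functional property foaf:homepage; the item URIs are purely illustrative assumptions.

    # Illustrative sketch only: identity resolution via foaf:homepage.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    g.parse(data="""
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://revyu.com/things/example-cafe> foaf:homepage <http://example.org/cafe> .
    <http://other.example/item/42> foaf:homepage <http://example.org/cafe> .
    """, format="turtle")

    # Group subjects by their foaf:homepage value; because foaf:homepage is an
    # inverse functional property, subjects sharing a value denote the same item.
    by_homepage = {}
    for item, homepage in g.subject_objects(FOAF.homepage):
        by_homepage.setdefault(homepage, []).append(item)

    for homepage, items in by_homepage.items():
        for other in items[1:]:
            g.add((items[0], OWL.sameAs, other))  # record the inferred identity

    print(g.serialize(format="turtle"))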
Revyu is built from the ground up on Semantic Web technologies. By following Linked Data principles [1] and best practices [2], the site ensures that the reviews it hosts can be fully connected into a Web of Data. This approach manifests itself in a number of ways. All site content, in addition to being available in HTML, is also published in RDF/XML that is interlinked with the corresponding HTML pages but available as separate crawlable documents. As we have described elsewhere, this creation and publication of RDF is invisible to the reviewer, enabling novice users to contribute data to the Semantic Web through a familiar, Web2.0-style mode of interaction [5]. To date this approach has yielded over 13,000 RDF triples publicly available on the Semantic Web. Whilst not a large figure by many standards, it is significant that these triples have been generated primarily from direct user input, rather than by data mining, extraction from natural language, or conversion of existing databases.

In addition to review data, RDF describing reviewers, reviewed items, and the tags assigned to these is published on the site. These descriptions use the FOAF [6] and Tag [7] ontologies, as well as properties and classes from RDFS and OWL. This data can also be retrieved programmatically via the Revyu SPARQL endpoint3, allowing third parties to access Revyu data for reuse in their own applications. Whilst in some ways analogous to Web2.0 APIs that provide remote query capabilities, SPARQL endpoints afford many advantages to the developer: for example, common libraries can be used to query multiple RDF graphs yet return the results as one result set, effectively allowing joins over multiple data sources. In the following section we detail the technical infrastructure underlying Revyu, and discuss decisions made in implementing the system.

3 http://revyu.com/sparql/welcome

3 Revyu Architecture and Implementation

Revyu is implemented in PHP, and runs on a regular Apache web server. The RDF API for PHP (RAP) [8] provides RDF processing capabilities, whilst RDF data is persisted to a de-normalised MySQL database following the RAP database schema. The Revyu SPARQL endpoint relies on the RAP SPARQL engine, which operates against the same MySQL-based triplestore.

From the outset Revyu was designed to adhere to the four 'commandments' of Linked Data outlined by Berners-Lee [1]: using URIs as names for things, using HTTP URIs so people can look up those names, providing useful information when someone looks up a URI, and linking to other URIs so more things can be discovered. All things represented on Revyu are assigned URIs: reviews, people, reviewed things, tags assigned to things, and even the bundles that represent the tags assigned by one person at one point in time. Providing URIs for all these things gives many items a presence on the Semantic Web which they would not otherwise have, and enables any third party to refer to these items in other RDF statements. This opens the way for links between Revyu and other data sets, thereby helping to lay the foundations for a Web of Data.

All URIs in the Revyu URI-space can be dereferenced. Attempts to dereference the URIs of non-information resources receive an HTTP 303 "See Other" response containing the URI of a document that describes the resource. This adheres to the W3C Technical Architecture Group's finding on the httpRange-14 issue [9], and serves to reinforce the distinction between a resource and a description of that resource. Content negotiation is also performed on Revyu URIs, whereby the user agent receives a description of the resource in either HTML or RDF, depending on the value of the Accept header sent in the initial HTTP request.
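For illustration, the sketch below (Python with the requests library, not part of the Revyu codebase) shows how a Linked Data-aware client might dereference a Revyu item URI: it asks for RDF via the Accept header, observes the HTTP 303 redirect to the describing document, and then retrieves that document. The item URI is hypothetical.

    # Minimal client-side sketch of dereferencing a (hypothetical) Revyu URI.
    import requests
    from urllib.parse import urljoin

    item_uri = "http://revyu.com/things/example-item"  # hypothetical item URI

    # Ask for RDF/XML rather than HTML, and handle the redirect manually so the
    # 303 "See Other" response is visible.
    response = requests.get(item_uri,
                            headers={"Accept": "application/rdf+xml"},
                            allow_redirects=False)

    if response.status_code == 303:
        # The Location header names a separate document that describes the
        # (non-information) resource identified by item_uri.
        describing_doc = urljoin(item_uri, response.headers["Location"])
        rdf_doc = requests.get(describing_doc,
                               headers={"Accept": "application/rdf+xml"})
        print(describing_doc, rdf_doc.headers.get("Content-Type"))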
4 Deriving Semantics from Tagging Data

When creating Revyu, a significant decision was taken not to require users to classify the items they were reviewing, but instead to associate keyword tags with each item. This decision was taken for several reasons: firstly, there was seen to be a lack of sufficiently comprehensive classifications of the items that users may want to review; secondly, requiring all users to subscribe to a single classification scheme for reviewed items seemed unnecessarily constraining and against the spirit of the Semantic Web; thirdly, providing a usable interface through which non-specialists could classify items using arbitrary types discovered in ontologies on the Semantic Web was seen as unfeasible; and lastly, the coverage provided by ontologies readily available on the Web was deemed insufficient to describe all items that might be reviewed, potentially resulting in a more closed world of reviewed items. The recent availability of Yago [10] class definitions via DBpedia [11] has gone some way to addressing these issues, and we will investigate the use of these classes in future work. However, we believe that tagging strikes the appropriate balance, remaining usable whilst also providing sufficient data from which stronger semantics can be derived.

At present we use tagging data in two ways: to identify basic semantic relationships between tags, and to derive type information about reviewed items. Tags that are frequently associated with the same item are assumed to be related in some way. In the HTML pages about each tag, tags that co-occur above a certain threshold are displayed to the user. This threshold is set low for HTML output, as human readers of the page are unlikely to infer erroneous information based on these relationships. In contrast, relationships exposed in RDF descriptions of tags (using the skos:related property) are based on a more conservative threshold, in order to avoid erroneous inferences based on these assertions. In ongoing work we are investigating the derivation of more precise relationships (such as superclass/subclass) between tags, based on tagging data.

We currently derive type information from tagging data in two domains, books and films, relying on external data sources to help ensure accurate results. Firstly, where items are tagged 'book' we parse Web links provided by the reviewer that relate to the item, and attempt to extract ISBNs embedded in these links. Where we are able to extract an ISBN in this fashion we conclude that the reviewed item is in fact a book, and assert a corresponding rdf:type statement into the triplestore. If an item has been tagged 'film' or 'movie', we execute a query against the DBpedia SPARQL endpoint4 in order to find any entries of type yago:Film that have the same name as the reviewed item. If a match is found then we conclude that the item is in fact a film, and add an rdf:type statement to this effect to the triplestore. These type statements for both books and films are exposed in the RDF descriptions of items on Revyu, and are also used as the basis for showing additional relevant data in the HTML pages about an item, as detailed in the following section.

4 http://dbpedia.org/sparql
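The following sketch illustrates the two checks described above, in Python rather than the site's PHP: extracting an ISBN from a reviewer-supplied link, and querying the DBpedia SPARQL endpoint for films whose label matches the item name. The ISBN pattern, the Yago class URI, and the exact query shape are assumptions made for illustration, not details taken from the Revyu implementation.

    # Sketch of the two type-derivation checks; assumptions noted in comments.
    import re
    from SPARQLWrapper import SPARQLWrapper, JSON

    def extract_isbn(link):
        """Return an ISBN-10 embedded in a reviewer-supplied link, or None."""
        # Assumed pattern: a ten-character ISBN segment, as in Amazon-style URLs.
        match = re.search(r"/(\d{9}[\dXx])(?:[/?]|$)", link)
        return match.group(1) if match else None

    def matching_dbpedia_films(title):
        """Return DBpedia resources with a film type (assumed class URI) and a matching label."""
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?film WHERE {
              ?film a <http://dbpedia.org/class/yago/Film> ;
                    rdfs:label ?label .
              FILTER (str(?label) = "%s")
            }""" % title.replace('"', '\\"'))
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["film"]["value"] for b in results["results"]["bindings"]]

    # Example usage ('Broken Flowers' appears in Fig. 2; the Amazon link is hypothetical):
    print(extract_isbn("http://www.amazon.com/dp/0123456789"))
    print(matching_dbpedia_films("Broken Flowers"))

In the live system, a successful check of either kind would result in a corresponding rdf:type statement being asserted into the triplestore, as described above.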
5 Production and Consumption of Linked Data

Validating Revyu data against external sources not only allows the derivation of more reliable type information than would be possible using tags alone, it also allows items on Revyu to be linked with items from heterogeneous external data sources such as DBpedia5, the Open Guides6, and FOAF data. Where matches are found, we use the owl:sameAs property to assert that two URIs identify the same resource. Publishing these links in RDF helps create a Web of Data rather than simply isolated islands of RDF; Revyu data is in the Web, not just on the Web.

Fig. 1. Links from Revyu.com to external data sets: DBpedia (films), the RDF Book Mashup (books), FOAF data, the Open Guide to Milton Keynes (amenities), and Geonames (hotels; coming soon)

5 http://dbpedia.org/
6 http://openguides.org/

We actively exploit the links we set between Revyu and external data sources to enhance the experience of our users, without placing an additional burden on reviewers by requiring them to supply additional information about the reviewed item. For example, where owl:sameAs statements exist linking films on Revyu to their entries in DBpedia, we retrieve additional information about the film, such as the URI of the film's promotional poster and the name of the director. This information is displayed on the Revyu HTML page about the film (as shown in Fig. 2), thereby enhancing the value of the site for users without requiring this information to be manually entered into Revyu. Similarly, we use owl:sameAs links between Revyu and the RDF Book Mashup [12] as the basis for retrieving book cover and author information, which is then displayed on the Revyu HTML page about the book (see 7 for an example).

7 http://revyu.com/things/the-unwritten-rules-of-phd-research/about/html

In the RDF descriptions of items we take a slightly different approach to that taken with the HTML output, choosing to simply expose the links between items without republishing RDF data from external sources. This approach could be described as using Semantic Web data to produce Web2.0-style mashups at the human-readable, HTML level, whilst also mashing up (i.e. linking) data at the RDF level. Not only does this Linked Data approach to mashups reduce issues with licensing of data for republication, it is also a more Web-like approach; duplicating data is of much lesser value than linking to it, and the user agent of the future should be able to 'look ahead' to linked items and merge data accordingly.

It should be noted that we do not claim the Revyu Web2.0-style mashups represent something that could not have been achieved using conventional Web2.0 approaches. However, the following features distinguish our approach: the simultaneous publishing of data-oriented and human-oriented mashups, so that the data integration effort we have invested is not lost but can be reused by other parties; the ability to easily integrate additional heterogeneous sources using RDF; and the substantially reduced development costs in producing human-oriented mashups through the use of Semantic Web technologies.

Whilst to date we have waited for new film reviews on Revyu and then attempted to match them automatically with entries in DBpedia, we are currently preparing to import into Revyu 'skeleton' records covering 12,000 films described in DBpedia. Each record simply includes the title of the film, a statement indicating that the item is of type 'Film', a number of keyword tags, and a link to the corresponding item in DBpedia, as sketched below.
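The sketch below shows, in Python with rdflib, roughly what such a skeleton record could contain. The item URI is inferred from footnote 10, and the use of rdfs:label for the title, tags:taggedWithTag for tagging, and a Yago class for the type are assumptions; the paper does not specify the exact vocabulary Revyu uses for these statements.

    # Illustrative skeleton record for one film, linked to DBpedia via owl:sameAs.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF, RDFS

    TAGS = Namespace("http://www.holygoat.co.uk/owl/redwood/0.1/tags/")  # Tag ontology [7]
    YAGO = Namespace("http://dbpedia.org/class/yago/")                   # assumed class namespace

    # Item URI inferred from footnote 10; all property choices are assumptions.
    item = URIRef("http://revyu.com/things/broken-flowers-film-movie-bill-murray-jim-jarmusch-sharon")

    g = Graph()
    g.add((item, RDFS.label, Literal("Broken Flowers")))        # the film's title
    g.add((item, RDF.type, YAGO.Film))                          # 'this item is a Film'
    g.add((item, TAGS.taggedWithTag, URIRef("http://revyu.com/tags/film")))  # assumed tag URI
    g.add((item, OWL.sameAs, URIRef("http://dbpedia.org/resource/Broken_Flowers")))  # link to DBpedia

    print(g.serialize(format="turtle"))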
Not only will this provide a foundation on which new reviews can be created, it will also ensure that all films reviewed in the future will already be interlinked with the corresponding DBpedia entry, and thus with the Web of Data.

This skeleton record approach has already been followed when linking Revyu to data from the Open Guide to Milton Keynes8, a member of the Open Guides family of wiki-based city guides that expose data in RDF. Milton Keynes is a city in south-east England, and home of The Open University. Whilst some amenities in the city, such as pubs and restaurants, were already reviewed on Revyu, many more were listed in the Open Guide due to its longer history. Therefore, after identifying items existing in both locations and making the appropriate mappings to avoid duplication, we created skeleton records in Revyu for the remaining items, setting links back to their Open Guide URIs. This has enabled latitude and longitude data for many items to be retrieved from RDF exposed by the Open Guide, and used to show a Google Map of the item's location (see 9 for an example). The same approach can also be used to expose the address, telephone, and opening time information held in the Open Guide.

8 http://miltonkeynes.openguides.org/
9 http://revyu.com/things/ye-olde-swan-woughton-on-the-green-milton-keynes/about/html

Fig. 2. Excerpts from the Revyu HTML page about the film Broken Flowers, showing the film poster, director information, and summary drawn from DBpedia10

Fig. 3. Excerpts from the first author's Revyu profile page, showing data sourced automatically from his external FOAF file11

10 http://revyu.com/things/broken-flowers-film-movie-bill-murray-jim-jarmusch-sharon/about/html
11 http://revyu.com/people/tom/about/html

Similar principles are also applied to user information, such that people registering with the site are not required to provide copious information to populate their user profile. Instead, where they have an existing FOAF description at an external location they may provide its URI, in which case Revyu dereferences this URI and queries the resulting graph for relevant information (such as a photo, location, home page address, and interests), which is then displayed on their profile page, as illustrated in Fig. 3. This approach reduces the burden on users by not requiring them to manage multiple redundant sets of personal information stored in different locations. Furthermore, where users have assigned themselves a URI in their FOAF description, Revyu sets owl:sameAs links asserting that this URI identifies the same resource as the user's Revyu URI. Users can also state that they know other Revyu reviewers, at which point this relationship is recorded in the triplestore using the foaf:knows property, and exposed (privacy settings permitting) in the user's RDF description on the Revyu site. This ensures that social networking data created in one location is not automatically rendered inaccessible to other services.
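A sketch of this FOAF import step is given below, again in Python with rdflib rather than Revyu's PHP. The example FOAF URI and Revyu person URI are hypothetical, and the specific FOAF properties queried (foaf:primaryTopic, foaf:depiction, foaf:homepage, foaf:interest) are assumptions; the paper states only that a photo, location, home page address, and interests are retrieved.

    # Illustrative sketch of importing profile data from a user-supplied FOAF URI.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import OWL, RDF

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    foaf_uri = "http://example.org/people/alice/foaf.rdf"    # hypothetical user-supplied URI
    revyu_person = URIRef("http://revyu.com/people/alice")   # hypothetical Revyu URI

    g = Graph()
    g.parse(foaf_uri)  # dereference the FOAF description and parse the returned RDF

    # Locate the person the document is about: prefer foaf:primaryTopic, fall back
    # to any resource typed as foaf:Person.
    person = g.value(URIRef(foaf_uri), FOAF.primaryTopic)
    if person is None:
        person = next(g.subjects(RDF.type, FOAF.Person), None)

    if person is not None:
        profile = {
            "photo": g.value(person, FOAF.depiction),
            "homepage": g.value(person, FOAF.homepage),
            "interests": list(g.objects(person, FOAF.interest)),
        }
        print(profile)
        # Where the user has minted their own URI, record the identity link that
        # Revyu asserts between it and their Revyu URI.
        if isinstance(person, URIRef):
            g.add((revyu_person, OWL.sameAs, person))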
6 Future Work and Conclusions

In addition to encouraging further user participation in order to increase the value delivered by the site, we plan to integrate Revyu with a number of additional data sets. Most notably, we are preparing to create skeleton records in Revyu for 70,000 hotels worldwide, linked to their corresponding entries in the Geonames dataset. The same approach will also be used to link Revyu with data from other Open Guides, such as London and Boston. Additional data will be integrated as further relevant sources become available. It should be noted that our aim in linking to external datasets is not to constrain, but merely to seed, users' conceptions of what can be reviewed. As we integrate further data sets we hope to achieve a more automated linking process by investigating generic similarity matching techniques for operation on the wider Semantic Web.

Whilst importing external review data into Revyu is frequently suggested as an additional feature, at present there are no concrete plans to do so, for a number of reasons. Firstly, to the best of our knowledge Revyu is the only site serving reviews as Linked Data according to current best practices, which limits our ability to interlink Revyu with external review data sets; secondly, little review data is available under a suitable license; lastly, our ongoing research is predicated on the ability to combine review data with social networks, requiring some global identifier (such as foaf:mbox_sha1sum) to be available for each reviewer, which is rarely the case with traditional reviewing sites. By providing reviews in a reusable format that is easily integrated and interlinked with other data, Revyu provides core data for our ongoing work on information seeking, recommendation, and trust in social networks on the Web.

In conclusion, in this paper we have described Revyu, a human-usable reviewing and rating Web site built on Semantic Web technologies, and fundamentally designed to contribute to the realization of a Web of Data. Whilst superficially not unique in functionality, the site is rare in its status as a publicly available service in daily use that is oriented towards human users, yet also embodies current best practices in developing for the Semantic Web.

Acknowledgements

This research was partially supported by the Advanced Knowledge Technologies (AKT) and OpenKnowledge (OK) projects. AKT is an Interdisciplinary Research Collaboration (IRC) sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. Peter Coetzee did a superb job of turning data into skeleton records for import into Revyu. Lastly, the Open Guides and DBpedia communities, and the RDF Book Mashup team, deserve our special thanks.

References

1. Berners-Lee, T.: Linked Data. http://www.w3.org/DesignIssues/LinkedData.html (2006)
2. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/ (2007)
3. Guha, R.: Open Rating Systems. In: Proc. 1st Workshop on Friend of a Friend (2004)
4. Golbeck, J., Hendler, J.: FilmTrust: Movie Recommendations using Trust in Web-based Social Networks. In: Proc. IEEE Consumer Communications and Networking Conference (2006)
5. Heath, T., Motta, E.: Ease of Interaction plus Ease of Integration: Combining Web2.0 and the Semantic Web in a Reviewing Site. Journal of Web Semantics, 5 (to appear)
6. Brickley, D., Miller, L.: FOAF Vocabulary Specification 0.9. http://xmlns.com/foaf/0.1/ (2007)
7. Newman, R., Russell, S., Ayers, D.: Tag Ontology. http://www.holygoat.co.uk/owl/redwood/0.1/tags/ (2005)
8. Oldakowski, R., Bizer, C., Westphal, D.: RAP: RDF API for PHP. In: Proc. 1st Workshop on Scripting for the Semantic Web, 2nd European Semantic Web Conference (ESWC2005) (2005)
9. W3C Technical Architecture Group: httpRange-14: What is the range of the HTTP dereference function? http://www.w3.org/2001/tag/issues.html#httpRange-14 (2005)
10. Suchanek, F. M., Kasneci, G., Weikum, G.: Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In: Proc. 16th International World Wide Web Conference (WWW2007) (2007)
11. Auer, S., Lehmann, J.: What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. In: Proc. 4th European Semantic Web Conference (ESWC2007) (2007)
12. Bizer, C., Cyganiak, R., Gauss, T.: The RDF Book Mashup: From Web APIs to a Web of Data. In: Proc. 3rd Workshop on Scripting for the Semantic Web, at 4th European Semantic Web Conference (ESWC2007) (2007)