Quality Labeling of Web Content: The Quatro approach

Vangelis Karkaletsis
NCSR “Demokritos”
P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece
+30 210 6503197
vangelis@iit.demokritos.gr

Andrea Perego
Università degli Studi di Milano
via Comelico 39/41, I-20135 Milano MI, Italy
+39 02503 16273
perego@dico.unimi.it

Phil Archer
Internet Content Rating Association
22 Old Steine, Brighton, East Sussex, BN1 1EL United Kingdom
+44 (0)1473 434770
parcher@icra.org

Kostas Stamatakis
NCSR “Demokritos”
P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece
+30 210 6503215
kstam@iit.demokritos.gr

Pantelis Nasikas
NCSR “Demokritos”
P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece
+30 210 6503197
pnas@iit.demokritos.gr

David Rose
Coolwave Limited
4-6 Greenfield House, Storrington, Nr Pulborough, West Sussex, UK
+44 (0)870 7127000
david@coolwave.co.uk

ABSTRACT
QUATRO is an on-going EC-funded project which aims to provide a common vocabulary and machine readable schema for quality labeling of Web content, as well as ways to automatically show the contents of the label(s) found in a Web resource, and functionalities for checking the validity of these labels. The paper presents the QUATRO processes for label validation and user notification, and outlines the architecture of the QUATRO system.

Categories and Subject Descriptors
H.3.5 Online Information Services: Web-based services

General Terms
Management, Reliability, Experimentation, Verification.

Keywords
Quality labeling, web content analysis, RDF schemas

1. INTRODUCTION
QUATRO is an on-going EC-funded project which aims to provide a common vocabulary and machine readable schema for quality labeling of web content, making it possible for the many existing labeling schemes to be brought together through a single, coherent approach without affecting the individual scheme’s criteria or independence [1].

QUATRO’s work on providing a platform for machine-understandable quality labels, also called trustmarks, is part of a much greater activity around the world, that of the Semantic Web [2]. Three QUATRO partners, ERCIM, as European host for W3C, and ICRA and NCSR, as W3C members, are active participants in this activity. RDF, the Resource Description Framework [3], is the key technology behind the Semantic Web, providing a means of expressing data on the web in a structured way that can be processed by machines. It allows a machine to recognize that, for example, 5 blogs are commenting on the same web site, that 3 people have the same site in their (online) bookmarks (favorites) and that it gets a 4.5 rating on a recommender system.

QUATRO adds to the picture in two ways: by providing a way in which any number of web resources can easily share the same description, and by providing a common vocabulary that can be used by labeling authorities. As a result, machines will be able to recognize that a site mentioned in a blog that gets a 4.5 star rating on a recommender system and is in 3 friends’ online bookmarks also has a label. By basing the labels on RDF, QUATRO is effectively promoting the addition of data on the web that a wide variety of other applications can use to build trust in a given resource.

At the time of writing this paper, the details of the QUATRO vocabulary have been finalized and the complete vocabulary is available on the QUATRO site and elsewhere, both as a plain text document and an RDF schema [4]. It will be available for free usage by Labeling Authorities (LAs) as they see fit. The project’s vocabulary is divided into four categories:
- General Criteria, such as whether the labelled site uses clear language that is fit for purpose, includes a privacy statement, data protection contact point etc.
- Criteria for labelling to ensure accuracy of information such as the content provider’s credentials and appropriate disclosure of funding.
- Criteria for labelling to ensure compliance with rules and legislation for e-business such as fair marketing practices and measures to protect children.
- Terms used in operating the trust mark scheme itself such as the date the label was issued, when it was last reviewed and by whom.

LAs will, of course, continue to devise their own criteria. However, where those criteria are equivalent to those in the QUATRO schema, use of common elements offers some distinct advantages.

Work is now underway to develop applications to make use of the machine-readable labels:
- An application for checking the validity of machine-readable labels found in web resources. A label’s validity is checked against the corresponding information found in the LA’s database. Furthermore, QUATRO also enables, for some cases, the checking of a label’s validity against the content of the web resource. The application is implemented as a proxy server, named QUAPRO.
- A browser extension, named ViQ, which enables the visual interpretation of the labels found in the web resource requested by the user, according to QUAPRO results. A user is therefore able to see that a site has a label and be notified of the label’s validity and content.
- A wrapper for search engines’ results, named LADI, which indicates the presence of label(s) on the web sites listed. This will be available for inspection by clicking an icon adjacent to the relevant result. As in the case of ViQ, label validation and user notification will be performed by QUAPRO.

This paper briefly presents the QUATRO processes for label validation and user notification (Section 2), the QUATRO architecture and the main functionalities of the components of the system implementing this architecture (Section 3).

2. Label validation and User notification
Before displaying the content of a label identified in a web resource, it is necessary to examine whether the label is a valid one against either the Labeling Authority’s (LA) database or the content of the web resource. For this purpose, QUATRO employs two validation processes.

The first one concerns the label’s integrity, independently from the content of the web resource. A label is generated by the corresponding LA at some point in time, and represents the content of the web resource at that time. It is possible that the provider of the web resource’s content has changed the label’s content without informing the LA. The validation mechanism must enable the checking of the label’s content against the corresponding content stored in the LA’s database, in order to ensure the label’s integrity. This does not mean that a label that satisfies the integrity constraint is actually valid, since the content of the web resource may have changed. On the other hand, we cannot be completely sure that a label which does not satisfy our integrity constraints is necessarily invalid.

That is why examining a label’s integrity must be supported, whenever this is possible, by an additional comparison of the label’s content against the actual resource content. This constitutes the second QUATRO validation process. It is difficult to automate this validation check since it involves the use of advanced content analysis techniques. In the context of QUATRO, we use the content analyzer FilterX [5] in one of the case studies.

The criteria according to which a label should be considered valid/invalid may vary depending on the specific labeling scheme. We distinguish two different scenarios.

In the first scenario, the labels are stored at the LA’s site. In such a case, labels cannot be modified directly by the web resources’ content providers, and thus their integrity is guaranteed. That is, in this case, we can only examine whether the resource’s content has been modified and whether the updated content is no longer in line with the label’s content.

In the second scenario, labels are stored at the labeled resource site. Since such labels are not under the control of the LA, they can be easily modified by the resources’ content providers. In order to verify their validity, QUATRO needs to be able to verify a) whether the label stored at the labeled resource site is the same as the one that has been generated by the LA (integrity control) and b) whether the label has not expired (date control). The former may be enforced by a hash-matching mechanism and the latter by a date-comparison mechanism.

More precisely, concerning integrity control, whenever a label is generated, the LA hashes the label and the produced hash is stored in the LA database. Whenever a label is located inside a web resource, QUATRO hashes it and asks the LA to verify whether this hash matches the hash of the label stored in the LA’s database. In addition, for every label generated by the LA, a label expiry date parameter is set, which means that the label is valid until that specific date. Therefore, QUATRO gets this valid-until date from the LA in order to check the label’s validity.

Finally, as noted before, whenever a content analyzer is available, QUATRO can perform an additional check examining the content of the web resource against the label’s content.

Thus, three different policies can be enforced for label validation: label integrity, label expiry date, and content analysis (meaning the semantic equivalence between the actual resource content and the description provided by the label).

Note that it may also be the case that the label cannot be validated. For instance, the LA database may be down, the hosting server may be off-line, or QUATRO’s proxy (QUAPRO) may be unavailable. In such cases we can simply say that the validity of the label cannot be verified. This applies even to the case when a content analyzer is not able to decide whether a label is valid or not. Thus we have the following possible results when evaluating labels: valid, invalid, and cannot be verified.

As for user notification, this is performed in order to inform users whether a resource is labeled or not. Yet, when labels are invalid, the description they provide is useless. Thus, we can devise two different strategies for considering a resource as labeled:
- when valid labels are associated with it,
- when labels are associated with it, independently from their validity.

QUATRO adopts the latter strategy, since it aims at informing users about the characteristics of the requested resources, not at blocking inappropriate contents. In addition, QUATRO validation policies allow the verification of labels’ validity against the LA’s database in all cases, whereas the validation of the label’s content against the resource’s content can only be done when a content analyzer is available for the specific case. Thus, QUATRO’s approach allows the user to access the content of a label even when it is not valid. After being notified whether a label is valid or not, users can display the contents of any available label. It is up to them to decide whether they will trust it or not.

Label notification may then return one of the following results:
- The requested resource is unlabelled: The end user is informed that no label is available for the requested resource.
- The requested resource is labeled: The end user is informed that labels are present, and he/she is notified whether they are valid, invalid, or they cannot be evaluated.

Further work on the label validation scheme will include incorporating XML Digital Signatures [11]. In this scenario an LA does not need to provide an online database with labels and hashes as a web service, just a way to locate its public key (e.g. as RDF/A metadata on its website). The label file will contain the digital signature of the hash. The hash will be generated as before, and we will generate the digital signature from it, rather than from the label itself, for performance reasons. Once the LA has created the label and its hash, and has signed the hash with a private key that it keeps secret, a user agent program can easily verify the integrity of the hash (and thus of the label) using the LA’s public key. One drawback of this validation scheme is that decrypting the digital signature with the public key in order to get back the original hash key might take too much time; we are working on this issue.

QUATRO Architecture
Figure 1 depicts the four applications participating in the QUATRO quality labels validation and notification tasks (ViQ, LADI, QUAPRO and FilterX). QUAPRO is the central server-based application which receives requests from the two end-user applications (ViQ, LADI), identifies quality labels, evaluates them and replies accordingly. A Data Access interface (DAcc), placed before an LA’s database, handles the communication between QUAPRO and the database.

The applications mentioned above have to exchange messages since QUAPRO needs information from all the parties involved (ViQ/LADI, LA’s database, content analyzer) to assess the labels’ validity. The Simple Object Access Protocol (SOAP), a W3C recommendation [6], is used for this purpose. An XML schema has been devised that must be followed by any application that wants to use the services provided by QUAPRO. This makes it possible, for instance, to employ another content analysis tool or to add another labeling authority. SOAP has been selected because it uses HTTP (in our case) as its transfer protocol, and therefore no special configuration is required from the end user when installing the ViQ plug-in.

Figure 1. QUATRO architecture

The next sub-sections provide more information on the functionalities of QUATRO components.

2.1 ViQ
The Metadata Visualizer (ViQ) is a client application in charge of two main tasks:
- to notify users whether a requested Web resource is associated with content labels or not;
- to display to the users the contents of the labels associated with Web resources.

ViQ is being developed as a browser extension for the three most popular Web browsers (i.e., MS Internet Explorer, Mozilla Firefox and Opera), providing a toolbar (the ViQ Toolbar), a status bar icon, and an additional item in the browser main menu. Users are notified of the presence/absence of labels by specific icons. If labels are available, the user can display their contents.

ViQ relies on QUAPRO for verifying labels’ validity. Moreover, QUAPRO will be in charge of returning the information needed by ViQ to display the label summary and details. More precisely, whenever a Web resource is requested by the user, ViQ performs the following steps:
- if QUAPRO says that labels are absent, the user is notified that no labels are available for the requested resource;
- otherwise, ViQ notifies the user that labels are present, and it displays the list of available labels, each marked with an icon denoting its validity status (valid, invalid, or “cannot be verified”; see Figure 2).

Figure 2. ViQ browser extension

2.2 LADI
The Search Engine Wrapper LADI is a server application that gives users an indication of the existence of a label or labels inside the web resources listed in search engine results and then allows them to see more detailed information about those labels.

As with ViQ, LADI calls on QUAPRO to provide label summary and details and to verify the validity of labels. Where ViQ provides information about resources that have already been visited, LADI will provide the same or similar information before a resource is visited. LADI’s task is therefore quite different in that it must check with QUAPRO for each of, say, ten results per page of search results that are viewed per user search. It must then provide the indicators and a method for viewing the information within the browser as part of the search result listing returned to the user.

So, LADI will:
- Provide a web search form initially.
- Accept a search term from the user and, using the appropriate API, perform a server-to-server request to the appropriate search engine (Google, Yahoo! in QUATRO case studies).
- For each of the resources returned by the search engine(s), make a server-to-server request to QUAPRO to check for the existence of a label or labels and to obtain the information about those labels.
- Produce the HTML for the search results to be returned to the user, merging the results obtained from the chosen search engine with any relevant information from QUAPRO.

Figure 3. LADI-annotated search results

2.3 QUAPRO
QUAPRO is a server-based application that processes requests from both ViQ and LADI. In order to decide on a quality label’s validity, QUAPRO can perform 3 different types of controls: date control, hash control, content analysis control. The first two checks are used to decide on the label’s validity against the LA’s database, whereas the third check examines the label’s validity against the content of the corresponding resource. In case all three checks are used, a composition of the verdicts gives the final validity value for the label (valid, invalid, “cannot be verified”).

QUAPRO accepts either a single URL (ViQ) or a list of URLs (LADI) and checks if they are labeled. It looks for links to labels in the HTML code of the web page or the HTTP headers when accessing a URL. If a label is found, QUAPRO proceeds by querying the label to find the label’s creator and subsequently returns this information to ViQ/LADI. QUAPRO uses the SPARQL query language [7] for accessing information stored in the RDF labels, such as the label creator, the label expiry date and the URLs that this label applies to.

When QUAPRO receives a request for one of the labels found in a specific URL, it queries the label in order to find its expiry date, creates its hash and contacts the corresponding LA database (via DAcc) to assess the validity of the label. While waiting for the DAcc response, and in case a content analyzer is available (FilterX in our case), it also sends a message to it. When the responses from DAcc and the content analyzer arrive, QUAPRO compiles the new message to be sent to ViQ/LADI. This message contains links to unique URLs in the QUAPRO server that contain the labels in natural language, so that they can be accessed if requested from ViQ/LADI.

2.4 DAcc
The labeling authorities maintain a database of the web sites that have been labeled as well as metadata about the labels such as expiration date, language, the hash key for the label. For QUAPRO, DAcc is a "black box" receiving and sending SOAP messages in conformity with the SOAP messages schema.

The DAcc application receives from QUAPRO the URL of the web site, the URL of the RDF label on the web site and the hash key generated by QUAPRO. In response, DAcc returns whether the hash keys match, and the expiration date status.

2.5 FilterX
FilterX is a content analyzer which enables the intelligent blocking of obscene content accessible through browsers on the World Wide Web. FilterX is a product of i-sieve [5], a spin-off of QUATRO’s partner NCSR "Demokritos". I-sieve provides FilterX to NCSR for the research purposes of the QUATRO project.

For the purposes of QUATRO, FilterX has been adapted to perform as an independent software module which will be invoked by QUAPRO to evaluate labeled Web resources and return a message compatible with the QUATRO specification. So, FilterX accepts a URL sent by QUAPRO and returns a message with the results of content analysis.

3. Concluding remarks
Currently, web sites carrying quality labels such as those administered by the QUATRO partners, Internet Quality Agency and Web Mèdica Acreditada, carry a logo. Clicking the logo results in the display of a database entry confirming the logo’s validity, last review date etc. However, such labels work in isolation and are only visible to human visitors to sites. They cannot be harvested, aggregated or otherwise utilised by machines.

QUATRO offers a substantial improvement to the current situation. First, project members have worked to create a flexible platform that encodes the labels. Secondly, it offers a vocabulary that encompasses the common elements of a wide variety of labeling schemes. The two together have the potential to make many different quality labels highly interoperable. It must be noted that Segala [8] is using the system to encode its certification scheme for web accessibility. RDF content labels are also examined in a W3C Incubator Activity [9] which is feeding directly into the Mobile Web Initiative’s development of a mobileOK trustmark [10].

Furthermore, QUATRO provides the means for users navigating the web with a common web browser to be notified when quality labels are present (using appropriate graphics) and, if they are, whether they are valid or not. The two end-user applications, ViQ and LADI, currently under development, serve this purpose.

4. Acknowledgments
This research was partially funded by the EC through the SIAP project QUATRO (Quality Assurance and Content Description). QUATRO involves the following partners: Pira International (Coordinator), Internet Content Rating Association, Internet Quality Agency, Web Mèdica Acreditada, NCSR “Demokritos”, University of Milan, Coolwave, ECP.NL, ERCIM.

5. References
[1] http://www.quatro-project.org
[2] http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
[3] http://www.w3.org/RDF/
[4] http://purl.oclc.org/quatro/elements/1.0/
[5] http://www.i-sieve.com
[6] http://www.w3.org/TR/soap
[7] http://www.w3.org/TR/rdf-sparql-query/
[8] http://www.segala.com
[9] http://www.w3.org/2005/Incubator/wcl/wcl-charter-20060208.html
[10] http://www.w3.org/Mobile/
[11] http://www.w3.org/TR/xmldsig-core/
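
Illustrative sketch. The three controls described for QUAPRO (hash control, date control and content analysis control) and the composition of their verdicts can be illustrated with the minimal Python sketch below. It is not the QUATRO implementation, and it rests on several assumptions: SHA-256 stands in for the hash function, which the paper does not name; the LARecord type and all function names are hypothetical; the comparison that DAcc performs on the LA side is collapsed into a local function; and the composition rule (any failed check gives "invalid", otherwise any undecided check gives "cannot be verified") is only one plausible reading of "a composition of the verdicts".

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    CANNOT_BE_VERIFIED = "cannot be verified"


@dataclass
class LARecord:
    """Hypothetical shape of what the LA database holds for one label."""
    label_hash: str
    valid_until: date


def hash_label(label_bytes: bytes) -> str:
    # Assumption: SHA-256; the paper does not specify the hash function.
    return hashlib.sha256(label_bytes).hexdigest()


def hash_control(label_bytes: bytes, record: Optional[LARecord]) -> Verdict:
    # record is None when the LA database (or DAcc) could not be reached.
    if record is None:
        return Verdict.CANNOT_BE_VERIFIED
    same = hash_label(label_bytes) == record.label_hash
    return Verdict.VALID if same else Verdict.INVALID


def date_control(record: Optional[LARecord], today: date) -> Verdict:
    if record is None:
        return Verdict.CANNOT_BE_VERIFIED
    return Verdict.VALID if today <= record.valid_until else Verdict.INVALID


def compose(*verdicts: Verdict) -> Verdict:
    # Assumed rule: a failed check wins, then an undecided check.
    if Verdict.INVALID in verdicts:
        return Verdict.INVALID
    if Verdict.CANNOT_BE_VERIFIED in verdicts:
        return Verdict.CANNOT_BE_VERIFIED
    return Verdict.VALID


def validate_label(label_bytes: bytes,
                   record: Optional[LARecord],
                   today: date,
                   content_verdict: Optional[Verdict] = None) -> Verdict:
    checks = [hash_control(label_bytes, record), date_control(record, today)]
    # The content analysis verdict is included only when an analyzer exists.
    if content_verdict is not None:
        checks.append(content_verdict)
    return compose(*checks)
```

In this sketch, a label whose bytes were altered after the LA stored its hash fails hash control, a label checked after its valid-until date fails date control, and an unreachable LA database (record set to None) yields "cannot be verified".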