=Paper=
{{Paper
|id=Vol-190/paper-9
|storemode=property
|title=Quality Labeling of Web Content: The Quatro approach
|pdfUrl=https://ceur-ws.org/Vol-190/paper09.pdf
|volume=Vol-190
|authors=Vangelis Karkaletsis,Andrea Perego,Phil Archer,Kostas Stamatakis,Pantelis Nasikas and David Rose
|dblpUrl=https://dblp.org/rec/conf/mtw/KarkaletsisPASNR06
}}
==Quality Labeling of Web Content: The Quatro approach==
Vangelis Karkaletsis, NCSR “Demokritos”, P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece, +30 210 6503197, vangelis@iit.demokritos.gr
Andrea Perego, Università degli Studi di Milano, via Comelico 39/41, I-20135 Milano MI, Italy, +39 02503 16273, perego@dico.unimi.it
Phil Archer, Internet Content Rating Association, 22 Old Steine, Brighton, East Sussex, BN1 1EL United Kingdom, +44 (0)1473 434770, parcher@icra.org
Kostas Stamatakis, NCSR “Demokritos”, P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece, +30 210 6503215, kstam@iit.demokritos.gr
Pantelis Nasikas, NCSR “Demokritos”, P. Grigoriou & Neapoleos str., 15310 Ag. Paraskevi Attikis, Greece, +30 210 6503197, pnas@iit.demokritos.gr
David Rose, Coolwave Limited, 4-6 Greenfield House, Storrington, Nr Pulborough, West Sussex, UK, +44 (0)870 7127000, david@coolwave.co.uk
ABSTRACT
QUATRO is an on-going EC-funded project which aims to provide a common vocabulary and machine readable schema for quality labeling of Web content, as well as ways to automatically show the contents of the label(s) found in a Web resource, and functionalities for checking the validity of these labels. The paper presents the QUATRO processes for label validation and user notification, and outlines the architecture of the QUATRO system.
Categories and Subject Descriptors
H.3.5 Online Information Services: Web-based services
General Terms
Management, Reliability, Experimentation, Verification.
Keywords
Quality labeling, web content analysis, RDF schemas
1. INTRODUCTION
QUATRO is an on-going EC-funded project which aims to provide a common vocabulary and machine readable schema for quality labeling of web content, making it possible for the many existing labeling schemes to be brought together through a single, coherent approach without affecting the individual scheme’s criteria or independence [1].
QUATRO’s work on providing a platform for machine-understandable quality labels, also called trustmarks, is part of a much greater activity around the world, that of the Semantic Web [2]. Three QUATRO partners, ERCIM, as European host for W3C, and ICRA and NCSR, as W3C members, are active participants in this activity. RDF, the Resource Description Framework [3], is the key technology behind the Semantic Web, providing a means of expressing data on the web in a structured way that can be processed by machines. It allows a machine to recognize that, for example, 5 blogs are commenting on the same web site, that 3 people have the same site in their (online) bookmarks (favorites) and that it gets a 4.5 rating on a recommender system.
QUATRO adds to the picture in two ways: by providing a way in which any number of web resources can easily share the same description; by providing a common vocabulary that can be used by labeling authorities. As a result, machines will be able to recognize that a site mentioned in a blog that gets a 4.5 star rating on a recommender system and is in 3 friends’ online bookmarks also has a label. By basing the labels on RDF, QUATRO is effectively promoting the addition of data on the web that a wide variety of other applications can use to build trust in a given resource.
At the time of writing this paper, the details of the QUATRO vocabulary have been finalized and the complete vocabulary is available on the QUATRO site and elsewhere, both as a plain text document and an RDF schema [4]. It will be available for free usage by Labeling Authorities (LAs) as they see fit. The project’s vocabulary is divided into four categories:
- General Criteria, such as whether the labelled site uses clear language that is fit for purpose, includes a privacy statement, data protection contact point etc.
- Criteria for labelling to ensure accuracy of information such as the content provider’s credentials and appropriate disclosure of funding.
- Criteria for labelling to ensure compliance with rules and legislation for e-business such as fair marketing practices and measures to protect children.
- Terms used in operating the trust mark scheme itself such as the date the label was issued, when it was last reviewed and by whom.
LAs will, of course, continue to devise their own criteria. However, where those criteria are equivalent to those in the QUATRO schema, use of common elements offers some distinct advantages.
Work is now underway to develop applications to make use of the machine-readable labels:
- An application for checking the validity of machine-readable labels found in web resources. A label’s validity is checked against the corresponding information found in the LA’s database. Furthermore, QUATRO also enables, for some cases, the checking of a label’s validity against the content of the web resource. The application is implemented as a proxy server, named QUAPRO.
- A browser extension, named ViQ, which enables the visual interpretation of labels found in the web resource requested by the user, according to QUAPRO results. A user is therefore able to see that a site has a label and be notified of the label’s validity and content.
- A wrapper for search engines’ results, named LADI, which indicates the presence of label(s) on the web sites listed. This will be available for inspection by clicking an icon adjacent to the relevant result. As in the case of ViQ, label validation and user notification will be performed by QUAPRO.
This paper briefly presents the QUATRO processes for label validation and user notification (Section 2), the QUATRO architecture and the main functionalities of the components of the system implementing this architecture (Section 3).
2. Label validation and User notification
Before displaying the content of a label identified in a web resource, it is necessary to examine whether the label is a valid one against either the Labeling Authority’s (LA) database or the content of the web resource. For this purpose, QUATRO employs two validation processes.
The first one concerns the label’s integrity, independently from the content of the web resource. A label is generated by the corresponding LA at some point in time, and represents the content of the web resource at that time. It is possible that the provider of the web resource’s content has changed the label’s content without informing the LA. The validation mechanism must enable the checking of the label’s content against the corresponding content stored in the LA’s database, in order to ensure the label’s integrity. This does not mean that a label that satisfies the integrity constraint is actually valid, since the content of the web resource may have changed. On the other hand, we cannot be completely sure that a label which does not satisfy our integrity constraints is necessarily invalid.
That’s why examining a label’s integrity must be supported, whenever this is possible, by an additional comparison of the label’s content against the actual resource content. This constitutes the second QUATRO validation process. It is difficult to automate this validation check since it involves the use of advanced content analysis techniques. In the context of QUATRO, we use the content analyzer FilterX [5] in one of the case studies.
The criteria according to which a label should be considered valid/invalid may vary depending on the specific labeling scheme. We distinguish two different scenarios.
In the first scenario, the labels are stored at the LA’s site. In such a case, labels cannot be modified directly by the web resources’ content providers, and thus their integrity is guaranteed. That is, in this case, we can only examine whether the resource’s content has been modified and whether the updated content is no longer in line with the label’s content.
In the second scenario, labels are stored at the labeled resource site. Since such labels are not under the control of the LA, they can be easily modified by the resources’ content providers. In order to verify their validity, QUATRO needs to be able to verify a) whether the label stored at the labeled resource site is the same as the one that has been generated by the LA (integrity control) and b) whether the label has not expired (date control). The former may be enforced by a hash-matching mechanism and the latter by a date-comparison mechanism.
More precisely, concerning integrity control, whenever a label is generated, the LA hashes the label and the produced hash is stored in the LA database. Whenever a label is located inside a web resource, QUATRO hashes it and asks the LA to verify whether this hash matches the hash of the label stored in the LA’s database. In addition, for every label generated by the LA, a label expiry date parameter is set, which means that the label is valid until that specific date. Therefore, QUATRO gets from the LA this valid-until date in order to check the label validity.
Finally, as noted before, whenever a content analyzer is available, QUATRO can perform an additional check examining the content of the web resource against the label’s content.
Thus, three different policies can be enforced for label validation: labels’ integrity, labels’ expiry date, and content analysis (meaning the semantic equivalence between the actual resource content and the description provided by the label).
Note that it may also be the case that the label cannot be validated. For instance, the LA database may be down, the hosting server may be off-line, or the QUATRO proxy (QUAPRO) may be unavailable. In such cases we can simply say that the validity of the label cannot be verified. This applies even to the case when a content analyzer is not able to decide whether a label is valid or not. Thus we have the following possible results when evaluating labels: valid, invalid, and cannot be verified.
As for user notification, this is performed in order to inform users whether a resource is labeled or not. Yet, when labels are invalid, the description they provide is useless. Thus, we can devise two different strategies for considering a resource as labeled:
- when valid labels are associated with it,
- when labels are associated with it, independently from their validity.
QUATRO adopts the latter strategy, since it aims at informing users about the characteristics of the requested resources, not at blocking inappropriate contents. In addition, QUATRO validation policies allow the verification of labels’ validity against the LA’s database in all cases, but the validation of the label’s content against the resource’s content can only be done when a content analyzer is available for the specific case.
Thus, QUATRO’s approach allows the user to access the content of a label, even if it is not valid. After being notified whether a label is valid or not, users can display the contents of any available label. It is up to them to decide whether they will trust it or not.
Label notification may then return one of the following results:
- The requested resource is unlabelled: The end user is informed that no label is available for the requested resource.
- The requested resource is labeled: The end user is informed that labels are present, and he/she is notified whether they are valid, invalid, or they cannot be evaluated.
Further work on the label validation scheme will include incorporating XML Digital Signatures [11]. In this scenario an LA does not need to provide an online database with labels and hashes as a web service, just a way to locate its public key (e.g. as RDF/A metadata on its website). The label file will contain the digital signature of the hash. The hash will be generated as before, and we will generate the digital signature from it, rather than from the label itself, for performance reasons. So, once the labeling authority creates the label and the hash, and signs the hash with a private key that it (the LA) keeps secret, a user agent program can easily verify the integrity of the hash (and thus the label) using the public key. One drawback of this validation scheme would be that it might take too much time to decrypt the digital signature with the public key in order to get back the original hash key, but we are working on it.
QUATRO Architecture
Figure 1 depicts the four applications participating in the QUATRO quality labels validation and notification tasks (ViQ, LADI, QUAPRO and FilterX). QUAPRO is the central server-based application which receives requests from the two end-user applications (ViQ, LADI), identifies quality labels, evaluates them and replies accordingly. A Data Access interface (DAcc), placed before an LA’s database, handles the communication between QUAPRO and the database.
The applications mentioned above have to exchange messages since QUAPRO needs information from all the parties involved (ViQ/LADI, LA’s database, content analyzer) to assess the labels' validity. The Simple Object Access Protocol (SOAP), a W3C recommendation [6], is used for this purpose. An XML schema has been devised that must be followed by any application that wants to use the services provided by QUAPRO. This makes it possible, for instance, to employ another content analysis tool or add another labeling authority. SOAP has been selected because it uses HTTP (in our case) as its transfer protocol, and therefore no special configuration is required from the end user when installing the ViQ plug-in.
The next sub-sections provide more information on the functionalities of QUATRO components.
2.1 ViQ
The Metadata Visualizer (ViQ) is a client application in charge of two main tasks:
- to notify users whether a requested Web resource is associated with content labels or not;
- to display to the users the contents of the labels associated with Web resources.
ViQ is being developed as a browser extension for the three most popular Web browsers (i.e., MS Internet Explorer, Mozilla Firefox and Opera), providing a toolbar (the ViQ Toolbar), a status bar icon, and an additional item in the browser main menu. Users are notified of the presence/absence of labels by specific icons. If labels are available, the user can display their contents. ViQ relies on QUAPRO for verifying labels’ validity. Moreover, QUAPRO will be in charge of returning the information needed by ViQ to display the label summary and details. More precisely, whenever a Web resource is requested by the user, ViQ performs the following steps:
- if QUAPRO says that labels are absent, the user is notified that no labels are available for the requested resource;
- otherwise, ViQ notifies that labels are present, and it displays the list of available labels, marked with an icon denoting their validity status (valid, invalid, and “cannot be verified”; see Figure 2).
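The way ViQ turns a QUAPRO answer into a user notification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer format (a list of dictionaries with `creator` and `status` keys), the function name `viq_notification` and the textual "icons" are hypothetical and not part of the QUATRO specification, where the actual exchange takes place through SOAP messages.

```python
from typing import Dict, List

# Hypothetical text stand-ins for the icons ViQ shows per validity status.
STATUS_ICONS = {
    "valid": "[ok]",
    "invalid": "[x]",
    "cannot be verified": "[?]",
}


def viq_notification(labels: List[Dict[str, str]]) -> str:
    """Build the notification text for one requested Web resource.

    `labels` is the (assumed) list of labels reported by QUAPRO; each entry
    carries a hypothetical 'creator' and a 'status' taken from the three
    validity outcomes named in the paper.
    """
    # Step 1: no labels reported, so the resource is unlabelled.
    if not labels:
        return "No labels are available for the requested resource."

    # Step 2: labels are present; list every label with its validity icon,
    # regardless of whether it is valid (QUATRO's notification strategy).
    lines = ["Labels are present for the requested resource:"]
    for label in labels:
        icon = STATUS_ICONS.get(label["status"], "[?]")
        lines.append(f"{icon} {label['creator']}: {label['status']}")
    return "\n".join(lines)
```

Note that invalid labels are still listed rather than hidden, which mirrors the paper's choice to inform users instead of blocking content.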
Figure 1. QUATRO architecture
Figure 2. ViQ browser extension
2.2 LADI
The Search Engine Wrapper LADI is a server application that gives users an indication of the existence of a label or labels inside the web resources listed in search engine results and then allows them to see more detailed information about those labels. As with ViQ, LADI calls on QUAPRO to provide label summary and details and to verify the validity of labels. Where ViQ provides information about resources that have already been
visited, LADI will provide the same or similar information before a resource is visited. LADI’s task is therefore quite different in that it must check with QUAPRO for each of, say, ten results per page of search results that are viewed per user search. It must then provide the indicators and a method for viewing the information within the browser as part of the search result listing returned to the user.
So, LADI will:
- Provide a web search form initially.
- Accept a search term from the user and, using the appropriate API, perform a server-to-server request to the appropriate search engine (Google, Yahoo! in QUATRO case studies).
- For each of the resources returned by the search engine(s), make a server-to-server request to QUAPRO to check for the existence of a label or labels and to obtain the information about those labels.
- Produce the HTML for the search results to be returned to the user, merging the results obtained from the chosen search engine with any relevant information from QUAPRO.
Figure 3. LADI-annotated search results
2.3 QUAPRO
QUAPRO is a server-based application that processes requests from both ViQ and LADI. In order to decide on a quality label’s validity, QUAPRO can perform 3 different types of controls: date control, hash control, content analysis control. The first two checks are used to decide on the label’s validity against the LA’s database, whereas the third check examines the label’s validity against the content of the corresponding resource. In case all three checks are used, a composition of the verdicts gives the final validity value for the label (valid, invalid, “cannot be verified”).
QUAPRO either accepts a single URL (ViQ) or a list of URLs (LADI) and checks if they are labeled. It looks for links to labels in the HTML code of the web page or the HTTP headers when accessing a URL. If a label is found, QUAPRO proceeds by querying the label to find the label’s creator and subsequently returns this information to ViQ/LADI. QUAPRO uses the SPARQL query language [7] for accessing information stored in the RDF labels, such as the label creator, the label expiry date and the URLs that this label applies to.
When QUAPRO receives a request for one of the labels found in a specific URL, it queries the label in order to find its expiry date, creates its hash and contacts the corresponding LA database (via DAcc) to assess the validity of the label. While waiting for the DAcc response, and in case a content analyzer is available (FilterX in our case), it also sends a message to it. When the responses from DAcc and the content analyzer arrive, QUAPRO compiles the new message to be sent to ViQ/LADI. This message contains links to unique URLs in the QUAPRO server that contain the labels in natural language so that they can be accessed if requested from ViQ/LADI.
2.4 DAcc
The labeling authorities maintain a database of the web sites that have been labeled as well as metadata about the labels such as expiration date, language, the hash key for the label. For QUAPRO, DAcc is a "black box" receiving and sending SOAP messages in conformity to the SOAP messages schema.
The DAcc application receives from QUAPRO the URL of the web site, the URL of the RDF label on the web site and the hash key generated by QUAPRO. DAcc in response returns whether the hash keys match, and the expiration date status.
2.5 FilterX
FilterX is a content analyzer which enables the intelligent blocking of obscene content accessible through browsers on the World Wide Web. FilterX is a product of i-sieve [5], a spin-off of QUATRO’s partner NCSR "Demokritos". I-sieve provides FilterX to NCSR for the research purposes of the QUATRO project.
For the purposes of QUATRO, FilterX has been adapted to perform as an independent software module which will be invoked by QUAPRO to evaluate labeled Web resources and return a message compatible with the QUATRO specification. So, FilterX accepts a URL sent by QUAPRO and returns a message with the results of content analysis.
3. Concluding remarks
Currently, web sites carrying quality labels such as those administered by the QUATRO partners, Internet Quality Agency and Web Mèdica Acreditada, carry a logo. Clicking the logo results in the display of a database entry confirming the logo’s validity, last review date etc. However, such labels work in isolation and are only visible to human visitors to sites. They cannot be harvested, aggregated or otherwise utilised by machines.
QUATRO offers a substantial improvement to the current situation. First, project members have worked to create a flexible platform that encodes the labels. Secondly, it offers a vocabulary that encompasses the common elements of a wide variety of labeling schemes. The two together have the potential to make many different quality labels highly interoperable. It must be noted that Segala [8] is using the system to encode its certification scheme for web accessibility. RDF content labels are also examined in a W3C Incubator Activity [9] which is feeding directly into the Mobile Web Initiative's development of a mobileOK trustmark [10].
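To make the validity outcomes concrete, the following Python sketch shows one way the hash control, date control and optional content analysis control described for QUAPRO could be combined into a single verdict. It is an illustration under assumptions, not the project's implementation: the paper does not name a hash function (SHA-256 is assumed here), does not specify how the verdicts are composed (the rule used is "any failed check gives invalid, otherwise any unavailable check gives cannot be verified"), and the record layout and function names are hypothetical.

```python
import hashlib
from datetime import date
from typing import Dict, List, Optional

VALID = "valid"
INVALID = "invalid"
UNVERIFIED = "cannot be verified"


def hash_label(label_text: str) -> str:
    """Hash the serialized label. SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(label_text.encode("utf-8")).hexdigest()


def compose(verdicts: List[str]) -> str:
    """Assumed composition rule: invalid wins, then 'cannot be verified'."""
    if INVALID in verdicts:
        return INVALID
    if UNVERIFIED in verdicts:
        return UNVERIFIED
    return VALID


def validate_label(
    label_text: str,
    la_record: Optional[Dict],
    today: date,
    content_verdict: Optional[str] = None,
) -> str:
    """Return valid / invalid / cannot be verified for one label.

    la_record: hypothetical LA answer with 'hash' and 'valid_until' keys,
               or None when the LA database could not be reached.
    content_verdict: result of a content analyzer, or None when no
               analyzer is available for this case (the check is skipped).
    """
    if la_record is None:
        # LA database down or unreachable: neither LA-side check can run.
        verdicts = [UNVERIFIED, UNVERIFIED]
    else:
        # Hash control: the label found on the site must match the LA's hash.
        hash_ok = hash_label(label_text) == la_record["hash"]
        # Date control: the label is valid until its expiry date.
        date_ok = today <= la_record["valid_until"]
        verdicts = [VALID if hash_ok else INVALID,
                    VALID if date_ok else INVALID]
    if content_verdict is not None:
        verdicts.append(content_verdict)
    return compose(verdicts)
```

For example, a label whose text was altered after the LA stored its hash is reported as invalid, while an unreachable LA database yields "cannot be verified" rather than a false negative.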
Furthermore, QUATRO provides the means for users navigating the web with a common web browser to be notified when quality labels are present (using appropriate graphics) and, if they are, whether they are valid or not. The two end-user applications, ViQ and LADI, currently under development, serve this purpose.
4. Acknowledgments
This research was partially funded by the EC through the SIAP project QUATRO (Quality Assurance and Content Description). QUATRO involves the following partners: Pira International (Coordinator), Internet Content Rating Association, Internet Quality Agency, Web Mèdica Acreditada, NCSR “Demokritos”, University of Milan, Coolwave, ECP.NL, ERCIM.
5. References
[1] http://www.quatro-project.org
[2] http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
[3] http://www.w3.org/RDF/
[4] http://purl.oclc.org/quatro/elements/1.0/
[5] http://www.i-sieve.com
[6] http://www.w3.org/TR/soap
[7] http://www.w3.org/TR/rdf-sparql-query/
[8] http://www.segala.com
[9] http://www.w3.org/2005/Incubator/wcl/wcl-charter-20060208.html
[10] http://www.w3.org/Mobile/
[11] http://www.w3.org/TR/xmldsig-core/