=Paper= {{Paper |id=Vol-190/paper-9 |storemode=property |title=Quality Labeling of Web Content: The Quatro approach |pdfUrl=https://ceur-ws.org/Vol-190/paper09.pdf |volume=Vol-190 |authors=Vangelis Karkaletsis,Andrea Perego,Phil Archer,Kostas Stamatakis,Pantelis Nasikas and David Rose |dblpUrl=https://dblp.org/rec/conf/mtw/KarkaletsisPASNR06 }} ==Quality Labeling of Web Content: The Quatro approach== https://ceur-ws.org/Vol-190/paper09.pdf
     Quality Labeling of Web Content: The Quatro approach
        Vangelis Karkaletsis                                Andrea Perego                                  Phil Archer
        NCSR “Demokritos”                           Università degli Studi di Milano         Internet Content Rating Association
    P. Grigoriou & Neapoleos str.                         via Comelico 39/41                22 Old Steine, Brighton, East Sussex,
 15310 Ag. Paraskevi Attikis, Greece                   I-20135 Milano MI, Italy                   BN1 1EL United Kingdom
          +30 210 6503197                                  +39 02503 16273                           +44 (0)1473 434770
     vangelis@iit.demokritos.gr                        perego@dico.unimi.it                           parcher@icra.org

          Kostas Stamatakis                                Pantelis Nasikas                                David Rose
        NCSR “Demokritos”                               NCSR “Demokritos”                      Coolwave Limited
    P. Grigoriou & Neapoleos str.                   P. Grigoriou & Neapoleos str.    4-6 Greenfield House, Storrington, Nr
 15310 Ag. Paraskevi Attikis, Greece             15310 Ag. Paraskevi Attikis, Greece     Pulborough, West Sussex, UK
          +30 210 6503215                                 +30 210 6503197                    +44 (0)870 7127000
      kstam@iit.demokritos.gr                        pnas@iit.demokritos.gr                       david@coolwave.co.uk



ABSTRACT
QUATRO is an ongoing EC-funded project which aims to provide a common vocabulary and machine-readable schema for quality labeling of Web content, as well as ways to automatically show the contents of the label(s) found in a Web resource, and functionalities for checking the validity of these labels. The paper presents the QUATRO processes for label validation and user notification, and outlines the architecture of the QUATRO system.

Categories and Subject Descriptors
H.3.5 Online Information Services: Web-based services

General Terms
Management, Reliability, Experimentation, Verification.

Keywords
Quality labeling, web content analysis, RDF schemas

1.        INTRODUCTION
QUATRO is an ongoing EC-funded project which aims to provide a common vocabulary and machine-readable schema for quality labeling of web content, making it possible for the many existing labeling schemes to be brought together through a single, coherent approach without affecting the individual scheme’s criteria or independence [1].

QUATRO’s work on providing a platform for machine-understandable quality labels, also called trustmarks, is part of a much greater activity around the world, that of the Semantic Web [2]. Three QUATRO partners, ERCIM, as European host for W3C, and ICRA and NCSR, as W3C members, are active participants in this activity. RDF, the Resource Description Framework [3], is the key technology behind the Semantic Web, providing a means of expressing data on the web in a structured way that can be processed by machines. It allows a machine to recognize that, for example, 5 blogs are commenting on the same web site, that 3 people have the same site in their (online) bookmarks (favorites) and that it gets a 4.5 rating on a recommender system.

QUATRO adds to the picture in two ways: by providing a way in which any number of web resources can easily share the same description, and by providing a common vocabulary that can be used by labeling authorities. As a result, machines will be able to recognize that a site mentioned in a blog that gets a 4.5 star rating on a recommender system and is in 3 friends’ online bookmarks also has a label. By basing the labels on RDF, QUATRO is effectively promoting the addition of data on the web that a wide variety of other applications can use to build trust in a given resource.

At the time of writing this paper, the details of the QUATRO vocabulary have been finalized and the complete vocabulary is available on the QUATRO site and elsewhere, both as a plain text document and an RDF schema [4]. It will be available for free usage by Labeling Authorities (LAs) as they see fit. The project’s vocabulary is divided into four categories:
-    General Criteria, such as whether the labelled site uses clear language that is fit for purpose, includes a privacy statement, a data protection contact point, etc.
-    Criteria for labelling to ensure accuracy of information, such as the content provider’s credentials and appropriate disclosure of funding.
-    Criteria for labelling to ensure compliance with rules and legislation for e-business, such as fair marketing practices and measures to protect children.
-    Terms used in operating the trust mark scheme itself, such as the date the label was issued, when it was last reviewed and by whom.

LAs will, of course, continue to devise their own criteria. However, where those criteria are equivalent to those in the QUATRO schema, use of common elements offers some distinct advantages.

Work is now underway to develop applications to make use of the machine-readable labels:
-    An application for checking the validity of machine-readable         In the second scenario, labels are stored at the labeled resource
     labels found in web resources. A label’s validity is checked         site. Since such labels are not under the control of the LA, they
     against the corresponding information found in the LA’s              can be easily modified by the resources’ content providers. In
     database. Furthermore, QUATRO also enables, for some                 order to verify their validity, QUATRO needs to be able to verify
     cases, the checking of a label’s validity against the content of     a) whether the label stored at the labeled resource site is the same
     the web resource. The application is implemented as a proxy          as the one that has been generated by the LA (integrity control)
     server, named QUAPRO.                                                and b) whether the label has not expired (date control). The
                                                                          former may be enforced by a hash-matching while the latter by a
-    A browser extension, named ViQ, which enables the visual             date-comparison mechanism.
     interpretation of the label found in the web resource requested
     by the user, according to QUAPRO results. A user is                  More precisely, concerning integrity control, whenever a label is
     therefore able to see that a site has a label and be notified of     generated, the LA hashes the label and the produced hash is
     the label’s validity and content.                                    stored in the LA database. Whenever a label is located inside a
                                                                          web resource, QUATRO hashes it and asks the LA to verify
-    A wrapper for search engines’ results, named LADI, which             whether this hash matches with the hash of the label stored in the
     indicates the presence of label(s) on the web sites listed. This     LA’s database. In addition, for every label generated by the LA, a
     will be available for inspection by clicking an icon adjacent        label expiry date parameter is set, which means that the label is
     to the relevant result. As in the case of ViQ, label validation      valid until that specific date. Therefore, QUATRO gets from the
     and user notification will be performed by QUAPRO.                   LA this valid-until date in order to check the label validity.
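The hash-matching and date-comparison checks described for labels stored at the labeled resource site can be illustrated with a minimal sketch. The paper does not name a hash algorithm or a label serialization, so SHA-256, the function names and the sample label text below are all assumptions made for illustration.

```python
import hashlib
from datetime import date


def label_hash(label_text: str) -> str:
    """Hash the serialized label (SHA-256 is an assumption, not from the paper)."""
    return hashlib.sha256(label_text.encode("utf-8")).hexdigest()


def integrity_ok(found_label: str, la_stored_hash: str) -> bool:
    """Integrity control: the label found at the resource must hash to the LA's value."""
    return label_hash(found_label) == la_stored_hash


def date_ok(valid_until: date, today: date) -> bool:
    """Date control: the valid-until date is treated as inclusive (an assumption)."""
    return today <= valid_until


# Hypothetical label text and dates, for illustration only.
issued_label = "<rdf:RDF><!-- label issued by the LA --></rdf:RDF>"
stored_hash = label_hash(issued_label)  # kept in the LA database at issue time
tampered_label = issued_label.replace("issued", "edited")

print(integrity_ok(issued_label, stored_hash))      # True
print(integrity_ok(tampered_label, stored_hash))    # False
print(date_ok(date(2007, 1, 1), date(2006, 6, 1)))  # True
```

Because the comparison is made on the serialized text, any re-serialization of the label by the content provider would change the hash even if the description itself were unchanged.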
This paper briefly presents the QUATRO processes for label                Finally, as noted before, whenever a content analyzer is available,
validation and user notification (Section 2), the QUATRO                  QUATRO can perform an additional check examining the content
architecture and the main functionalities of the components of the        of the web resource against the label’s content.
system implementing this architecture (Section 3).
                                                                          Thus, three different policies can be enforced for label
2.        Label validation and User notification                          validation: labels’ integrity, labels’ expiry date, and content
Before displaying the content of a label identified in a web              analysis (meaning the semantic equivalence between the actual
resource, it is necessary to examine whether the label is a valid         resource content and the description provided by the label).
one against either the Labeling Authority’s (LA) database or the          Note that it may also be the case that the label cannot be
content of the web resource. For this purpose, QUATRO employs             validated. For instance, the LA database may be down, the
two validation processes.                                                 hosting server may be off-line, or the QUATRO proxy (QUAPRO)
The first one concerns the label’s integrity, independently of            may be unavailable. In such cases we can simply say that the
the content of the web resource. A label is generated by the              validity of the label cannot be verified. This applies even to the
corresponding LA at some point in time, and represents the                case when a content analyzer is not able to decide whether a label
content of the web resource at that time. It is possible that the         is valid or not. Thus we have the following possible results when
provider of the web resource’s content has changed the label’s            evaluating labels: valid, invalid, and cannot be verified.
content without informing the LA. The validation mechanism                As regards user notification, this is performed in order to
must enable the checking of the label’s content against the               inform users whether a resource is labeled or not. Yet, when
corresponding content stored in the LA’s database, in order to            labels are invalid, the description they provide is useless. Thus,
ensure the label’s integrity. This does not mean that a label that        we can devise two different strategies for considering a resource
satisfies the integrity constraint is actually valid, since the content   as labeled:
of the web resource may have changed. On the other hand, we
                                                                          -    when valid labels are associated with it,
cannot be completely sure that a label which does not satisfy our
integrity constraints is necessarily invalid.                             -    when labels are associated with it, independently from their
That’s why examining a label’s integrity must be supported,                    validity.
whenever this is possible, by an additional comparison of the             QUATRO adopts the latter strategy, since it aims at informing
label’s content against the actual resource content. This                 users about the characteristics of the requested resources, not at
constitutes the second QUATRO validation process. It is difficult         blocking inappropriate content. In addition, QUATRO validation
to automate this validation check since it involves the use of            policies allow the verification of labels’ validity against the LA’s
advanced content analysis techniques. In the context of                   database in all cases, but, as regards the validation of the
QUATRO, we use the content analyzer FilterX [5] in one of the             label’s content against the resource’s content, this can only be
case studies.                                                             done when a content analyzer is available for the specific case.
The criteria according to which a label should be considered              Thus, QUATRO’s approach allows the user to access the content
valid/invalid may vary depending on the specific labeling scheme.         of a label even if it is not valid. After being notified whether
We distinguish two different scenarios.                                   a label is valid or not, users can display the contents of any
                                                                          available label. It is up to them to decide whether they will trust it
In the first scenario, the labels are stored at the LA’s site. In such
                                                                          or not.
a case, labels cannot be modified directly by the web resources’
content providers, and thus their integrity is guaranteed. That is, in    Label notification may then return one of the following results:
this case, we can only examine whether the resource’s content has         -    The requested resource is unlabelled: The end user is
been modified so that it is no longer in line with the                         informed that no label is available for the requested resource.
label’s content.
-    The requested resource is labeled: The end user is informed        The next sub-sections provide more information on the
     that labels are present, and he/she is notified whether they       functionalities of QUATRO components.
     are valid, invalid, or they cannot be evaluated.
                                                                        2.1        ViQ
Further work on the label validation scheme will include                The Metadata Visualizer (ViQ) is a client application in charge of
incorporating XML Digital Signatures [11]. In this scenario an LA       two main tasks:
does not need to provide an online database with labels and
hashes as a web service, just a way to locate its public key (e.g.      -     to notify users whether a requested Web resource is
as RDF/A metadata on its website). The label file will contain the            associated with content labels or not;
digital signature of the hash. The hash will be generated as before,    -     to display to the users the contents of the labels associated
and we will generate the digital signature from it, rather than from          with Web resources.
the label itself, due to performance reasons. So, once the labeling
authority creates the label and the hash, and signs it with a digital   ViQ is being developed as a browser extension for the three most
signature from a private key that it (the LA) keeps secret, a user      popular Web browsers (i.e., MS Internet Explorer, Mozilla
agent program can easily verify the integrity of the hash (and thus     Firefox and Opera), providing a toolbar (the ViQ Toolbar), a
the label) by using the public key. One drawback of this                status bar icon, and an additional item in the browser main menu.
validation scheme would be that it might take too much time to          Users are notified of the presence/absence of labels by specific
decrypt the digital signature with the public key in order to get       icons. If labels are available, the user can display their contents.
back the original hash key, but we are working on it.                   ViQ relies on QUAPRO for verifying labels’ validity. Moreover,
                                                                        QUAPRO will be in charge of returning the information needed
QUATRO Architecture                                                     by ViQ to display the label summary and details. More precisely,
Figure 1 depicts the four applications participating in the             whenever a Web resource is requested by the user, ViQ performs
QUATRO quality labels validation and notification tasks (ViQ,           the following steps:
LADI, QUAPRO and FilterX). QUAPRO is the central server-                -     if QUAPRO says that labels are absent, the user is notified
based application which receives requests from the two end-user               that no labels are available for the requested resource;
applications (ViQ, LADI), identifies quality labels, evaluates
them and replies accordingly. A Data Access interface (DAcc),           -     otherwise, ViQ notifies that labels are present, and it displays
placed before an LA’s database, handles the communication                     the list of available labels, each marked with an icon denoting
between QUAPRO and the database.                                              their validity status (valid, invalid, and “cannot be verified” –
                                                                              see Figure 2).
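The two ViQ notification steps listed in Section 2.1 can be sketched as a small function. This is only an illustration: the paper does not specify how QUAPRO's answer is represented on the client, so the `Validity` enum, the `(name, validity)` pairs and the message wording are assumptions.

```python
from enum import Enum
from typing import List, Tuple


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"
    CANNOT_BE_VERIFIED = "cannot be verified"


def viq_notification(labels: List[Tuple[str, Validity]]) -> str:
    """Build the text ViQ could show for the labels QUAPRO reported (hypothetical format)."""
    if not labels:
        # Step 1: QUAPRO says that labels are absent.
        return "No labels are available for the requested resource."
    # Step 2: labels are present; list each one with its validity status.
    lines = ["Labels are present:"]
    for name, validity in labels:
        lines.append(f"- {name} [{validity.value}]")
    return "\n".join(lines)


# Hypothetical QUAPRO results for two requests.
print(viq_notification([]))
print(viq_notification([("Example LA label", Validity.VALID),
                        ("Other LA label", Validity.CANNOT_BE_VERIFIED)]))
```

Labels are listed whatever their status, in line with the strategy adopted in Section 2 of treating a resource as labeled independently of the labels' validity.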
The applications mentioned above have to exchange messages
since QUAPRO needs information from all the parties involved
(ViQ/LADI, LA’s database, content analyzer) to assess the labels'
validity. The Simple Object Access Protocol (SOAP), a W3C
recommendation [6], is used for this purpose. An XML schema
has been devised that must be followed by any application that
wants to use the services provided by QUAPRO. This makes it
possible, for instance, to employ another content analysis tool or
to add another labeling authority. SOAP has been selected because
it uses HTTP (in our case) as its transfer protocol, and therefore no special
configuration is required from the end user when installing the
ViQ plug-in.
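To give a feel for the SOAP exchange described above, the following sketch builds and reads back a request envelope with Python's standard library. Only the SOAP 1.2 envelope namespace is taken from the W3C recommendation; the QUATRO XML schema is not reproduced in this paper, so the message namespace and the `CheckLabels` and `ResourceUrl` element names are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
QUATRO_NS = "urn:example:quatro-messages"             # hypothetical namespace


def build_check_request(resource_url: str) -> str:
    """Wrap a hypothetical 'CheckLabels' request for one URL in a SOAP envelope."""
    ET.register_namespace("env", SOAP_ENV)
    ET.register_namespace("q", QUATRO_NS)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    request = ET.SubElement(body, f"{{{QUATRO_NS}}}CheckLabels")
    ET.SubElement(request, f"{{{QUATRO_NS}}}ResourceUrl").text = resource_url
    return ET.tostring(envelope, encoding="unicode")


def read_resource_url(message: str) -> str:
    """Parse an envelope and extract the requested URL (what a receiver would do)."""
    root = ET.fromstring(message)
    path = f"{{{SOAP_ENV}}}Body/{{{QUATRO_NS}}}CheckLabels/{{{QUATRO_NS}}}ResourceUrl"
    node = root.find(path)
    return node.text


message = build_check_request("http://example.org/")
print(message)
print(read_resource_url(message))
```

In a deployment the envelope would be sent in the body of an HTTP POST, which is why no special configuration is needed on the user's side.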




                                                                                         Figure 2. ViQ browser extension


                                                                        2.2        LADI
                                                                        The Search Engine Wrapper LADI is a server application that
                                                                        gives users an indication of the existence of a label or labels
                                                                        inside the web resources listed in search engine results and then
                                                                        allows them to see more detailed information about those labels.
                Figure 1. QUATRO architecture                           As with ViQ, LADI calls on QUAPRO to provide label summary
                                                                        and details and to verify the validity of labels. Where ViQ
                                                                        provides information about resources that have already been
visited, LADI will provide the same or similar information before      the RDF labels, such as the label creator, the label expiry date and
a resource is visited. LADI’s task is therefore quite different in     the URLs that this label applies to.
that it must check with QUAPRO for each of, say, ten results per
                                                                       When QUAPRO receives a request for one of the labels found in
page of search results that are viewed per user search. It must then
                                                                       a specific URL, it queries the label in order to find its expiry date,
provide the indicators and a method for viewing the information
                                                                       creates its hash and contacts the corresponding LA database (via
within the browser as part of the search result listing returned to
                                                                       DAcc) to assess the validity of the label. While waiting for the
the user.
                                                                       DAcc response, and in case a content analyzer is available
So, LADI will:                                                         (FilterX in our case), it also sends a message to it. When the
-     Provide a web search form initially.                             responses from DAcc and the content analyzer come, QUAPRO
                                                                       compiles the new message to be sent to ViQ/LADI. This message
-     Accept a search term from the user and, using the                contains links to unique URLs in the QUAPRO server that
      appropriate API, perform a server-to-server request to the       contain the labels in natural language so that it can be accessed if
      appropriate search engine (Google, Yahoo! in QUATRO              requested from ViQ/LADI.
      case studies).
-     For each of the resources returned by the search engine(s),      2.4       DAcc
      make a server-to-server request to QUAPRO to check for the       The labeling authorities maintain a database of the web sites that
      existence of a label or labels and to obtain the information     have been labeled as well as metadata about the labels such as
      about those labels.                                              expiration date, language, and the hash key for the label. For
                                                                       QUAPRO, DAcc is a "black box" receiving and sending SOAP
-     Produce the HTML for the search results to be returned to        messages that conform to the SOAP message schema.
      the user, merging the results obtained from the chosen search
      engine with any relevant information from QUAPRO.                The DAcc application receives from QUAPRO the URL of the
                                                                       web site, the URL of the RDF label on the web site and the hash
                                                                       key generated by QUAPRO. In response, DAcc returns whether
                                                                       the hash keys match, and the expiration date status.
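The DAcc exchange just described (site URL, label URL and hash in; hash match and expiration status out) can be sketched as a lookup against the LA's records. This is a minimal illustration, assuming an in-memory dictionary in place of the LA database; the record layout, the names `LabelRecord`, `DAccReply` and `dacc_lookup`, and the use of `None` for an unknown label are assumptions, not part of the QUATRO specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass
class LabelRecord:
    stored_hash: str   # hash computed by the LA when the label was generated
    valid_until: date  # label expiry date set by the LA


@dataclass
class DAccReply:
    hash_matches: bool
    expired: bool


def dacc_lookup(db: Dict[Tuple[str, str], LabelRecord],
                site_url: str, label_url: str,
                quapro_hash: str, today: date) -> Optional[DAccReply]:
    """Answer a QUAPRO query; None means the LA has no record for this label."""
    record = db.get((site_url, label_url))
    if record is None:
        return None
    return DAccReply(hash_matches=(record.stored_hash == quapro_hash),
                     expired=(today > record.valid_until))


# Hypothetical LA database with a single labeled site.
db = {("http://example.org/", "http://example.org/label.rdf"):
      LabelRecord(stored_hash="abc123", valid_until=date(2007, 1, 1))}

print(dacc_lookup(db, "http://example.org/", "http://example.org/label.rdf",
                  "abc123", date(2006, 6, 1)))
```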

                                                                       2.5       FilterX
                                                                       FilterX is a content analyzer which enables the intelligent
                                                                       blocking of obscene content accessible through browsers on the
                                                                       World Wide Web. FilterX is a product of i-sieve [5], a spin-off of
                                                                       QUATRO’s partner NCSR "Demokritos". I-sieve provides
                                                                       FilterX to NCSR for the research purposes of the QUATRO
                                                                       project.
                                                                       For the purposes of QUATRO, FilterX has been adapted to
                                                                       perform as an independent software module which will be
                                                                       invoked by QUAPRO to evaluate labeled Web resources and
                                                                       return a message compatible with the QUATRO specification. So,
                                                                       FilterX accepts a URL sent by QUAPRO and returns a message
                                                                       with the results of content analysis.
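Section 2.3 states that, when date control, hash control and content analysis are all used, a composition of their verdicts gives the final validity value, but the paper does not spell out the composition rule. The sketch below shows one possible rule, stated here purely as an assumption: any definite failure yields "invalid", otherwise any check that could not be completed yields "cannot be verified", otherwise the label is "valid".

```python
from enum import Enum
from typing import Iterable, Optional


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"
    CANNOT_BE_VERIFIED = "cannot be verified"


def compose(verdicts: Iterable[Optional[bool]]) -> Validity:
    """Combine per-check verdicts: True = passed, False = failed, None = no answer.

    The precedence used here (failure, then unknown, then success) is an
    assumed rule for illustration, not the one implemented in QUAPRO.
    """
    results = list(verdicts)
    if not results:
        return Validity.CANNOT_BE_VERIFIED  # nothing was checked
    if any(v is False for v in results):
        return Validity.INVALID
    if any(v is None for v in results):
        return Validity.CANNOT_BE_VERIFIED
    return Validity.VALID


# Hypothetical verdicts in the order: date control, hash control, content analysis.
print(compose([True, True, True]))   # Validity.VALID
print(compose([True, True, None]))   # Validity.CANNOT_BE_VERIFIED
print(compose([True, False, None]))  # Validity.INVALID
```

A `None` verdict covers the situations listed in Section 2, such as the LA database being down or the content analyzer being unable to decide.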

            Figure 3. LADI-annotated search results
                                                                       3.        Concluding remarks
                                                                       Currently, web sites carrying quality labels such as those
                                                                       administered by the QUATRO partners, Internet Quality Agency
2.3        QUAPRO                                                      and Web Mèdica Acreditada, carry a logo. Clicking the logo
QUAPRO is a server-based application that processes requests
                                                                       results in the display of a database entry confirming the logo’s
from both ViQ and LADI. In order to decide on a quality label’s
                                                                       validity, last review date etc. However, such labels work in
validity, QUAPRO can perform 3 different types of controls: date
                                                                       isolation and are only visible to human visitors to sites. They
control, hash control, content analysis control. The first two
                                                                       cannot be harvested, aggregated or otherwise utilised by
checks are used to decide on the label’s validity against the LA’s
                                                                       machines.
database, whereas the third check examines the label’s validity
against the content of the corresponding resource. In case all three   QUATRO offers a substantial improvement to the current
checks are used, a composition of the verdicts gives the final         situation. First, project members have worked to create a flexible
validity value for the label (valid, invalid, “cannot be verified”).   platform that encodes the labels. Secondly, it offers a vocabulary
                                                                       that encompasses the common elements of a wide variety of
QUAPRO either accepts a single URL (ViQ) or a list of URLs
                                                                       labeling schemes. The two together have the potential to make
(LADI) and checks if they are labeled. It looks for links to labels
                                                                       many different quality labels highly interoperable. It must be
in the HTML code of the web page or the HTTP headers when
                                                                       noted that Segala [8] is using the system to encode its certification
accessing a URL. If a label is found, QUAPRO proceeds by
                                                                       scheme for web accessibility. RDF content labels are also
querying the label to find the label’s creator and subsequently
                                                                       examined in a W3C Incubator Activity [9], which is feeding
returns this information to ViQ/LADI. QUAPRO is using the
                                                                       directly into the Mobile Web Initiative's development of a
SPARQL query language [7] for accessing information stored in
                                                                       mobileOK trustmark [10].
Furthermore, QUATRO provides the means for users navigating the web with a common web browser to be notified when quality labels are present (using appropriate graphics) and, if they are, whether they are valid or not. The two end-user applications, ViQ and LADI, currently under development, serve this purpose.

4.       Acknowledgments
This research was partially funded by the EC through the SIAP project QUATRO (Quality Assurance and Content Description). QUATRO involves the following partners: Pira International (Coordinator), Internet Content Rating Association, Internet Quality Agency, Web Mèdica Acreditada, NCSR “Demokritos”, University of Milan, Coolwave, ECP.NL, ERCIM.

5.       References
[1] http://www.quatro-project.org
[2] http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
[3] http://www.w3.org/RDF/
[4] http://purl.oclc.org/quatro/elements/1.0/
[5] http://www.i-sieve.com
[6] http://www.w3.org/TR/soap
[7] http://www.w3.org/TR/rdf-sparql-query/
[8] http://www.segala.com
[9] http://www.w3.org/2005/Incubator/wcl/wcl-charter-20060208.html
[10] http://www.w3.org/Mobile/
[11] http://www.w3.org/TR/xmldsig-core/