<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identifying and Classifying Uncertainty Layers in Web Document Quality Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Ceolin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>lora.aroyo@vu.nl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Noordegraaf</string-name>
          <email>j.j.noordegraaf@uva.nl</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Assessing the quality of Web documents is crucial, but challenging. In this paper, we outline the different uncertainty bottlenecks that such a task implies, and we propose a strategy to tackle them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Assessing the quality of Web documents is a necessary, yet challenging task.
Consider, for example, a journalist who is writing an article on the vaccination debate
and is looking for Web documents to use as sources. What would her definition
of quality encompass? Given that she wants to represent a debate, she needs
documents that properly represent each point of view, i.e., documents that are complete,
accurate, precise, and reliable, with a clear provenance. With the
proliferation of information on the Web, the potential set of documents she may be
confronted with is so vast that it is necessary to select the documents
with the highest quality, seen from the perspective of journalistic usage.</p>
      <p>As this example shows, the prime source of uncertainty is the fact that the
definition of quality depends on the user's perspective on the data. Suppose that
this definition comprises the quality dimensions mentioned before: completeness,
accuracy, precision, and trustworthiness. On the one hand, we need to
understand how these quality dimensions have to be combined to come up
with a final decision about the overall quality of a document (i.e., to decide whether
the journalist is going to use the document or not). On the other hand, in order
to cope with the Web scale of the set of documents the user is presented with, we
need to understand how to automatically evaluate and quantify these qualities:
what information do we need to extract from the documents to make such a
quantification? And how can this information be extracted?</p>
      <p>Given the complexity of defining Web document quality, it would be useful
to accompany the quality assessments obtained by automatic predictions
with a quantification of their confidence. We can always come up with a decision
about the quality of a document, but we may be unsure about the accuracy of
such a decision. To address this bottleneck, in this paper we propose to identify
the possible sources of uncertainty in the process of quality estimation of Web
documents, and we discuss an approach to quantify them.</p>
      <p>
        The problem of assessing the quality of Web documents is crucial in
information retrieval. Bharat et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] patented a method for clustering online
news content based on freshness and quality of content, while Kang and Kim [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
find links between specific quality requirements and user queries. We focus on
detecting the uncertainty in such clusters and links. Pasi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Floridi
and Illari [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] edited two extensive reviews on (Web) information quality and its
philosophy. These reviews hint at the uncertainty issues in quality assessment.
      </p>
      <p>The rest of the paper is structured as follows. Sections 2 and 3 introduce a
quality assessment pipeline we devised and its sources of uncertainty. Section 4
presents a strategy for uncertainty handling, and Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>Quality Assessment Pipeline</title>
      <p>
        The pipeline for automating the process of quality estimation developed in
previous work of ours [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is depicted in Figure 1 and described below.
      </p>
      <p>[Figure 1: The quality assessment pipeline. Users and Web documents are external inputs to a signal detection step (feature extractors, feature relevance), alongside parsers and similar tools; the enriched documents, together with the collected quality assessments from the quality dimension modeling step, feed a learning step (model selection); the learned model then predicts quality assessments for new documents. Each process in the pipeline carries uncertainty.]</p>
      <p>Running Example. An automated learning algorithm (e.g., an SVM) is used to
associate the quality assessments of the journalist in the training set with the
document features, in order to predict the quality of other Web documents.</p>
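      <p>The learn and predict steps of the pipeline can be sketched as follows. As a minimal, self-contained illustration we use a 1-nearest-neighbour predictor as a simple stand-in for the SVM of the running example; the feature names, documents, and quality labels are hypothetical.</p>
      <preformat>
```python
# Sketch of the pipeline's learn/predict steps. A 1-nearest-neighbour
# predictor stands in for the SVM of the running example; features,
# documents, and labels are hypothetical.

def distance(a, b):
    # Euclidean distance between two feature dicts sharing the same keys.
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5

def predict_quality(training_set, new_doc):
    # training_set: list of (features, quality_label) pairs collected
    # from the journalist's assessments.
    _, label = min(training_set, key=lambda pair: distance(pair[0], new_doc))
    return label

training_set = [
    ({"sentiment": 0.8, "n_entities": 12}, "high"),
    ({"sentiment": -0.6, "n_entities": 2}, "low"),
]
print(predict_quality(training_set, {"sentiment": 0.7, "n_entities": 10}))  # high
```
      </preformat>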
    </sec>
    <sec id="sec-3">
      <title>Sources of Uncertainty</title>
      <p>Feature Extractors Tools for document feature extraction may produce
disagreeing results. This adds additional uncertainty to the process.
Running example. Suppose that we parse the same document with two
different NLP parsers, e.g. P1 and P2: the resulting sentiment di ers of 0.2 on
a range from -1 (negative) to 1 (positive), and the sets of entities extracted
are di erent. How shall we handle such discrepancies? How shall we evaluate
the tool reliability? Several possibilities apply here.</p>
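      <p>One possible way to handle such discrepancies can be sketched as follows: average the two sentiment scores, flag the estimate as uncertain when the gap between them is large, and keep only the entities both parsers agree on. The parser outputs and the threshold are hypothetical, and other reconciliation policies (e.g., weighting by tool reliability) are equally plausible.</p>
      <preformat>
```python
# Sketch of one way to reconcile disagreeing feature extractors: average the
# sentiment scores, flag large discrepancies, and keep entities both parsers
# agree on. Parser outputs (p1, p2) and the threshold are hypothetical.

def reconcile(p1, p2, max_gap=0.1):
    gap = abs(p1["sentiment"] - p2["sentiment"])
    return {
        "sentiment": (p1["sentiment"] + p2["sentiment"]) / 2,
        "sentiment_uncertain": gap > max_gap,  # e.g., a 0.2 gap on [-1, 1]
        "entities": sorted(set(p1["entities"]) & set(p2["entities"])),
    }

p1 = {"sentiment": 0.3, "entities": ["vaccine", "WHO"]}
p2 = {"sentiment": 0.1, "entities": ["vaccine", "measles"]}
print(reconcile(p1, p2))
```
      </preformat>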
      <p>Feature Relevance These features are collected because they could (jointly)
act as quality markers. In principle, the more attributes we collect, the more
potential markers we gather. The quality of different types of documents
(e.g., newspaper articles, blog posts) could be marked by different features,
and a feature that does not mark quality in the documents observed up to
a given time could mark quality in the next document collected. However,
features could: (1) conflict with each other; and (2) create scalability issues
due to dimensionality growth. It is difficult to prune these features, because
we do not know which of them might become relevant in the future.
Running example. We collected a sample of assessments, and we use it to
make quality predictions. Yet, we do not know whether the correspondences
between assessments in the training set and document features that we may find are
also valid on other documents (and whether document features that
seem useless at the moment might be useful in the future).</p>
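      <p>The pruning dilemma above can be illustrated with a simple relevance check: score each feature by its absolute Pearson correlation with the quality labels, and set currently irrelevant features aside rather than deleting them, since they may become relevant on future documents. The features, labels, and threshold used here are hypothetical.</p>
      <preformat>
```python
# Sketch of a feature-relevance check: score each feature by the absolute
# Pearson correlation with the quality labels, and archive (rather than
# delete) features that currently look irrelevant. Data are hypothetical.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def split_by_relevance(features, labels, threshold=0.3):
    active, archived = {}, {}
    for name, values in features.items():
        score = abs(pearson(values, labels))
        (active if score >= threshold else archived)[name] = score
    return active, archived

features = {"sentiment": [0.9, 0.8, -0.7, -0.6],
            "n_images": [3, 1, 1, 3]}
labels = [1, 1, 0, 0]  # 1 = high quality according to the journalist
active, archived = split_by_relevance(features, labels)
```
      </preformat>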
      <p>Model Selection Correlations and correspondences between features and
qualities can be identified by means of diverse algorithms. For reasons similar
to those behind the uncertainty linked to feature relevance, the choice of
these algorithms is difficult. They could perform well on a dataset at hand,
but not on its extension. Since we aim at allowing quality prediction on large
sets of Web documents, we need to choose the learning algorithm carefully.
Running example. Suppose that a Support Vector Machine performs well on
the training set at hand. We need to seek guarantees that its
performance remains stable as we extend the dataset, for example by monitoring
the performance and by evaluating alternative approaches in parallel.</p>
    </sec>
    <sec id="sec-4">
      <title>Uncertainty Handling Strategy</title>
      <p>We identify the following strategy based on Semantic Web technologies to
address the uncertainty of Web document quality estimations.</p>
      <p>
        Trace the Provenance of Quality Estimates Tracing the provenance of the
estimates we make is crucial to investigate the reasons for high or low
accuracy, and to improve them. We can use PROV [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to this end, and by
specializing it further, we may be able to better describe the peculiarities of
the uncertainty bottlenecks we may find.
      </p>
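      <p>As an illustration of what such a trace could contain, the following sketch records a quality estimate's provenance as plain entity/activity records with PROV-style "used" and "wasGeneratedBy" relations. A real implementation would serialize this with the W3C PROV vocabulary; all names here are hypothetical.</p>
      <preformat>
```python
# Sketch of a PROV-style trace for one quality estimate, written as plain
# records (entities, activities, agents, with "used" and "wasGeneratedBy"
# relations). Names are hypothetical; a real system would use W3C PROV.

trace = {
    "entities": {
        "doc42": {"type": "web-document"},
        "features42": {"type": "feature-vector"},
        "estimate42": {"type": "quality-estimate", "value": "high"},
    },
    "activities": {
        "extract": {"used": ["doc42"], "agent": "parser-P1"},
        "predict": {"used": ["features42"], "agent": "svm-model-v1"},
    },
    "wasGeneratedBy": {"features42": "extract", "estimate42": "predict"},
}

def lineage(trace, entity):
    # Walk back from an entity to the inputs of the activity that produced it.
    activity = trace["wasGeneratedBy"].get(entity)
    return trace["activities"][activity]["used"] if activity else []

print(lineage(trace, "estimate42"))  # ['features42']
```
      </preformat>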
      <p>
        Reason on and Annotate Provenance Traces Once we have identified all the
steps that led to a given quality estimate, we can estimate the confidence
in the estimate by looking at its provenance. In particular, by collecting a
large enough set of provenance traces, together with measurements of the
estimation accuracy, we can identify which processes and entities lead to higher
uncertainty. To properly trace the quality of these assessments, we can make
use of the Data Quality Vocabulary (DQV) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Running example. We extract sentiment and entities from the documents selected
for the journalist. We use a Support Vector Machine model to predict the quality
of the documents. Once we make the prediction, we can measure its accuracy
and associate it with the current trace. We can then also measure the accuracy
with other algorithms (e.g., Bayesian networks) and input features (e.g., source
trustworthiness). By keeping track of the provenance of the estimates, we can
infer which parts of the process constitute an uncertainty bottleneck.</p>
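      <p>This bottleneck-detection step can be sketched as follows: given accuracy measurements tagged with the agents appearing in each provenance trace, averaging accuracy per agent shows which pipeline component co-occurs with the least reliable estimates. The traces and accuracy values are hypothetical.</p>
      <preformat>
```python
# Sketch of the bottleneck analysis: average estimation accuracy per agent
# appearing in each provenance trace, to see which pipeline component
# co-occurs with the least reliable estimates. Data are hypothetical.

from collections import defaultdict

def accuracy_by_agent(measurements):
    totals = defaultdict(list)
    for agents, accuracy in measurements:
        for agent in agents:
            totals[agent].append(accuracy)
    return {agent: sum(v) / len(v) for agent, v in totals.items()}

measurements = [
    (["parser-P1", "svm-model-v1"], 0.9),
    (["parser-P2", "svm-model-v1"], 0.6),
    (["parser-P1", "bayes-net-v1"], 0.85),
]
scores = accuracy_by_agent(measurements)
bottleneck = min(scores, key=scores.get)
print(bottleneck)  # parser-P2
```
      </preformat>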
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>In this position paper, we discuss the possible sources of uncertainty in the
process of automated estimation of the quality of Web documents, and we
illustrate them by means of a running example. We propose a general strategy for
quantifying such uncertainty, so as to measure the confidence in quality estimates.
This procedure relies on Semantic Web technologies (in particular, PROV and
DQV) to trace all the steps that led to the estimates, and to learn how these
correlate with uncertainty, in order to detect possible bottlenecks in the process.</p>
      <p>Acknowledgments. This work was supported by the Amsterdam Academic
Alliance Data Science (AAA-DS) Program Award to the University of Amsterdam
and VU University Amsterdam.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bharat</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curtiss</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Method and apparatus for clustering news online content based on content freshness and quality of content source</article-title>
          . US Patent 9,361,369 (2016).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
          </string-name>
          , G.:
          <article-title>Query type classification for web document retrieval</article-title>
          .
          <source>In: SIGIR '03</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2003</year>
          )
          <volume>64</volume>
          –
          <fpage>71</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
          </string-name>
          , L.C., eds.:
          <source>Quality Issues in the Management of Web Information</source>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Floridi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Illari</surname>
          </string-name>
          , P., eds.:
          <source>The Philosophy of Information Quality</source>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ceolin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noordegraaf</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>van Son</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards web documents quality assessment for digital humanities scholars</article-title>
          .
          <source>In: WebSci '16</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <volume>315</volume>
          –
          <fpage>317</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. W3C: PROV-O. http://www.w3.org/TR/prov-o/ (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. W3C:
          <article-title>Data quality vocabulary</article-title>
          . https://www.w3.org/TR/vocab-dqv/ (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>