Identifying and Classifying Uncertainty Layers in Web Document Quality Assessment

Davide Ceolin (1), Lora Aroyo (1), and Julia Noordegraaf (2)

(1) VU University Amsterdam, Amsterdam, The Netherlands
    {d.ceolin,lora.aroyo}@vu.nl
(2) University of Amsterdam, Amsterdam, The Netherlands
    j.j.noordegraaf@uva.nl

Abstract. Assessing the quality of Web documents is crucial, but challenging. In this paper, we outline the different uncertainty bottlenecks that this task implies, and we propose a strategy to tackle them.

1 Introduction

Assessing the quality of Web documents is a necessary yet challenging task. Consider, for example, a journalist who is writing an article on the vaccination debate and is looking for Web documents to use as sources. What would her definition of quality encompass? Given that she wants to represent a debate, she needs documents that properly represent each point of view, i.e., documents that are complete, accurate, precise, and reliable, with a clear provenance. With the proliferation of information on the Web, the potential set of documents she may be confronted with is so vast that it is necessary to select the documents with the highest quality, seen from the perspective of journalistic usage.

As this example shows, the prime source of uncertainty is the fact that the definition of quality depends on the user's perspective on the data. Suppose that this definition comprises the quality dimensions mentioned before: completeness, accuracy, precision, and trustworthiness. On the one hand, we need to understand how these quality dimensions have to be combined to come up with a final decision about the overall quality of a document (i.e., to decide whether the journalist is going to use the document or not). On the other hand, in order to cope with the Web-scale set of documents the user is presented with, we need to understand how to automatically evaluate and quantify these qualities: what information do we need to extract from the documents to make such a quantification, and how can this information be extracted?

Given the complexity of defining Web document quality, it would be useful to accompany the quality assessments obtained by automatic prediction with a quantification of their confidence. We could always come up with a decision about the quality of a document, but we may be unsure about the accuracy of such a decision. To address this bottleneck, in this paper we propose to identify the possible sources of uncertainty in the process of quality estimation of Web documents, and we discuss an approach to quantify them.

The problem of assessing the quality of Web documents is crucial in information retrieval. Bharat et al. [1] patented a method for clustering online news content based on freshness and quality of the content source, while Kang and Kim [2] find links between specific quality requirements and user queries. We focus on detecting the uncertainty in such clusters and links. Pasi et al. [3] and Floridi and Illari [4] edited two extensive reviews on (Web) information quality and its philosophy. These reviews hint at the uncertainty issues in quality assessment.

The rest of the paper is structured as follows. Sections 2 and 3 introduce a quality assessment pipeline we devised and its sources of uncertainty. Section 4 presents a strategy for uncertainty handling, and Section 5 concludes.

2 Quality Assessment Pipeline

The pipeline for automating the process of quality estimation developed in previous work of ours [5] is depicted in Figure 1 and described below.

[Fig. 1. Overview of the Web Document Assessment pipeline. Diagram omitted.]

Signal Detection
We automatically extract features from documents to be able to identify similarities and differences among them.
Running example. Consider the example in the previous section. Suppose that the journalist indicated the quality of a set of documents, and we intend to identify more documents she might consider of high quality. Since each document differs from the others, we need to extract information that allows us to compare them. If we extract the sentiment and the entities mentioned, we can compare a blog post and a news article on those terms (Do the documents share the same sentiment? Do they mention the same entities?) and, if any correlation with quality is present, these features will be used for quality prediction, as described below.

Quality Dimension Modeling and Assessment Collection
Since the definition of quality depends on the specific user, context, and task, we collect assessments to use as a training set.
Running example. Here we record the document quality assessments provided by the journalist (along with her identity, the context, and the task at hand).

Quality Estimation
Once features are extracted and sample quality assessments are collected, we identify correlations and correspondences between these elements. Typically, we employ machine learning algorithms.
Running example. An automated learning algorithm (e.g., SVM) is used to associate the quality assessments of the journalist in the training set with the document features, in order to predict the quality of other Web documents.
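To make these steps concrete, the following sketch (our own illustration, not part of the pipeline in [5]; TextBlob, spaCy, and scikit-learn are merely example toolkits, and all feature values and ratings are invented) extracts sentiment and entity features and trains a support vector regressor on the journalist's assessments:

    # Requires: pip install textblob spacy scikit-learn
    #           python -m spacy download en_core_web_sm
    from textblob import TextBlob
    import spacy
    from sklearn.svm import SVR

    nlp = spacy.load("en_core_web_sm")

    def extract_features(text, reference_entities):
        """Map a raw document to a small, comparable feature vector."""
        sentiment = TextBlob(text).sentiment.polarity            # in [-1, 1]
        entities = {ent.text.lower() for ent in nlp(text).ents}
        return [sentiment, len(entities & reference_entities)]   # shared entities with the topic

    # Invented training data: feature vectors of documents already rated by the
    # journalist, with her quality assessments on a [0, 1] scale.
    X_train = [[0.6, 12], [-0.4, 3], [0.1, 8], [0.5, 10]]
    y_train = [0.9, 0.2, 0.7, 0.8]
    model = SVR(kernel="rbf").fit(X_train, y_train)

    # Predict the quality of a new, unrated document on the same topic.
    topic_entities = {"who", "measles", "mmr vaccine"}           # invented reference set
    new_doc = "The WHO reports a rise in measles cases ..."
    print(model.predict([extract_features(new_doc, topic_entities)]))

Any comparable extractor or learner could be substituted; the point is only that documents are mapped to a common feature space in which the collected assessments can be generalized to unseen documents.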
3 Sources of Uncertainty

Feature Extractors
Tools for document feature extraction may produce disagreeing results, which adds further uncertainty to the process.
Running example. Suppose that we parse the same document with two different NLP parsers, e.g., P1 and P2: the resulting sentiment scores differ by 0.2 on a scale from -1 (negative) to 1 (positive), and the sets of extracted entities are different. How should we handle such discrepancies? How should we evaluate the reliability of each tool? Several possibilities apply here.

Feature Relevance
These features are collected because they could (jointly) act as quality markers. In principle, the more attributes we collect, the more potential markers we gather. The quality of different types of documents (e.g., newspaper articles, blog posts) could be marked by different features, and a feature that does not mark quality in the documents observed up to a given time could mark quality in the next document collected. However, features could (1) conflict with each other, and (2) create scalability issues due to dimensionality growth. It is difficult to prune these features, because we do not know which of them might become relevant in the future.
Running example. We collected a sample of assessments, and we use it to make quality predictions. Yet, we do not know whether the correspondences we may find between the assessments in the training set and the document features also hold for other documents (nor whether document features that seem useless at the moment might become useful in the future).

Model Selection
Correlations and correspondences between features and qualities can be identified by means of diverse algorithms. For reasons similar to those that hold for the uncertainty linked to feature relevance, the choice of these algorithms is difficult: they could perform well on the dataset at hand, but not on its extension. Since we aim at enabling quality prediction on large sets of Web documents, we need to choose the learning algorithm carefully.
Running example. Suppose that a Support Vector Machine performs well on the training set at hand. We need some guarantee that its performance remains stable as the dataset is extended, e.g., by monitoring the performance over time and by evaluating alternative approaches in parallel, as sketched below.
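As a hedged illustration of the last two points (the feature layout, the ratings, and the choice of scikit-learn learners are all assumptions of ours), the disagreement between the two parsers can itself be kept as a feature, and alternative learners can be monitored side by side:

    # Illustrative sketch: quantify extractor disagreement and monitor alternative models.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor

    # Invented feature rows: [sentiment_P1, sentiment_P2, |P1 - P2|, shared_entities].
    # Keeping the disagreement |P1 - P2| as an explicit feature lets the learner
    # discount documents on which the two parsers conflict.
    X = np.array([[ 0.6,  0.4, 0.2, 12],
                  [-0.3, -0.5, 0.2,  3],
                  [ 0.1,  0.1, 0.0,  8],
                  [ 0.7,  0.6, 0.1, 10],
                  [-0.2, -0.1, 0.1,  2],
                  [ 0.4,  0.5, 0.1,  9]])
    y = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.6])   # journalist's quality ratings

    # Evaluate alternative learners in parallel, so that the model choice can be
    # revisited as the dataset grows.
    for name, model in [("SVR", SVR()),
                        ("Random forest", RandomForestRegressor(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
        print(name, -scores.mean())

A widening gap between such scores as the document collection is extended would be one symptom of the model-selection uncertainty discussed above.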
4 Uncertainty Handling Strategy

We identify the following strategy, based on Semantic Web technologies, to address the uncertainty of Web document quality estimations.

Trace the Provenance of Quality Estimates
Tracing the provenance of the estimates we make is crucial to investigate the reasons for high or low accuracy, and to improve them. We can use PROV [6] to this aim and, by specializing it further, we may be able to better describe the peculiarities of the uncertainty bottlenecks we find.

Reason on and Annotate Provenance Traces
Once we have identified all the steps that led to a given quality estimate, we can estimate the confidence in that estimate by looking at its provenance. In particular, by collecting a large enough set of provenance traces, together with measurements of the estimation accuracy, we can identify which of the processes and entities used lead to higher uncertainty. To properly record the quality of these assessments, we can make use of the Data Quality Vocabulary (DQV) [7].
Running example. We extract sentiment and entities from the documents selected for the journalist, and we use a Support Vector Machine model to predict the quality of the documents. Once we make the prediction, we can measure its accuracy and associate it with the corresponding trace. We can then measure the accuracy obtained with other algorithms (e.g., Bayesian networks) and other input features (e.g., source trustworthiness) as well. By keeping track of the provenance of the estimates, we can infer which parts of the process constitute an uncertainty bottleneck.
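A minimal sketch of such a trace, assuming the Python prov and rdflib packages (all identifiers and the 0.83 accuracy value are invented for illustration), could record the prediction activity in PROV and attach a DQV quality measurement to its output:

    # Illustrative only: URIs and the accuracy value are made up.
    from prov.model import ProvDocument
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # 1. PROV trace: which document, model, and activity produced the estimate.
    prov_doc = ProvDocument()
    prov_doc.add_namespace("ex", "http://example.org/")
    prov_doc.entity("ex:document-42")
    prov_doc.agent("ex:svm-model-v1")
    prov_doc.activity("ex:quality-prediction-run-7")
    prov_doc.used("ex:quality-prediction-run-7", "ex:document-42")
    prov_doc.wasAssociatedWith("ex:quality-prediction-run-7", "ex:svm-model-v1")
    prov_doc.entity("ex:quality-estimate-42")
    prov_doc.wasGeneratedBy("ex:quality-estimate-42", "ex:quality-prediction-run-7")
    print(prov_doc.get_provn())

    # 2. DQV annotation: attach the measured accuracy to the estimate.
    DQV = Namespace("http://www.w3.org/ns/dqv#")
    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX["quality-estimate-42"], DQV.hasQualityMeasurement, EX["measurement-42"]))
    g.add((EX["measurement-42"], RDF.type, DQV.QualityMeasurement))
    g.add((EX["measurement-42"], DQV.isMeasurementOf, EX["accuracy-metric"]))
    g.add((EX["measurement-42"], DQV.value, Literal(0.83, datatype=XSD.double)))
    print(g.serialize(format="turtle"))

Once many such traces are collected together with their accuracy measurements, they can be queried to find which extractors, models, or input features co-occur with low accuracy, i.e., where the uncertainty bottlenecks lie.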
5 Discussion

In this position paper, we discuss the possible sources of uncertainty in the process of automatically estimating the quality of Web documents, and we illustrate them by means of a running example. We propose a general strategy for quantifying such uncertainty, so as to measure the confidence in quality estimates. This procedure relies on Semantic Web technologies (in particular, PROV and DQV) to trace all the steps that led to the estimates and to learn how these steps correlate with uncertainty, in order to detect possible bottlenecks in the process.

Acknowledgments. This work was supported by the Amsterdam Academic Alliance Data Science (AAA-DS) Program Award to the University of Amsterdam and VU University Amsterdam.

References

1. Bharat, K., Curtiss, M., Schmitt, M.: Method and apparatus for clustering news online content based on content freshness and quality of content source. US Patent 9,361,369 (2016)
2. Kang, I.H., Kim, G.: Query type classification for web document retrieval. In: SIGIR '03, ACM (2003) 64–71
3. Pasi, G., Bordogna, G., Jain, L.C., eds.: Quality Issues in the Management of Web Information. Springer (2013)
4. Floridi, L., Illari, P., eds.: The Philosophy of Information Quality. Springer (2014)
5. Ceolin, D., Noordegraaf, J., Aroyo, L., van Son, C.: Towards web documents quality assessment for digital humanities scholars. In: WebSci '16, ACM (2016) 315–317
6. W3C: PROV-O. http://www.w3.org/TR/prov-o/ (2013)
7. W3C: Data Quality Vocabulary. https://www.w3.org/TR/vocab-dqv/ (2015)