Multidimensional News Quality: A Comparison of Crowdsourcing and Nichesourcing

Eddy Maddalena (University of Southampton, Southampton, United Kingdom, E.Maddalena@soton.ac.uk)
Davide Ceolin (Centrum Wiskunde & Informatica, Amsterdam, The Netherlands, davide.ceolin@cwi.nl)
Stefano Mizzaro (University of Udine, Udine, Italy, mizzaro@uniud.it)

Abstract

In the age of fake news and of filter bubbles, assessing the quality of information is a compelling issue: it is important for users to understand the quality of the information they consume online. We report on our experiment aimed at understanding whether workers from the crowd can be a suitable alternative to experts for information quality assessment. Results show that the data collected by crowdsourcing seem reliable. The agreement with the experts is not full, but in a task that is so complex and related to the assessor's background, this is expected and, to some extent, positive.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Background

Online information is used by a variety of stakeholders as a basis for decision making, knowledge discovery, studies, and many other activities. However, as a consequence of the democratic nature of the Web, such information shows an extremely diverse level of quality. Making this level of quality explicit for each information item is crucial to allow the stakeholders an overall adequate information perusal. Given their pervasiveness and influence on public opinion, online news are a kind of information whose quality assessment becomes a particularly critical task to contrast the spread of misinformation and disinformation.

Assessing the quality of online news, and of information in general, is a challenging task because of its intrinsic complexity. Information quality can be assessed by considering diverse points of view; how the quality dimensions can be assessed, and how the assessment results should be combined, depends on the assessors and on their requirements. This calls for a combined approach, where automated computation is required to handle the huge amount of information available on the Web, while human computation is required to understand how the quality dimensions are assessed and combined. An important aspect of human computation in this context is its regularity: when human assessments are consistent enough, automated computation can leverage them to scale the computation up.

In a previous work by Ceolin, Noordegraaf, and Aroyo [CNA16], two user studies were performed to collect quality assessments regarding Web documents on the vaccination debate. Assessments were collected by means of a Web application, in a scenario similar to crowdsourcing, with the only difference that the assessments were expressed by a few experts (media scholars and journalism students) rather than by a large crowd of anonymous workers. This approach has been named nichesourcing [Boe+12]. Ceolin, Noordegraaf, and Aroyo noted that, when the task at hand is constrained, experts who share a similar background tend to agree with each other significantly. However, they also noted that the task of deeply assessing online information is rather demanding, and expert availability is limited. Crowdsourcing could be a solution to the limited availability of human assessors.

In this paper, we repeat that study [CNA16] through crowdsourcing to analyse similarities and differences between the two ways of collecting human assessments. Our ultimate goal is to determine if and how crowdsourcing is a suitable alternative to nichesourcing for information quality assessment. Section 2 briefly surveys related work, Section 3 describes the experimental setup we adopted, Section 4 presents the results, and Section 5 concludes the paper.

2 Related Work

In the age of fake news [Laz+18; VRA18] and of the filter bubble [Par11], assessing the quality of information is a compelling issue: it is important for users to understand the quality of the information they consume online. Two initiatives worth mentioning in this field are the W3C Credible Web Community Group (https://credweb.org/) and the Credibility Coalition (http://credibilitycoalition.org). While the first is meant to establish standards to model and share data about the credibility of online information, the second aims at identifying markers and strategies for establishing the credibility of the same information. To this extent, the work we present in this paper is complementary to these initiatives, as it aims at providing gold standards to reason on the credibility (and, more broadly, quality) of online information.

3 Experimental Setup

3.1 Dataset Description

We ran our experiment on a sample from the vaccination debate dataset provided by the QuPiD project (http://qupid-project.net) and used by Ceolin, Noordegraaf, and Aroyo [CNA16]. In 2015, a measles outbreak took place at Disneyland, California. The outbreak triggered a fierce debate that fleshed out the already hot discussions regarding vaccinations, where pro- and anti-vaccination individuals blamed each other for the responsibility of the event. The vaccination debate dataset collects a number of documents regarding that specific debate. While the dataset is limited in size (about 50 documents), it is rather diverse in terms of the types of documents represented (newspaper articles, activist blog posts, etc.) and stances (pro, anti, neutral).

3.2 The Crowdsourcing Task

The crowdsourcing task we ran aimed at collecting laymen judgments concerning the quality of a subset of 20 articles assessed by the experts (media scholars and journalism students). We asked each worker to assess one document along eight different quality dimensions derived from Ceolin, Noordegraaf, and Aroyo [CNA16] (we slightly reformulated some of them to have a shorter description, more adequate for crowd workers):

1. Accuracy - How accurate is the information in this article?
2. Neutrality - Is the document neutral with respect to the topic addressed, or does it take a clear stance (e.g., pro, against)?
3. Readability - Does the document read well?
4. Precision - How precise is the information in this document (as opposed to vague)?
5. Completeness - How complete is the information in this document?
6. Trustworthiness - How trustworthy is the source? Is the source trustworthy or does it exhibit malicious intentions?
7. Relevance - How relevant is the article to the task?
8. Overall quality - What is your general opinion about the quality of the article?

We also asked two further questions requiring the workers' personal opinion, to understand how personal belief affects quality judgment:

9. Your personal opinion - Do you agree with the document content?
10. Your confidence - How knowledgeable/expert are you about the topic?

All 10 assessments were collected on a 5-star Likert scale, as in the original experiment [CNA16]. For each quality dimension, we also asked the users to motivate their judgment with some free text.

The task ran on the Figure Eight (https://www.figure-eight.com/) crowdsourcing platform, selecting level-three workers, who are the highest-accuracy contributors. Each worker was paid 0.2 USD and could not judge more than three articles. Besides redundancy (each article was judged by 10 workers), we also adopted some standard quality checks: each worker was shown a pair of articles of clearly low and high quality, and the work was rejected if the collected values were ranked in the wrong way; there was also a time threshold (the worker needed to spend at least 120 seconds on the task), and some syntactic checks on the free-text motivations.

3.3 Research Questions

This experiment allows us to address three research questions:

Q1. Relationships between quality dimensions: what are the correlations between the quality dimensions? Do some of the quality dimensions correlate in a way that makes one derivable from another? What is the difference between experts and workers?

Q2. Internal agreement (between individual workers): can different workers agree to a reasonable extent when assessing quality dimensions? Are there differences among the dimensions?

Q3. External agreement (between individual workers and experts): what is the individual external agreement, i.e., the agreement between the individual workers and the experts, on all dimensions?
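The quality checks described in Section 3.2 (gold pair ranking, time threshold, free-text checks) can be sketched as a single acceptance function. This is a minimal illustration under stated assumptions: the function and field names are hypothetical, and the actual Figure Eight job configuration is not shown in the paper.

```python
# Sketch of the quality checks from Section 3.2. All names here are
# illustrative; only the thresholds (gold ranking, 120 seconds) come
# from the paper.

MIN_SECONDS = 120  # time threshold stated in the paper


def passes_quality_checks(gold_low_score, gold_high_score,
                          seconds_spent, motivations):
    """Accept a worker's submission only if it passes all checks."""
    # Gold check: the clearly high-quality article must outrank the low one.
    if gold_high_score <= gold_low_score:
        return False
    # Time threshold: at least 120 seconds spent on the task.
    if seconds_spent < MIN_SECONDS:
        return False
    # Syntactic check on the free-text motivations: each must contain
    # at least a few words (an assumed stand-in for the paper's checks).
    if any(len(m.split()) < 3 for m in motivations):
        return False
    return True


# Example: correct gold ranking, enough time, a plausible motivation.
ok = passes_quality_checks(gold_low_score=2, gold_high_score=5,
                           seconds_spent=150,
                           motivations=["the text is clear and sourced"])
print(ok)  # True
```

A worker who inverts the gold pair, rushes the task, or leaves empty motivations would be rejected by the same function.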
What is the aggregate external agreement, i.e., the agreement between the aggregated assessments by the workers and the experts, on all dimensions?

[Figure 1 (scatterplot matrix omitted): Scatterplots and correlations between the dimension pairs, for raw worker values.]

[Figure 2 (scatterplot matrix omitted): Scatterplots and correlations between the dimension pairs, for the experts.]

4 Results

The main results are grouped on the basis of the research questions.

4.1 Q1: Quality Dimensions Relationships

A first result is presented in Figure 1, which shows a scatterplot matrix. For each pair of dimensions (indicated on the diagonal), a scatterplot is shown (in the bottom triangular matrix, with some random jitter to reduce overlap). Each dot in a scatterplot represents one individual worker/article pair, and its coordinates are the values expressed by the worker on the corresponding two dimensions.
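The pairwise correlation values shown in the scatterplot matrices can be reproduced for any two rating vectors. Below is a stdlib-only sketch of the Pearson r reported in the plots; in practice one would likely use scipy.stats.pearsonr, which also returns the p-value. The example data are illustrative, not the paper's ratings.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Illustrative example: Accuracy vs. Overall Quality scores for five articles.
accuracy = [4, 3, 5, 2, 4]
overall = [4, 3, 5, 1, 5]
print(round(pearson_r(accuracy, overall), 2))  # 0.94
```

Iterating this function over all dimension pairs yields the upper-triangular values of a matrix like Figure 1.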
In the upper triangular part, the correlation values are shown with their p-values to measure statistical significance. Figure 2 allows comparing these data to the experts. Comparing correlation values, it is clear that the experts are more consistent across dimensions; p-values are roughly similar in the two cases.

As is common practice in crowdsourcing, in place of using the raw values by individual workers, we compute aggregated values. We select a simple (if not the simplest) aggregation function: the arithmetic mean. Figure 3 shows the correlations obtained when aggregating with the mean the 10 values expressed by the 10 workers on the same article. When comparing to Figure 1, one can see that the correlations increase, although they are less statistically significant. When comparing to Figure 2, one can see that the correlations between dimensions are usually higher for the experts than for the aggregated workers, but the values are definitely more comparable than the individual raw ones; indeed, the aggregated workers show higher correlations than the experts in three cases (the correlation between Accuracy and Relevance, and those between Overall Quality and both Neutrality and Precision). We also tried aggregating with the median, obtaining worse results.

Another remark that can be made by observing the histograms on the diagonals of Figures 1 and 2 is that the values provided by the experts tend to follow more bimodal distributions (they use the extremes of the scale more) than the workers. This is even clearer when looking at the aggregated values, since taking the mean pulls the values even more towards the middle of the scale, as can be seen in Figure 3. The distributions also show that the workers tend to express higher values than the experts.

[Figure 3 (scatterplot matrix omitted): Scatterplots and correlations between the dimension pairs, for aggregated (mean) worker values.]

4.2 Q2: Internal Agreement among Workers

Table 1 shows the agreement among the workers, overall and on each quality dimension, measured by both Krippendorff's α [Kri07] and Φ [Che+17]. Both measures assume values in [−1, +1], with −1 corresponding to complete disagreement, 0 to random agreement, and +1 to complete agreement. For Φ, the table also shows, besides the most likely Φ value, the Highest Posterior Density (HPD) interval, i.e., the interval that contains the actual Φ value with 95% probability: these are quite small intervals, so we can be confident that the most likely Φ value is correct.

Table 1: Agreement among the workers

Dimension         α      Φ      HPD [2.5, 97.5]
All               0.132  0.084  [0.014, 0.146]
Accuracy          0.057  0.800  [0.747, 0.836]
Neutrality        0.016  0.703  [0.609, 0.778]
Readability       0.012  0.687  [0.500, 0.831]
Precision         0.026  0.807  [0.773, 0.868]
Completeness      0.065  0.876  [0.816, 0.903]
Trustworthiness   0.108  0.904  [0.827, 0.954]
Relevance         0.022  0.739  [0.716, 0.783]
Overall Quality   0.011  0.833  [0.805, 0.852]

α values are quite low, but Φ values are much higher. Most likely, as discussed above, the assessment values have quite low variability. In such a case, α exhibits a pathological behavior, which is one of the issues with α that is solved by Φ, as discussed by Checco et al. [Che+17]. The much higher Φ values, together with the narrow HPD intervals, show that the agreement among the workers is consistent, even if not complete.

The results presented so far hint that the data collected by our crowdsourcing experiment are reliable.
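As a sketch of how the α column of Table 1 can be computed, the following is a compact, self-contained implementation of Krippendorff's α for interval data, following the pairable-values formulation in [Kri07]. The toy ratings matrix is illustrative only, not the paper's data.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    `units` is a list of rating lists, one per unit (article); units with
    fewer than two ratings are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]  # all pairable values
    n = len(values)
    # Observed disagreement: squared differences within each unit,
    # averaged over ordered pairs and normalised by (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences over all ordered pairs
    # of pooled values, regardless of unit.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e


# Perfect within-unit agreement on two articles gives alpha = 1.
print(krippendorff_alpha_interval([[1, 1, 1], [5, 5, 5]]))  # 1.0
# Noisier ratings lower alpha.
print(krippendorff_alpha_interval([[4, 5, 3], [2, 1, 2]]))
```

Note that when most ratings cluster on few scale values, d_e shrinks along with d_o, which is the low-variability pathology of α mentioned above; the Φ measure of [Che+17] is a Bayesian model and is not sketched here.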
It is also important to remark that, although the workers in some cases fail to exactly replicate the assessments by the experts (as we discuss shortly), the task is quite complex, and the assessor's background might play a critical role. In this respect, full agreement might even be a problem rather than a feature. If this is the case, it might be necessary to treat different worker groups in different ways, and/or to decrease the granularity and ask workers to evaluate passages of an article instead of a full article. In this light, we observe a low correlation (between 0 and 0.20) between the workers' confidence, i.e., question number 10, and all the quality dimensions, and a moderate correlation (about 0.6) between the workers' agreement with the article assessed, i.e., question 9, and the Precision, Accuracy, and Overall Quality scores. While this correlation is not complete, it still hints at the possibility that a subgroup of the workers shows a confirmation bias, meaning that they tend to judge positively the articles they agree with, and vice versa. In this short paper we do not have the space to discuss these issues in full, and we leave them for future work.

4.3 Q3: External Agreement with the Experts

Turning to the agreement between workers and experts, the scatterplots and correlation values in Figure 4 (top row) show that the agreement of the individual workers with the experts is rather low, as correlation values are positive but quite small, and often not significant. Figure 4 (center row) shows the agreement with the experts obtained when aggregating the worker values with the mean. Correlation values are systematically higher than for individual workers, although almost never greater than 0.5 and often not statistically significant. As previously observed, the aggregation reduces the range of the values: whereas the experts usually use the full spectrum, the aggregated worker scores are more limited. In all these plots, the eight dimensions show quite similar correlation values, with the exception of Neutrality: workers particularly disagree with the experts about it.

Figure 4 (bottom row) demonstrates the previous claim that, in general, the median is a worse aggregation function: lower correlation values are obtained for Completeness, Trustworthiness, Relevance, and, especially, Overall Quality (which shows no correlation with the experts when using the median). However, Readability and Precision are similar, and Neutrality and, especially, Accuracy are higher. This suggests that different and more sophisticated aggregation functions might lead to a higher agreement with the experts, an issue that for space limits we leave for future work.

[Figure 4 (scatterplot grids omitted): Scatterplots and correlations between experts and: (i) individual workers (top row); (ii) aggregated workers, with mean as aggregation function (center row); and (iii) aggregated workers, with median as aggregation function (bottom row).]

5 Conclusions and Future Work

In this paper we present an experiment that aims at comparing crowdsourcing and nichesourcing as methods for assessing the quality of online information from a multidimensional standpoint. We collect 10 assessments for each of 20 articles from a dataset on the vaccination debate, and we analyze them internally and in comparison to previously published expert assessments. We observe that workers tend to use higher values than experts, and that aggregated worker values show a higher correlation in three cases (between Accuracy and Relevance, and between Overall Quality and both Neutrality and Precision). When looking at the internal agreement among workers, we note that this is high, but not complete. This might be due to the fact that at least some workers show a confirmation bias, i.e., tend to rate higher the documents they agree with, and vice versa. Lastly, when looking at the agreement between workers and experts, we can see that this is generally high, except for the Neutrality dimension.

In the future, we plan to extend our dataset to increase the number of assessments, of articles analysed, and of topics covered, to help us generalise our findings. We also plan to extend the depth of our analyses, for example to identify an assessability measure for documents (hinting at how easy it is to assess them), and to identify similar groups of workers with higher internal agreement.

Acknowledgements. This study was partially supported by the H2020 project QROWD (grant agreement ID: 732194).

References

[Boe+12] Victor de Boer, Michiel Hildebrand, Lora Aroyo, Pieter De Leenheer, Chris Dijkshoorn, Binyam Tesfa, and Guus Schreiber. "Nichesourcing: Harnessing the Power of Crowds of Experts". In: Knowledge Engineering and Knowledge Management. Springer Berlin Heidelberg, 2012, pp. 16–20.

[Che+17] Alessandro Checco, Kevin Roitero, Eddy Maddalena, Stefano Mizzaro, and Gianluca Demartini. "Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing". In: The 5th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017). 2017.

[CNA16] Davide Ceolin, Julia Noordegraaf, and Lora Aroyo. "Capturing the Ineffable: Collecting, Analysing, and Automating Web Document Quality Assessments". In: Knowledge Engineering and Knowledge Management. Springer International Publishing, 2016, pp. 83–97.

[Kri07] Klaus Krippendorff. "Computing Krippendorff's alpha reliability". In: Departmental Papers (ASC) (2007), p. 43.

[Laz+18] David M. J. Lazer, Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Greenhill, Filippo Menczer, Miriam J. Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, Michael Schudson, Steven A. Sloman, Cass R. Sunstein, Emily A. Thorson, Duncan J. Watts, and Jonathan L. Zittrain. "The science of fake news". In: Science 359.6380 (2018), pp. 1094–1096.

[Par11] Eli Pariser. The Filter Bubble: What the Internet Is Hiding from You. The Penguin Group, 2011.

[VRA18] Soroush Vosoughi, Deb Roy, and Sinan Aral. "The spread of true and false news online". In: Science 359.6380 (2018), pp. 1146–1151.