How Many Truth Levels? Six? One Hundred? Even More?
Validating Truthfulness of Statements via Crowdsourcing

Kevin Roitero+, Gianluca Demartini*, Stefano Mizzaro‡, and Damiano Spina§

+ University of Udine, Udine, Italy, roitero.kevin@spes.uniud.it
* University of Queensland, Brisbane, Australia, g.demartini@uq.edu.au
‡ University of Udine, Udine, Italy, mizzaro@uniud.it
§ RMIT University, Melbourne, Australia, damiano.spina@rmit.edu.au



Abstract

We report on collecting truthfulness values (i) by means of crowdsourcing and (ii) using fine-grained scales. In our experiment we collect truthfulness values using a bounded and discrete scale with 100 levels as well as a Magnitude Estimation scale, which is unbounded, continuous, and has an infinite number of levels. We compare the two scales and discuss the agreement with a ground truth provided by experts on a six-level scale.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    Introduction

Checking the validity of statements is an important task to support the detection of rumors and fake news in social media. One of the challenges is the ability to scale the collection of validity labels for a large number of statements.
   Fact-checking has been shown to be a difficult task to perform on crowdsourcing platforms.1 Moreover, crowdworkers are often asked to annotate the truthfulness of statements using only a few discrete values (e.g., true/false labels).
   Recent work in information retrieval [Roi+18; Mad+17] has shown that using more fine-grained scales (e.g., a scale with 100 levels) presents some advantages with respect to classical few-level scales. Inspired by these works, we look at different truthfulness scales and experimentally compare them in a crowdsourcing setting. In particular, we compare two novel scales: a discrete scale with 100 levels, and a continuous Magnitude Estimation scale [Mos77]. Thus our specific research question is: What is the impact of the scale adopted when annotating statement truthfulness via crowdsourcing?

   1 https://fullfact.org/blog/2018/may/crowdsourced-factchecking/

2    Background

Recent work has looked at methods to automatically detect fake news and to fact-check statements. Kriplean et al. [Kri+14] look at the use of volunteer crowdsourcing for fact-checking embedded into a socio-technical system similar to the democratic process. As compared to them, we look at a more systematic involvement of humans in the loop to quantitatively assess the truthfulness of statements.
   Our work experimentally compares different schemes to collect labelled data on the truthfulness of facts. Related to this, Medo and Wakeling [MW10] investigate how the discretization of ratings affects the co-determination procedure, i.e., a procedure where estimates of user and object reputation are refined iteratively together.
   Zubiaga et al. [Zub+18] and Zubiaga and Ji [ZJ14] look at how humans assess the credibility of information and, by means of a human study, identify key credibility perception features to be used for the automatic detection of credible tweets. As compared to them, we also look at the human dimension of credibility checking, but we focus on which is the most appropriate scale for human assessors to make such assessments.
   Kochkina, Liakata, and Zubiaga [KLZ18b; KLZ18a] look at rumour verification by proposing a supervised machine learning model to automatically perform such a task. As compared to them, we focus on understanding the most effective scale to collect the training data needed to build such models.
   Besides the dataset we used for our experiments in this paper, other datasets related to fact-checking and the truthfulness assessment of statements have been created. The Fake News Challenge2 addresses the task of stance detection: estimating the stance of the body text of a news article relative to a headline. Specifically, the body text may agree with, disagree with, discuss, or be unrelated to the headline. The Fact-Checking Lab at CLEF 2018 [Nak+18] addresses a ranking task, i.e., ranking sentences in a political debate according to their worthiness for fact-checking, and a classification task, i.e., given a sentence that is worth checking, deciding whether the claim is true, false, or unsure of its factuality. In our work we use the dataset first proposed by Wang [Wan17], as it has been created using six-level labels, which is in line with our research question about how many levels are most appropriate for such a labelling task.

   2 http://www.fakenewschallenge.org/
3    Experimental Setup

3.1    Dataset

We use a sample of statements from the dataset detailed by Wang [Wan17]. The dataset consists of a collection of 12,836 labelled statements; each statement is accompanied by some meta-data specifying its "speaker", "speaker's job", and "context" (i.e., the context in which the statement has been said), as well as the truth label assigned by experts on a six-level scale: pants-fire (i.e., lie), false, barely-true, half-true, mostly-true, and true.
   For our re-assessment, we perform a stratified random sampling to select 10 statements for each of the six categories, obtaining a total of 60 statements; a sketch of this sampling step is shown below. The screenshot in Figure 1 shows one of the statements included in our sample.
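For illustration, the stratified sampling step can be written as follows. This is a minimal sketch: the function name and the data layout (a list of dicts with a "label" key) are our own, not the authors' code.

    import random

    def stratified_sample(statements, per_category=10, seed=42):
        """Select `per_category` statements for each of the six
        truth levels of the dataset."""
        random.seed(seed)
        levels = ["pants-fire", "false", "barely-true",
                  "half-true", "mostly-true", "true"]
        sample = []
        for level in levels:
            pool = [s for s in statements if s["label"] == level]
            sample.extend(random.sample(pool, per_category))
        return sample  # 6 categories x 10 statements = 60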
[Figure 1 is a screenshot of a statement as presented to the workers.]

Figure 1: Example of a statement included in a crowdsourcing HIT.

3.2    The Crowdsourcing Task

We obtain for each statement a crowdsourced truth label from 10 different workers. Each worker judges six statements (one for each category) plus two additional "gold" statements used for quality checks. We also ask each worker to provide a justification for the truth value he/she provides. We use randomized statement ordering to avoid any possible document-ordering effect/bias.
   We pay the workers $0.20 for each set of 8 judgments (i.e., one Human Intelligence Task, or HIT). Workers are allowed to do only one HIT for each scale, but they are allowed to provide judgments for both scales.
   To ensure a good quality dataset, we use the following quality checks in the crowdsourcing phase (a code sketch of the first two checks follows the list):

   • the truth values of the two gold statements (one patently false and the other one patently true) have to be consistent;

   • the time spent to judge each statement has to be greater than 8 seconds;

   • each worker has two attempts to complete the task; at the third unsuccessful submission attempt the worker is prevented from continuing further.
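A minimal sketch of how the first two checks could be applied to a submitted HIT. The field names, and the reading of gold "consistency" as the patently true gold statement scoring above the patently false one, are our assumptions rather than the authors' definition; the attempt counting of the third check is handled by the platform.

    def hit_passes_checks(judgments, min_seconds=8):
        """Validate one submitted HIT (6 statements + 2 gold).
        Each judgment is a dict with keys "score", "seconds",
        and "gold" (one of "true", "false", or None)."""
        # Check 2: more than 8 seconds spent on each statement.
        if any(j["seconds"] <= min_seconds for j in judgments):
            return False
        # Check 1: consistent gold judgments, read here as the
        # patently true gold scoring above the patently false one.
        gold_true = next(j for j in judgments if j["gold"] == "true")
        gold_false = next(j for j in judgments if j["gold"] == "false")
        return gold_true["score"] > gold_false["score"]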
   We collected the data using the Figure-Eight platform.3

   3 https://www.figure-eight.com/

3.3    Labeling Scales

We consider two different truth scales, keeping the same experimental setting (i.e., quality checks, HITs, etc.):

   1. a scale in the [0, 100] range, denoted as S100;

   2. the Magnitude Estimation [Mos77] scale in the (0, ∞) range, denoted as ME∞.

The effects and benefits of using the two scales in the setting of assessing document relevance for information retrieval evaluation have been explored by Maddalena et al. [Mad+17] and Roitero et al. [Roi+18].
   Overall, we collect 800 truth labels for each scale, so 1,600 in total, at a total cost of $48 including fees.
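As a sanity check on these totals (our own arithmetic, not spelled out in this form in the text): each statement is judged by 10 workers and each HIT covers 6 statements, so each scale requires 60 × 10 / 6 = 100 HITs, and

\[
\underbrace{60 \times 10}_{\text{statement labels}}
+ \underbrace{100 \times 2}_{\text{gold labels}}
= 800 \text{ labels per scale}, \qquad
2 \times 100 \times \$0.20 = \$40 \; (+\,\$8 \text{ of fees} = \$48).
\]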
[Figure 2: two histograms of score frequency against score (0 to 100), each with the cumulative frequency overlaid; the ME∞ panel is titled "Between 0 and 100 -- 99.3% of the scores".]

Figure 2: Individual score distributions: S100 (left, raw), and ME∞ (right, normalized). The red line is the cumulative distribution.

[Figure 3: two histograms of aggregated score frequency against score, each with the cumulative frequency overlaid.]

Figure 3: Aggregated score distributions: S100 (left), and ME∞ (right).

4    Results

4.1    Individual Scores

While the raw scores obtained with the S100 scale are ready to use, the scores from ME∞ need a normalization phase (since each worker will use a personal, and potentially different, "inner scale factor" due to the absence of scale boundaries); we computed the normalized scores for the ME∞ scale following the standard normalization approach for such a scale, namely geometric averaging [Ges97; McG03; Mos77]:

   s* = exp( log s − µ_H(log s) + µ(log s) ),

where s is the raw score, µ_H(log s) is the mean value of log s within a HIT, and µ(log s) is the mean of the logarithm of all ME∞ scores.
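A compact sketch of this normalization (the grouping of scores by HIT and the variable names are our own):

    import numpy as np

    def normalize_me(scores_by_hit):
        """Geometric-averaging normalization of raw ME scores:
        s* = exp(log s - mu_H(log s) + mu(log s)).
        `scores_by_hit` maps a HIT id to the list of raw
        (positive) ME scores given within that HIT."""
        all_logs = np.log(np.concatenate(list(scores_by_hit.values())))
        mu = all_logs.mean()       # mu(log s): mean over all scores
        normalized = {}
        for hit, scores in scores_by_hit.items():
            logs = np.log(scores)  # logs.mean() is mu_H(log s)
            normalized[hit] = np.exp(logs - logs.mean() + mu)
        return normalized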
                                             
   Figure 2 shows the individual score distributions: for S100 (left) the raw scores are reported, and for ME∞ (right) the normalized scores. The x-axis represents the score, while the y-axis represents its absolute frequency; the cumulative distribution is denoted by the red line. As we can see, for S100 the distribution is skewed towards higher values, i.e., the right of the plot, and there is a clear tendency to give scores which are multiples of ten (an effect that is consistent with the findings by Roitero et al. [Roi+18]).
   For the ME∞ scale, we see that the normalized scores are almost normally distributed (which is consistent with the property that scores collected on a ratio scale like ME∞ should be log-normal), although the distribution is slightly skewed towards lower values (i.e., the left part of the plot).

4.2    Aggregated Scores

Next, we compute the aggregated scores for both scales: we aggregate the scores of the ten workers judging the same statement. Following standard practice, we aggregate the S100 values using the arithmetic mean, as done by Roitero et al. [Roi+18], and the ME∞ values using the median, as done by Maddalena et al. [Mad+17] and Roitero et al. [Roi+18].
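The aggregation step is straightforward; a sketch (the data layout is our own):

    import numpy as np

    def aggregate(scores_by_statement, scale="S100"):
        """Aggregate the ten per-worker scores of each statement:
        arithmetic mean for S100, median for normalized ME."""
        agg = np.mean if scale == "S100" else np.median
        return {stmt: float(agg(scores))
                for stmt, scores in scores_by_statement.items()}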
   Figure 3 shows the aggregated scores; comparing with Figure 2, we notice that for S100 the distribution is more balanced, although it cannot be said to be bell-shaped, and the decimal tendency effect disappears; furthermore, the most common value is no longer 100 (i.e., the limit of the scale). Concerning ME∞, we see that the scores are still roughly normally distributed.4 However, the x-range is more limited; this is an effect of the aggregation function, which tends to remove the outlier scores.

   4 Running the omnibus test of normality implemented in scipy.stats.normaltest [DP73], we cannot reject the null hypothesis, i.e., p > .001 for both the aggregated and the raw normalized scores. Although not rejecting the null hypothesis does not necessarily tell us that the scores follow a normal distribution, we can be reasonably confident that they are compatible with one.
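The check mentioned in the footnote can be reproduced along these lines (`me_scores` is a placeholder for the vector of raw normalized or aggregated ME∞ scores):

    from scipy.stats import normaltest

    # D'Agostino-Pearson omnibus test [DP73]; the null hypothesis
    # is that the sample comes from a normal distribution.
    stat, p = normaltest(me_scores)
    if p > 0.001:
        print("cannot reject normality at the .001 level")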

[Figure 4: two box-plot charts of crowd scores (S100 and ME∞, respectively) against the six expert categories (Lie, False, Barely, Half, Mostly, True).]

Figure 4: Comparison with ground truth: S100 (top), and ME∞ (bottom).

4.3    Comparison with Experts

We now compare the truth levels obtained by crowdsourcing with the expert ground truth. Figure 4 shows the comparison between the S100 and ME∞ (normalized and) aggregated scores and the six-level ground truth. In each of the two charts, each box-plot represents the corresponding score distribution. We also report the individual (normalized and) aggregated scores as colored dots with some random horizontal jitter. We can see that, even with a small number of documents (i.e., ten for each category), the median values of the box-plots are increasing; this is always the case for S100, and true in most cases for ME∞ (there is only one case in which this is untrue, for the two adjacent categories "Lie" and "False"). This behavior suggests that both the S100 and ME∞ scales allow us to collect truth levels that are overall consistent with the ground truth, and that the S100 scale leads to a slightly higher level of agreement with the expert judges than the ME∞ scale. We analyze agreement in more detail in the following.

4.4    Inter-Assessor Agreement

Figure 5 shows the inter-assessor agreement of the workers, namely the agreement among all the ten workers judging the same statement. Agreement is computed using the Krippendorff's α [Kri07] and Φ Common Agreement [Che+17] measures; as already pointed out by Checco et al. [Che+17], Φ and α measure substantially different notions of agreement. As we can see, while the two agreement measures show some degree of similarity for S100, for ME∞ the computed agreement is substantially different: while α has values close to zero (i.e., no agreement), Φ shows a high agreement level, on average around 0.8. Checco et al. [Che+17] show that α can have an agreement value of zero even when agreement is actually present in the data. Although agreement values seem higher for ME∞, especially when using Φ, it is difficult to clearly prefer one of the two scales from these results.
[Figure 5: two charts of Agreement Score against Gold Breakdown (Lie, False, Barely, Half, Mostly, True, All), with one series per agreement measure (alpha, phi).]

Figure 5: Assessor agreement: S100 (left), and ME∞ (right).
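For reference, a textbook-style sketch of Krippendorff's α for interval data with no missing values; this is our own implementation, not necessarily the one used for the paper, and Φ is defined in [Che+17]:

    import itertools
    import numpy as np

    def krippendorff_alpha_interval(matrix):
        """Krippendorff's alpha, interval metric, complete data.
        `matrix` has shape (workers, statements): matrix[i][u] is
        the score worker i gave to statement u."""
        m = np.asarray(matrix, dtype=float)
        # Observed disagreement: mean squared difference between
        # the scores of every pair of workers on the same statement.
        d_o = np.mean([(a - b) ** 2
                       for col in m.T
                       for a, b in itertools.combinations(col, 2)])
        # Expected disagreement: mean squared difference between
        # every pair of scores, regardless of the statement.
        d_e = np.mean([(a - b) ** 2
                       for a, b in itertools.combinations(m.ravel(), 2)])
        return 1.0 - d_o / d_e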

4.5    Pairwise Agreement

We also measure the agreement within one unit. We use the definition of pairwise agreement by Roitero et al. [Roi+18, Section 4.2.1], which allows us to compare (S100 and ME∞) scores with a ground truth expressed on a different scale (six levels). Figure 6 shows that the pairwise agreement with the experts is similar for the scores collected using the two scales.

[Figure 6: frequency against pairwise agreement (0.0 to 1.0), one curve per scale (S100, ME).]

Figure 6: Complementary cumulative distribution function of assessor agreement for S100 and ME∞.
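As a rough illustration of the idea (not the exact definition, for which we refer to [Roi+18, Section 4.2.1]): a pair of statements counts as agreeing when the fine-grained crowd scores preserve the strict order of the expert labels; in the paper the measure is computed over pairs within the same unit (HIT).

    import itertools

    def pairwise_agreement(scores, gold):
        """`scores` maps a statement to its crowd score; `gold`
        maps it to an expert level in {0, ..., 5}. Returns the
        fraction of order-preserving pairs."""
        agree = total = 0
        for s1, s2 in itertools.combinations(scores, 2):
            if gold[s1] == gold[s2]:
                continue  # equal expert levels: nothing to preserve
            total += 1
            if (scores[s1] - scores[s2]) * (gold[s1] - gold[s2]) > 0:
                agree += 1
        return agree / total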
4.6    Differences between the two Scales

As a last result, we note that the two scales measure something different, as shown by the scatter-plot in Figure 7. Each dot is one statement, and its two coordinates are its aggregated scores on the two scales. Although Pearson's correlation between the two scales is positive and significant, it is clear that there are some differences, which we plan to study in future work.

[Figure 7: scatter-plot of aggregated ME∞ scores against aggregated S100 scores; legend: ρ = 0.42 (p < .01), τ = 0.21 (p < .05).]

Figure 7: Agreement of the aggregated scores between S100 and ME∞.
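The correlations reported in the legend of Figure 7 can be computed as follows (`s100` and `me` are placeholders for the two aligned vectors of aggregated scores, one entry per statement):

    from scipy.stats import pearsonr, kendalltau

    rho, p_rho = pearsonr(s100, me)    # Pearson's rho
    tau, p_tau = kendalltau(s100, me)  # Kendall's tau
    print(f"rho = {rho:.2f} (p = {p_rho:.3f}), "
          f"tau = {tau:.2f} (p = {p_tau:.3f})")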
5    Conclusions and Future Work

We performed a crowdsourcing experiment to analyze the impact of using different fine-grained labeling scales when asking crowdworkers to annotate the truthfulness of statements. In particular, we tested two labeling scales: S100 [RMM17] and ME∞ [Mad+17]. Our preliminary results with a small sample of statements from Wang's dataset [Wan17] suggest that:

   • Crowdworkers annotate the truthfulness of statements in a way that is overall consistent with the experts' ground truth collected on a six-level scale (see Figure 4); thus it seems viable to crowdsource the truthfulness of statements.

   • Also due to the limited size of our sample (10 statements per category), we cannot quantify which is the best scale to be used in this scenario: we plan to further address this issue in future work. In this respect, we remark that whereas the reliability of the S100 scale is perhaps expected, it is worth noticing that the ME∞ scale, surely less familiar, anyway leads to truthfulness values that are of comparable quality to the ones collected by means of the S100 scale.

   • The scale used has anyway some effect, as shown by the differences in Figure 4, the different agreement values in Figure 5, and the rather low agreement between S100 and ME∞ in Figure 7.

   • The S100 and ME∞ scales seem to lead to similar agreement with the expert judges (Figure 6).

   Due to space limits, we do not report on other data, like, for example, the justifications provided by the workers or the time taken to complete the job. We plan to do so in future work.
   Our preliminary experiment is an enabling step to further explore the impact of different fine-grained labeling scales for fact-checking in crowdsourcing scenarios. We plan to extend the experiment with more and more diverse statements, also from other datasets, which will allow us to perform further analyses. We plan in particular to understand in more detail the differences between the two scales highlighted in Figure 7.
References

[Che+17]  Alessandro Checco, Kevin Roitero, Eddy Maddalena, Stefano Mizzaro, and Gianluca Demartini. "Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing". In: Proc. HCOMP. 2017, pp. 11–20.

[DP73]    Ralph D'Agostino and Egon S. Pearson. "Tests for Departure from Normality. Empirical Results for the Distributions of b2 and √b1". In: Biometrika 60.3 (1973), pp. 613–622.

[Ges97]   George Gescheider. Psychophysics: The Fundamentals. 3rd. Lawrence Erlbaum Associates, 1997.

[KLZ18a]  Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. "All-in-one: Multi-task Learning for Rumour Verification". In: arXiv preprint arXiv:1806.03713 (2018).

[KLZ18b]  Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. "PHEME dataset for Rumour Detection and Veracity Classification". In: (2018).

[Kri+14]  Travis Kriplean, Caitlin Bonnar, Alan Borning, Bo Kinney, and Brian Gill. "Integrating On-demand Fact-checking with Public Dialogue". In: Proc. CSCW. 2014, pp. 1188–1199.

[Kri07]   Klaus Krippendorff. "Computing Krippendorff's alpha reliability". In: Departmental Papers (ASC) (2007), p. 43.

[Mad+17]  Eddy Maddalena, Stefano Mizzaro, Falk Scholer, and Andrew Turpin. "On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation". In: ACM TOIS 35.3 (2017), p. 19.

[McG03]   Mick McGee. "Usability magnitude estimation". In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting 47.4 (2003), pp. 691–695.

[Mos77]   Howard R. Moskowitz. "Magnitude estimation: notes on what, how, when, and why to use it". In: Journal of Food Quality 1.3 (1977), pp. 195–227.

[MW10]    Matúš Medo and Joseph Rushton Wakeling. "The effect of discrete vs. continuous-valued ratings on reputation and ranking systems". In: EPL (Europhysics Letters) 91.4 (2010).

[Nak+18]  Preslav Nakov, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Wajdi Zaghouani, Pepa Atanasova, Spas Kyuchukov, and Giovanni Da San Martino. "Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims". In: Proc. CLEF. 2018.

[RMM17]   Kevin Roitero, Eddy Maddalena, and Stefano Mizzaro. "Do Easy Topics Predict Effectiveness Better Than Difficult Topics?" In: Proc. ECIR. 2017, pp. 605–611.

[Roi+18]  Kevin Roitero, Eddy Maddalena, Gianluca Demartini, and Stefano Mizzaro. "On Fine-Grained Relevance Scales". In: Proc. SIGIR. 2018, pp. 675–684.

[Wan17]   William Yang Wang. ""Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection". In: Proc. ACL. 2017, pp. 422–426.

[ZJ14]    Arkaitz Zubiaga and Heng Ji. "Tweet, but verify: epistemic study of information verification on Twitter". In: Social Network Analysis and Mining 4.1 (2014), p. 163.

[Zub+18]  Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. "Detection and resolution of rumours in social media: A survey". In: ACM Computing Surveys (CSUR) 51.2 (2018), p. 32.