<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Many Truth Levels? Six? One Hundred? Even More? Validating Truthfulness of Statements via Crowdsourcing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kevin Roitero</string-name>
          <email>roitero.kevin@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Demartini</string-name>
          <email>g.demartini@uq.edu.au</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
          <email>damiano.spina@rmit.edu.au</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>We report on collecting truthfulness values (i) by means of crowdsourcing and (ii) using fine-grained scales. In our experiment we collect truthfulness values using a bounded and discrete scale with 100 levels as well as a magnitude estimation scale, which is unbounded, continuous, and has an infinite number of levels. We compare the two scales and discuss the agreement with a ground truth provided by experts on a six-level scale.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Checking the validity of statements is an important
task to support the detection of rumors and fake news
in social media. One of the challenges is the ability to
scale the collection of validity labels for a large number
of statements.</p>
      <p>Fact-checking has been shown to be a difficult task to
perform on crowdsourcing platforms.<sup>1</sup> However,
crowdworkers are often asked to annotate the truthfulness
of statements using only a few discrete values (e.g.,
true/false labels).</p>
      <p>Recent work in information retrieval [Roi+18;
Mad+17] has shown that using more fine-grained</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>Recent work has looked at methods to automatically
detect fake news and perform fact-checking. Kriplean et al.
[Kri+14] look at the use of volunteer crowdsourcing
for fact-checking, embedded into a socio-technical system
similar to the democratic process. Compared to
them, we look at the more systematic involvement of
humans in the loop to quantitatively assess the
truthfulness of statements.</p>
      <p>Our work experimentally compares different
schemes to collect labelled data for truthful facts.
Related to this, Medo and Wakeling [MW10]
investigate how the discretization of ratings affects the
co-determination procedure, i.e., where estimates of user
and object reputation are refined iteratively together.</p>
      <p>Zubiaga et al. [Zub+18] and Zubiaga and Ji [ZJ14]
look at how humans assess the credibility of information
and, by means of a human study, identify key
credibility perception features to be used for the automatic
detection of credible tweets. Compared to them, we
also look at the human dimension of credibility
checking, but focus instead on which scale is the most
appropriate for human assessors making such assessments.</p>
      <p>Kochkina, Liakata, and Zubiaga [KLZ18b] and
Kochkina, Liakata, and Zubiaga [KLZ18a] look at
rumour verification by proposing a supervised machine
learning model to automatically perform such a task.
Compared to them, we focus on understanding the
most effective scale to use when collecting the training data
needed to build such models.</p>
      <p>Besides the dataset we use for our experiments in
this paper, other datasets related to fact-checking and
the truthfulness assessment of statements have been
created. The Fake News Challenge<sup>2</sup> addresses the
task of stance detection: estimating the stance of a body
text from a news article relative to a headline.
Specifically, the body text may agree with, disagree with, discuss, or
be unrelated to the headline. The Fact-checking Lab at
CLEF 2018 [Nak+18] addresses a ranking task, i.e.,
ranking sentences in a political debate according to
their worthiness for fact-checking, and a classification
task, i.e., given a sentence that is worth checking,
deciding whether the claim is true, false, or of uncertain
factuality. In our work we use the dataset first
proposed by Wang [Wan17], as it was created using
six-level labels, which is in line with our research
question about how many levels are most appropriate for
such a labelling task.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>We use a sample of statements from the dataset
detailed by Wang [Wan17]. The dataset consists of a
collection of 12,836 labelled statements; each
statement is accompanied by some meta-data specifying
its "speaker", "speaker's job", and "context" (i.e., the
context in which the statement was made),
as well as the truth label assigned by experts on
a six-level scale: pants-fire (i.e., lie), false, barely-true,
half-true, mostly-true, and true.</p>
        <p>For our re-assessment, we perform stratified
random sampling to select 10 statements for each of the
six categories, obtaining a total of 60 statements. The
screenshot in Figure 1 shows one of the statements
included in our sample.</p>
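        <p>As a minimal sketch of the stratified sampling step, the following Python snippet draws 10 statements uniformly at random from each of the six expert categories. The (statement id, label) input layout, the fixed seed, and the toy pool of statements are assumptions made for illustration only; they are not taken from the dataset or from the original experiment.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict

# The six expert labels, ordered from least to most truthful.
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]


def stratified_sample(statements, per_label=10, seed=0):
    """Pick `per_label` statements uniformly at random from each label group.

    `statements` is a list of (statement_id, label) pairs; this layout and
    the `seed` parameter are assumptions of this sketch.
    """
    by_label = defaultdict(list)
    for statement_id, label in statements:
        by_label[label].append(statement_id)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # random.sample draws without replacement, so no statement is picked twice.
    return {label: rng.sample(by_label[label], per_label) for label in LABELS}


# Toy pool: 50 hypothetical statement ids per label (not the real dataset).
pool = [(f"{label}-{i}", label) for label in LABELS for i in range(50)]
sample = stratified_sample(pool)
total = sum(len(ids) for ids in sample.values())
print(total)
```
]]></preformat>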
      </sec>
      <sec id="sec-3-2">
        <title>The Crowdsourcing Task</title>
        <p>We obtain for each statement a crowdsourced truth
label from 10 different workers. Each worker judges six
statements (one for each category) plus two additional
"gold" statements used for quality checks. We also
ask each worker to provide a justification for the truth
value they provide.</p>
        <p>We pay the workers $0.20 for each set of 8 judgments
(i.e., one Human Intelligence Task, or HIT). Workers
are allowed to do only one HIT for each scale, but
they are allowed to provide judgments for both scales.</p>
        <p>We use randomized statement ordering to avoid any
possible document-ordering effect or bias.</p>
        <p>To ensure a good quality dataset, we use the
following quality checks in the crowdsourcing phase:
<list list-type="bullet">
<list-item><p>the truth values given to the two gold statements (one
patently false and the other one patently true)
have to be consistent;</p></list-item>
<list-item><p>the time spent to judge each statement has to be
greater than 8 seconds;</p></list-item>
<list-item><p>each worker has two attempts to complete the
task; at the third unsuccessful attempt at
submitting the task, the worker is prevented from continuing
further.</p></list-item>
</list></p>
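        <p>The quality checks listed above could be sketched roughly as follows. Two points are assumptions of this sketch rather than details given in the text: "consistent" gold answers are read as "the patently true statement receives a strictly higher score than the patently false one", and only the first two submissions of a worker are considered. The function names and the tuple layout are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
MIN_SECONDS = 8    # minimum time per statement, as stated in the text
MAX_ATTEMPTS = 2   # attempts considered before the worker is blocked (assumed reading)


def passes_quality_checks(gold_true_score, gold_false_score, seconds_per_statement):
    """Return True if one HIT submission passes both checks.

    Assumption: gold answers are "consistent" when the patently true
    statement is scored strictly higher than the patently false one.
    """
    gold_ok = gold_true_score > gold_false_score
    time_ok = all(t > MIN_SECONDS for t in seconds_per_statement)
    return gold_ok and time_ok


def review_attempts(attempts):
    """Return the index of the first accepted submission, or None if blocked.

    `attempts` is a list of (gold_true_score, gold_false_score, times)
    tuples in submission order.
    """
    for index, (gold_true, gold_false, times) in enumerate(attempts[:MAX_ATTEMPTS]):
        if passes_quality_checks(gold_true, gold_false, times):
            return index
    return None


# Hypothetical worker: the first submission fails the gold check, the second passes.
times = [12, 15, 9, 20, 11, 14, 10, 13]
attempts = [(90, 95, times), (95, 5, times)]
accepted = review_attempts(attempts)
print(accepted)
```
]]></preformat>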
        <p>We collected the data using the Figure-Eight
platform.<sup>3</sup></p>
      </sec>
      <sec id="sec-3-3">
        <title>Labeling Scales</title>
        <p>We consider two different truth scales, keeping the
same experimental setting (i.e., quality checks, HITs,
etc.):
<list list-type="order">
<list-item><p>a scale in the [0, 100] range, denoted as S100;</p></list-item>
<list-item><p>the Magnitude Estimation [Mos77] scale in the
(0, +∞) range, denoted as ME1.</p></list-item>
</list></p>
        <p>The effects and benefits of using the two scales in the
setting of assessing document relevance for information
retrieval evaluation have been explored by Maddalena
et al. [Mad+17] and Roitero et al. [Roi+18].</p>
        <p>Overall, we collect 800 truth labels for each scale,
so 1,600 in total, for a total cost of $48 including fees.</p>
        <p><sup>2</sup>http://www.fakenewschallenge.org/</p>
        <p><sup>3</sup>https://www.figure-eight.com/</p>
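        <p>The label counts reported for the experiment can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (60 statements, 10 workers per statement, 6 sampled plus 2 gold statements per HIT, $0.20 per HIT); the variable names are ours. Reading the gap between the resulting worker payments and the reported $48 total as platform fees is an inference, not a figure stated in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
STATEMENTS = 60              # sampled statements (10 per expert category)
WORKERS_PER_STATEMENT = 10   # judgments collected per statement
SAMPLED_PER_HIT = 6          # one sampled statement per category in each HIT
GOLD_PER_HIT = 2             # gold statements added to each HIT
PAY_PER_HIT_CENTS = 20       # $0.20 per HIT, kept in cents to avoid float rounding
SCALES = 2                   # S100 and ME1

# Each HIT covers 6 of the 600 required judgments on sampled statements.
hits_per_scale = STATEMENTS * WORKERS_PER_STATEMENT // SAMPLED_PER_HIT
# Every HIT yields 8 labels once the gold statements are counted.
labels_per_scale = hits_per_scale * (SAMPLED_PER_HIT + GOLD_PER_HIT)
total_labels = labels_per_scale * SCALES
worker_pay_cents = hits_per_scale * PAY_PER_HIT_CENTS * SCALES

print(hits_per_scale, labels_per_scale, total_labels, worker_pay_cents / 100)
```
]]></preformat>
        <p>Under these figures, each scale requires 100 HITs and yields 800 labels, and payments to workers amount to $40 across the two scales.</p>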
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Individual Scores</title>
        <p>While the raw scores obtained with the S100 scale are
ready to use, the scores from ME1 need a
normalization phase, since each worker uses a personal, and
potentially different, "inner scale factor" due to the
absence of scale boundaries. We compute the
normalized scores for the ME1 scale following the
standard normalization approach for such a scale, namely
geometric averaging [Ges97; McG03; Mos77]:
<disp-formula><tex-math>s^{*} = \exp\left(\log s - \mu_{H}(\log s) + \mu(\log s)\right),</tex-math></disp-formula>
where s* is the normalized score, s is the raw score, μ_H(log s) is the mean value
of log s within a HIT, and μ(log s) is the mean of
the logarithm of all ME1 scores.</p>
        <p>Figure 2 shows the distributions of the individual scores:
for S100 (left) the raw scores are reported, and for ME1
(right) the normalized scores. The x-axis represents
the score and the y-axis its absolute frequency; the
cumulative distribution is denoted by the red line. As
we can see, for S100 the distribution is skewed towards
higher values (i.e., the right of the plot), and there is a
clear tendency to give scores that are multiples of
ten (an effect that is consistent with the findings by
Roitero et al. [Roi+18]).</p>
        <p>For the ME1 scale, we see that the normalized
scores are almost normally distributed (which is
consistent with the property that scores collected on a
ratio scale like ME1 should be log-normal), although
the distribution is slightly skewed towards lower values
(i.e., the left part of the plot).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Aggregated Scores</title>
        <p>Next, we compute the aggregated scores for both
scales: we aggregate the scores of the ten workers
judging the same statement. Following standard
practice, we aggregate the S100 values using the
arithmetic mean, as done by Roitero et al. [Roi+18], and
the ME1 values using the median, as done by
Maddalena et al. [Mad+17] and Roitero et al. [Roi+18].
Figure 3 shows the aggregated scores; comparing with
Figure 2, we notice that for S100 the distribution is
more balanced, although it cannot be said to be
bell-shaped, and the decimal tendency effect disappears;
furthermore, the most common value is no longer 100 (i.e.,
the limit of the scale). Concerning ME1,
we see that the scores are still roughly normally
distributed.<sup>4</sup> However, the x-range is more limited; this
is an effect of the aggregation function, which tends to
remove the outlier scores.</p>
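        <p>To make the two processing steps concrete, the following sketch normalizes magnitude estimation scores by geometric averaging (each log score is shifted by the mean log score of its HIT and then by the mean log score over all HITs) and then aggregates per statement, using the median for ME1 and the arithmetic mean for S100. The data layout, the function names, and the toy scores are assumptions made for illustration; this is not the authors' code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from statistics import mean, median


def normalize_me(hits):
    """Geometric-averaging normalization of magnitude estimation scores.

    `hits` maps a HIT id to a list of (statement_id, raw_score) pairs given
    by one worker; raw scores must be strictly positive.
    Returns a flat list of (statement_id, normalized_score) pairs.
    """
    all_logs = [math.log(s) for scores in hits.values() for _, s in scores]
    mu_all = mean(all_logs)  # mean log score over all HITs
    normalized = []
    for scores in hits.values():
        mu_hit = mean([math.log(s) for _, s in scores])  # mean log score in this HIT
        for statement_id, s in scores:
            normalized.append((statement_id, math.exp(math.log(s) - mu_hit + mu_all)))
    return normalized


def aggregate(scores, how):
    """Group (statement_id, score) pairs by statement and apply `how` to each group."""
    grouped = {}
    for statement_id, score in scores:
        grouped.setdefault(statement_id, []).append(score)
    return {statement_id: how(values) for statement_id, values in grouped.items()}


# Toy data (hypothetical): worker B uses numbers ten times larger than
# worker A for the same two statements.
me_hits = {
    "hit-A": [("s1", 1.0), ("s2", 4.0)],
    "hit-B": [("s1", 10.0), ("s2", 40.0)],
}
me_aggregated = aggregate(normalize_me(me_hits), median)

# S100 scores are used raw and aggregated with the arithmetic mean.
s100_raw = [("s1", 20), ("s2", 80), ("s1", 30), ("s2", 90)]
s100_aggregated = aggregate(s100_raw, mean)

print(me_aggregated, s100_aggregated)
```
]]></preformat>
        <p>In this toy example the two workers differ only by a constant factor, so after normalization they assign identical scores to each statement, which is the effect the normalization is meant to achieve.</p>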
      </sec>
      <sec id="sec-4-3">
        <title>Comparison with Experts</title>
        <p>We now compare the truth levels obtained by
crowdsourcing with the ground truth. Figure 4
shows the comparison between the S100 and ME1
(normalized and) aggregated scores and the six-level
ground truth. In each of the two charts, each box-plot
represents the distribution of the scores for the corresponding
ground-truth level. We
also report the individual (normalized and) aggregated
scores as colored dots with some random horizontal
jitter. We can see that, even with a small number of
statements (i.e., ten for each category), the median values
of the box-plots are increasing; this is always the case
for S100, and true in most cases for ME1 (where
there is only one exception, for the
two adjacent categories "Lie" and "False"). This
behavior suggests that both the S100 and ME1 scales
allow collecting truth levels that are overall consistent
with the ground truth, and that the S100 scale leads
to a slightly higher level of agreement with the expert
judges than the ME1 scale. We analyze agreement in
more detail in the following.</p>
        <p><sup>4</sup>Running the omnibus test of normality implemented in
scipy.stats.normaltest [DP73], we cannot reject the null
hypothesis, i.e., p &gt; .001 for both the aggregated and raw
normalized scores. Although not rejecting the null hypothesis does not
necessarily tell us that the scores follow a normal distribution, we can
say we are fairly confident that they came from a normal
distribution.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Inter-Assessor Agreement</title>
        <p>Figure 5 shows the inter-assessor agreement of the
workers, namely the agreement among all ten
workers judging the same statement. Agreement
is computed using Krippendorff's α [Kri07] and
Φ Common Agreement [Che+17] measures; as already
pointed out by Checco et al. [Che+17], α and Φ
measure substantially different notions of agreement. As
we can see, while the two agreement measures show
some degree of similarity for S100, for ME1 the
agreement computed is substantially different: while α has
values close to zero (i.e., no agreement), Φ shows a high
agreement level, on average around 0.8. Checco et al.
[Che+17] show that α can have an agreement value of
zero even when agreement is actually present in
the data. Although agreement values seem higher for
ME1, especially when using Φ, it is difficult to clearly
prefer one of the two scales from these results.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Pairwise Agreement</title>
        <p>We also measure the agreement within one unit. We
use the definition of pairwise agreement by Roitero
et al. [Roi+18, Section 4.2.1], which allows comparing
(S100 and ME1) scores with a ground truth expressed on a
different scale (six levels). Figure 6 shows that the pairwise
agreement with the experts of the scores collected
using the two scales is similar.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Differences between the Two Scales</title>
        <p>As a last result, we note that the two scales measure
something different, as shown by the scatter-plot in
Figure 7. Each dot is one statement and the two
coordinates are its aggregated scores on the two scales.
Although Pearson's correlation between the two scales is
positive and significant, it is clear that there are some
differences, which we plan to study in future work.</p>
        <p>We performed a crowdsourcing experiment to
analyze the impact of using different fine-grained labeling
scales when asking crowdworkers to annotate the
truthfulness of statements. In particular, we tested two
labeling scales: S100 [RMM17] and ME1 [Mad+17]. Our
preliminary results with a small sample of statements
from Wang's dataset [Wan17] suggest that:
<list list-type="bullet">
<list-item><p>Crowdworkers annotate the truthfulness of
statements in a way that is overall consistent with the
experts' ground truth collected on a six-level scale
(see Figure 4); thus it seems viable to crowdsource
the truthfulness of statements.</p></list-item>
<list-item><p>Also due to the limited size of our sample (10
statements per category), we cannot quantify which is the best
scale to be used in this scenario: we plan to
further address this issue in future work. In this
respect, we remark that whereas the reliability
of the S100 scale is perhaps expected, it is worth
noticing that the ME1 scale, certainly less
familiar, still leads to truthfulness values that are of
comparable quality to the ones collected by means
of the S100 scale.</p></list-item>
<list-item><p>The scale used nonetheless has some effect, as
shown by the differences in Figure 4, the different
agreement values in Figure 5, and the rather low
agreement between S100 and ME1 in Figure 7.</p></list-item>
<list-item><p>The S100 and ME1 scales seem to lead to similar
agreement with expert judges (Figure 6).</p></list-item>
</list></p>
        <p>Due to space limits, we do not report on other data such as
the justifications provided by the workers
or the time taken to complete the job. We plan to do
so in future work.</p>
        <p>Our preliminary experiment is an enabling step to
further explore the impact of different fine-grained
labeling scales for fact-checking in crowdsourcing
scenarios. We plan to extend the experiment with more,
and more diverse, statements, also from other datasets,
which will allow us to perform further analyses. In
particular, we plan to understand in more detail the
differences between the two scales highlighted in
Figure 7.</p>
        <p>[Nak+18] Preslav Nakov, Alberto Barrón-Cedeño,
Tamer Elsayed, Reem Suwaileh, Lluís
Màrquez, Wajdi Zaghouani, Pepa
Atanasova, Spas Kyuchukov, and
Giovanni Da San Martino. "Overview of the
CLEF-2018 CheckThat! Lab on
Automatic Identification and Verification of
Political Claims". In: Proc. CLEF. 2018.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Che+17]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Checco</surname>
          </string-name>
          , Kevin Roitero, Eddy Maddalena, Stefano Mizzaro, and Gianluca Demartini. "
          <article-title>Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing</article-title>
          ". In:
          <source>Proc. HCOMP</source>
          .
          <year>2017</year>
          , pp.
          <fpage>11</fpage>
          –
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [DP73] Ralph D'Agostino and E. S. Pearson. "
          <article-title>Tests for Departure from Normality. Empirical Results for the Distributions of b2 and √b1</article-title>
          ". In:
          <source>Biometrika</source>
          <volume>60</volume>
          .
          <issue>3</issue>
          (
          <year>1973</year>
          ), pp.
          <fpage>613</fpage>
          –
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Ges97]
          <string-name>
            <given-names>George</given-names>
            <surname>Gescheider</surname>
          </string-name>
          .
          <source>Psychophysics: The Fundamentals</source>
          . 3rd ed. Lawrence Erlbaum Associates,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [KLZ18a]
          <string-name>
            <given-names>Elena</given-names>
            <surname>Kochkina</surname>
          </string-name>
          , Maria Liakata, and Arkaitz Zubiaga. "
          <article-title>All-in-one: Multi-task Learning for Rumour Verification</article-title>
          ". In: arXiv preprint arXiv:1806.03713 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [KLZ18b]
          <string-name>
            <given-names>Elena</given-names>
            <surname>Kochkina</surname>
          </string-name>
          , Maria Liakata, and Arkaitz Zubiaga. "
          <article-title>PHEME dataset for Rumour Detection and Veracity Classification</article-title>
          ". In: (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Kri+14]
          <string-name>
            <given-names>Travis</given-names>
            <surname>Kriplean</surname>
          </string-name>
          , Caitlin Bonnar, Alan Borning, Bo Kinney, and Brian Gill. "
          <article-title>Integrating On-demand Fact-checking with Public Dialogue</article-title>
          ". In:
          <source>Proc. CSCW</source>
          .
          <year>2014</year>
          , pp.
          <fpage>1188</fpage>
          –
          <lpage>1199</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Kri07]
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Krippendorff</surname>
          </string-name>
          . "
          <article-title>Computing Krippendorff's alpha reliability</article-title>
          ". In:
          <source>Departmental papers (ASC)</source>
          (
          <year>2007</year>
          ), p.
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Mad+17]
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          , Stefano Mizzaro, Falk Scholer, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Turpin</surname>
          </string-name>
          . "
          <article-title>On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation</article-title>
          ". In:
          <source>ACM TOIS</source>
          <volume>35</volume>
          .
          <issue>3</issue>
          (
          <year>2017</year>
          ), p.
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [McG03] Mick McGee. "
          <article-title>Usability magnitude estimation</article-title>
          ". In:
          <source>Proceedings of the Human Factors and Ergonomics Society Annual Meeting</source>
          <volume>47</volume>
          .
          <issue>4</issue>
          (
          <year>2003</year>
          ), pp.
          <fpage>691</fpage>
          –
          <lpage>695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Mos77]
          <string-name>
            <given-names>Howard R</given-names>
            <surname>Moskowitz</surname>
          </string-name>
          . "
          <article-title>Magnitude estimation: notes on what, how, when, and why to use it</article-title>
          ". In:
          <source>Journal of Food Quality</source>
          <volume>1</volume>
          .
          <issue>3</issue>
          (
          <year>1977</year>
          ), pp.
          <fpage>195</fpage>
          –
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [MW10]
          <string-name>
            <given-names>Matus</given-names>
            <surname>Medo</surname>
          </string-name>
          and Joseph Rushton Wakeling. "
          <article-title>The effect of discrete vs. continuous-valued ratings on reputation and ranking systems</article-title>
          ". In:
          <source>EPL (Europhysics Letters)</source>
          <volume>91</volume>
          .
          <issue>4</issue>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [RMM17]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Roitero</surname>
          </string-name>
          , Eddy Maddalena, and Stefano Mizzaro. "
          <article-title>Do Easy Topics Predict Effectiveness Better Than Difficult Topics?</article-title>
          " In:
          <source>Proc. ECIR</source>
          .
          <year>2017</year>
          , pp.
          <fpage>605</fpage>
          –
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Roi+18] In: Proc. SIGIR.
          <year>2018</year>
          , pp.
          <fpage>675</fpage>
          –
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Wan17]
          <year>2017</year>
          , pp.
          <fpage>422</fpage>
          –
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [ZJ14]
          <string-name>
            <given-names>Arkaitz</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heng</given-names>
            <surname>Ji</surname>
          </string-name>
          . "
          <article-title>Tweet, but verify: epistemic study of information verification on Twitter</article-title>
          ". In:
          <source>Soc. Net. An. and Min.</source>
          <volume>4</volume>
          .
          <issue>1</issue>
          (
          <year>2014</year>
          ), p.
          <fpage>163</fpage>
          . issn: 1869-5469.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Zub+18]
          <string-name>
            <given-names>Arkaitz</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          , Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. "
          <article-title>Detection and resolution of rumours in social media: A survey</article-title>
          ". In:
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>51</volume>
          .
          <issue>2</issue>
          (
          <year>2018</year>
          ), p.
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>