<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating Stability and Reliability of Crowdsourcing Output</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rehab Kamal Qarout</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Checco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kalina Bontcheva</string-name>
          <email>k.bontcheva@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sheffield, Department of Computer Science</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Sheffield, Information School</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research proposes to investigate the reliability of the output of crowdsourcing platforms and its consistency over time. We study the effect of interface design and instructions and identify critical differences between two platforms that have been widely used in research and in data collection and evaluation. Our findings will help to uncover data reliability problems and to propose changes in crowdsourcing platforms that can mitigate the inconsistencies of human contributions.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>task design</kwd>
        <kwd>platforms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>There are many successful examples on the web of crowdsourcing platforms.
However, the features and services provided for the requesters vary from one
platform to another, and no single platform meets all the possible requirements
that the requesters may have.</p>
      <p>
        We investigate the quality of the output of different platforms when the same
task design and dataset are used. To study the reliability and consistency of the
output of the platforms and to generalise the findings, we run a continuous
evaluation of existing datasets and replicate the task over multiple weeks.
      </p>
      <p>
        Crowdsourcing platforms evaluation. In this context, a study by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
attempts to validate Amazon Mechanical Turk (MTurk) as a tool for collecting
data in cognitive behavioural research. They designed several types of
experiments and compared the results with traditional laboratory ways of collecting
data. The study showed that the quality of the data collected under the
experimental conditions in MTurk is highly similar to the quality of the data collected
the traditional laboratory way. A similar case study was presented by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], who
analysed the results of surveying workers about their use of particular
technologies. This research compared the results from MTurk and Survey
Monkey with those obtained using a traditional survey. They demonstrated that
crowdsourcing platforms can provide the same results, and do so much faster,
compared to the traditional way of collecting survey data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite some
concerns related to the limitations of the technical and visual design of the task and
unexpected behaviour such as dropping out of a task before finishing it,
collecting data with crowdsourcing saves time and money and reaches a wide range of
users in a few seconds [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        A few papers highlighted the differences between crowdsourcing platforms.
In one of the recent studies, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced the new platform Prolific Academic
(ProA) and compared the results of this platform with those of CrowdFlower (CF) and
MTurk. The study recorded the highest response rate for
participants in CF and the highest data quality for participants in ProA,
comparable to that of MTurk [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Another study [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used a rankings website to
collect data and compare crowdsourcing platforms over two periods of time and
according to a number of criteria: type of service provided, quality and
reliability, region, and online imprint. The findings of this study discuss the effect of the
platforms' characteristics on their traffic data and popularity [
        <xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>
        ].
      </p>
      <p>
        Time consistency of tasks. Studies by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] investigate the creation of evaluation
campaigns for the semantic search task of keyword-based ad-hoc object retrieval
using crowdsourcing tasks. They used a sample of entity queries from the Yahoo!
log and the Microsoft log to evaluate the semantic search results. They showed that the
reliability of crowdsourcing workers and the quality of the results were comparable
to those of experts, even when the same task was repeated over time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Following
this work, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] extend the continuous evaluation of information retrieval (IR)
systems using crowdsourced relevance judgments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Research Questions</title>
      <p>
        This research will address the following questions:
        <list list-type="bullet">
          <list-item><p>RQ1: Is there a significant difference in the quality, reliability, and
consistency of the results for the same task repeated over a different time scale?</p></list-item>
          <list-item><p>RQ2: Is there a significant difference in the quality, reliability, and
consistency of the results for the same task performed on different platforms?</p></list-item>
        </list>
Answering RQ1 requires conducting a study where the same experiments are
repeated on a different time scale. We replicated the experiment using the
same part of the dataset, under the same assumptions discussed in [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ] for measuring
repeatable and reliable evaluation over crowdsourcing systems. These studies
provide experimental evidence that a crowdsourcing platform produces scalable and
reliable results when the task is repeated after one month. We examined the consistency
of the same task over a shorter time scale (once a week).
      </p>
      <p>RQ2 offers an in-depth analysis and practical comparison of crowdsourcing
platforms. We investigated the replication of the same task over multiple
crowdsourcing platforms and over different levels of workers' experience and accuracy
as provided by each platform. Two of the most popular platforms used in
crowdsourcing business and in research studies of data evaluation and
acquisition, namely Amazon Mechanical Turk (MTurk) and Figure Eight (F8),
were chosen for this study.</p>
      <p>For both research questions and for each platform, we ran multiple types of
tasks and measured the stability of the performance over the variations of the
following factors:
        <list list-type="bullet">
          <list-item><p>The quality of the task interface.</p></list-item>
          <list-item><p>The workers' experience level provided by the platform.</p></list-item>
        </list>
      </p>
      <p>The evaluation of these factors depended on the completion time of the task
and the accuracy of the result. Moreover, as the same task is repeated every week,
the overall time needed to complete the batch on each platform will be recorded.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results: Phase 1</title>
      <p>
        The experiments in this phase used the plain interface similar to the one
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We repeated the same experiment five times (once every week),
launching it on the same day of the week and at the same time on each
platform. Each task consisted of 20 tweets to be judged by 150 workers. The
workers were rewarded with $0.15 and could do the task only once, since
after they finished they were excluded from participating in another batch of the
task.
      </p>
      <p>Table 1 presents the results of the baseline phase experiments for the tweets
dataset, with a comparison between the two selected platforms. The results show
some consistency over the five runs on each platform. Workers finished the
task faster on MTurk, where the average time per assignment was approximately
4 minutes, while it took approximately 6 minutes on F8. The overall accuracy
for each run on MTurk was more than 73%, whereas it was around 60%
on F8, as shown in Figure 1. Although the results from MTurk are significantly
better than those from F8, the whole batch took an average of 3 days to
complete on MTurk, compared with 4 to 7 hours on F8.</p>
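      <p>As a rough illustration of how per-run figures such as those above could be derived, the following Python sketch computes the accuracy against gold labels and the mean time per assignment for a single run. The record fields (worker_id, tweet_id, label, minutes), the label names, and the sample values are hypothetical and are not taken from the data of this study.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def summarise_run(judgments, gold):
    """Return accuracy and mean minutes per assignment for one run.

    judgments: list of dicts with the hypothetical keys
               worker_id, tweet_id, label, minutes.
    gold:      dict mapping tweet_id to the reference label.
    """
    if not judgments:
        raise ValueError("no judgments to summarise")
    # Accuracy: share of individual judgments that match the gold label.
    correct = sum(1 for j in judgments if j["label"] == gold[j["tweet_id"]])
    # Each worker completes one assignment, so keep one duration per worker.
    minutes_by_worker = {j["worker_id"]: j["minutes"] for j in judgments}
    return {
        "accuracy": correct / len(judgments),
        "mean_minutes": mean(list(minutes_by_worker.values())),
    }


# Made-up example: two workers, two tweets (not the study's data).
gold = {"t1": "relevant", "t2": "not_relevant"}
judgments = [
    {"worker_id": "w1", "tweet_id": "t1", "label": "relevant", "minutes": 4.0},
    {"worker_id": "w1", "tweet_id": "t2", "label": "relevant", "minutes": 4.0},
    {"worker_id": "w2", "tweet_id": "t1", "label": "relevant", "minutes": 6.0},
    {"worker_id": "w2", "tweet_id": "t2", "label": "not_relevant", "minutes": 6.0},
]
summary = summarise_run(judgments, gold)
print(summary)
```
]]></preformat>
      <p>Applying such a function to each weekly batch on each platform would give one accuracy value and one mean time per run, which is the granularity at which the runs are compared here.</p>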
      <p>A two-way ANOVA was conducted to examine the effect of repeating the
same task several times and on two different platforms on the accuracy of the
results (Table 2). There was a statistically significant interaction between the
effects of task repetition and platform on accuracy (p &lt; 0.05).
There were no differences between the repeated runs of the experiment on each
platform, which indicates the consistency of the outcome of each platform. We
will investigate the reasons for this difference in accuracy between the platforms.</p>
      <p>For the advanced phases of this study, we will investigate why we obtained these
results in the first phase. One of the reasons could be the workers' diversity and
their level of experience. Another reason could be the variation in the amount
of payment for different channels in F8. With this study, we hope to reach a
reasonable understanding of the best strategies and to advise
crowdsourcing users on the best way to achieve better service from the system.</p>
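      <p>For readers unfamiliar with the test, the sketch below shows one way the F statistics of a balanced two-way ANOVA (platform by run, with replication) could be computed in plain Python. It is a minimal illustration, not the analysis code used in this study: the choice of per-worker accuracy as the observation, the function name, and all numbers are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def two_way_anova_f(cells):
    """F statistics for a balanced two-way ANOVA with replication.

    cells maps (platform, run) to a list of accuracy observations;
    every cell must hold the same number (at least 2) of observations.
    """
    platforms = sorted({p for p, _ in cells})
    runs = sorted({r for _, r in cells})
    a, b = len(platforms), len(runs)
    n = len(next(iter(cells.values())))
    if a < 2 or b < 2 or len(cells) != a * b:
        raise ValueError("need a full grid with at least 2 levels per factor")
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("need a balanced design with n >= 2 per cell")

    # Grand mean, marginal means per factor level, and cell means.
    grand = mean([x for v in cells.values() for x in v])
    p_mean = {p: mean([x for r in runs for x in cells[(p, r)]]) for p in platforms}
    r_mean = {r: mean([x for p in platforms for x in cells[(p, r)]]) for r in runs}
    c_mean = {k: mean(v) for k, v in cells.items()}

    # Sums of squares for both main effects, the interaction, and the error.
    ss_p = b * n * sum((p_mean[p] - grand) ** 2 for p in platforms)
    ss_r = a * n * sum((r_mean[r] - grand) ** 2 for r in runs)
    ss_pr = n * sum(
        (c_mean[(p, r)] - p_mean[p] - r_mean[r] + grand) ** 2
        for p in platforms
        for r in runs
    )
    ss_err = sum((x - c_mean[k]) ** 2 for k, v in cells.items() for x in v)

    ms_err = ss_err / (a * b * (n - 1))
    return {
        "F_platform": (ss_p / (a - 1)) / ms_err,
        "F_run": (ss_r / (b - 1)) / ms_err,
        "F_interaction": (ss_pr / ((a - 1) * (b - 1))) / ms_err,
    }


# Made-up accuracies (not the study's data): 2 platforms x 2 runs x 2 workers.
cells = {
    ("MTurk", 1): [0.74, 0.76],
    ("MTurk", 2): [0.74, 0.76],
    ("F8", 1): [0.59, 0.61],
    ("F8", 2): [0.59, 0.61],
}
result = two_way_anova_f(cells)
print(result)
```
]]></preformat>
      <p>In this toy data only the platform effect is non-zero, so only F_platform is large. Turning the F statistics into p-values requires the F distribution, for example via scipy.stats.f.sf with the matching degrees of freedom.</p>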
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This project is supported by the European Union's Horizon 2020 research and
innovation programme under grant agreement No. 732328.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>F.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daskalova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>White</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Comparing the Reliability of Amazon Mechanical Turk and Survey Monkey to Traditional Market Research Surveys</article-title>
          .
          In:
          <source>Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems - CHI EA '17</source>
          . pp.
          <fpage>1092</fpage>
          –
          <lpage>1099</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1145/3027063.3053335
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blanco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halpin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herzig</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pound</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          :
          <article-title>Repeatable and Reliable Search System Evaluation using Crowdsourcing</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>21</volume>
          ,
          <fpage>923</fpage>
          –
          <lpage>932</lpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1016/j.websem.2013.05.005
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Crump</surname>
            ,
            <given-names>M.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonnell</surname>
            ,
            <given-names>J.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gureckis</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research</article-title>
          .
          <source>PLoS ONE</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ) (
          <year>2013</year>
          ). https://doi.org/10.1371/journal.pone.0057410
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mourelatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frarakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzagarakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A Study on the Evolution of Crowdsourcing Websites</article-title>
          .
          <source>European Journal of Social Sciences Education and Research</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ), ISSN 2411-9563 (Online)
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mourelatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzagarakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Review of Online Crowdsourcing Platforms</article-title>
          .
          <source>South-Eastern Europe Journal of Economics</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          –
          <lpage>74</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Peer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandimarte</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acquisti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Beyond the Turk: Alternative platforms for crowdsourcing behavioral research</article-title>
          .
          <source>Journal of Experimental Social Psychology</source>
          <volume>70</volume>
          (
          <issue>January</issue>
          ),
          <fpage>153</fpage>
          –
          <lpage>163</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1016/j.jesp.2017.01.006
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tonon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demartini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cudré-Mauroux</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Combining inverted indices and structured search for ad-hoc object retrieval</article-title>
          .
          In:
          <source>SIGIR</source>
          . p.
          <fpage>125</fpage>
          (
          <year>2012</year>
          ). https://doi.org/10.1145/2348283.2348304
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tonon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demartini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cudré-Mauroux</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Pooling-based continuous evaluation of information retrieval systems</article-title>
          .
          <source>Information Retrieval</source>
          <volume>18</volume>
          (
          <issue>5</issue>
          ),
          <fpage>445</fpage>
          –
          <lpage>472</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1007/s10791-015-9266-y
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>