         Investigating Stability and Reliability of
                  Crowdsourcing Output

     Rehab Kamal Qarout1 , Alessandro Checco2 , and Kalina Bontcheva1
     1
          University of Sheffield, Department of Computer Science, Sheffield, UK
              2
                 University of Sheffield, Information School, Sheffield, UK
               {rkqarout1, a.checco, k.bontcheva}@sheffield.ac.uk



         Abstract. This research proposes to investigate the reliability of the
         output of crowdsourcing platforms and its consistency over time. We
         study the effect of interface design and instructions and identify critical
         differences between two platforms that have been widely used in research
         for data collection and evaluation. Our findings will help to uncover data
         reliability problems and to propose changes to crowdsourcing platforms
         that can mitigate the inconsistencies of human contributions.

         Keywords: crowdsourcing · task design · platforms.


1   Introduction
There are many successful examples of crowdsourcing platforms on the web.
However, the features and services provided to requesters vary from one platform
to another, and no single platform meets all the requirements that requesters
may have.
   We investigate the quality of the output of different platforms when the same
task design and dataset are used. To study the reliability and consistency of the
output of the platforms and to generalise the findings, we run a continuous
evaluation of existing datasets and replicate the task over multiple weeks.


2   Related Work
Crowdsourcing platform evaluation. In this context, a study by [3] attempts
to validate Amazon Mechanical Turk (MTurk) as a tool for collecting data in
cognitive behavioural research. They designed several types of experiments and
compared the results with traditional laboratory ways of collecting data. The
study showed that the quality of the data collected under the experimental
conditions on MTurk is highly similar to that of data collected in the traditional
laboratory setting. A similar case study was presented by [1], who
analysed the results of surveying workers about their use of particular
technologies. This research compared the results from MTurk and Survey Monkey
to those obtained using a traditional survey. They demonstrated that
crowdsourcing platforms can provide the same results much faster than the
traditional way of collecting survey data [1]. Despite some concerns about the
limitations of the technical and visual design of tasks, and unexpected behaviour
such as workers dropping out of a task before finishing it, collecting data with
crowdsourcing saves time and money and reaches a wide range of users in a few
seconds [3].
    A few papers have highlighted the differences between crowdsourcing platforms.
In one of the most recent studies, [6] introduced the new platform Prolific
Academic (ProA) and compared its results with those of CrowdFlower (CF) and
MTurk. This study recorded the highest response rate for participants on CF and
the highest data quality for participants on ProA, comparable to MTurk's [6].
Another study [5] used a rankings website to collect data and compare
crowdsourcing platforms over two periods of time and according to a number of
criteria: type of service provided, quality and reliability, region, and online
imprint. Its findings discuss the effect of the platforms' characteristics on their
traffic data and popularity [5, 4].


Time consistency of tasks. Studies by [2] investigate the creation of evaluation
campaigns for the semantic search task of keyword-based ad-hoc object retrieval
using crowdsourcing. They used a sample of entity queries from the Yahoo! and
Microsoft logs to evaluate semantic search results. They show that the reliability
of the crowdsourcing workers and the quality of the results are comparable to
those of experts, even when the same task is repeated over time [2]. Following
this work, [8] extend the continuous evaluation of information retrieval (IR)
systems using crowdsourced relevance judgments.


3    Research Questions

This research will address the following questions:

 – RQ1: Is there a significant difference in the quality, reliability, and consis-
   tency of the results for the same task repeated over a different time scale?
 – RQ2: Is there a significant difference in the quality, reliability, and consis-
   tency of the results for the same task performed on different platforms?

    Answering RQ1 requires a study in which the same experiments are repeated
on different time scales. We replicated the experiment using the same part of the
dataset, under the same assumptions discussed in [2, 7] for measuring repeatable
and reliable evaluation on crowdsourcing systems. These studies provide
experimental evidence that a crowdsourcing platform produces scalable and
reliable results when the task is repeated after one month. We examined the
consistency of the same task over a shorter time scale (once a week).
    RQ2 calls for an in-depth analysis and practical comparison of crowdsourcing
platforms. We investigated the replication of the same task across multiple
crowdsourcing platforms and across the different levels of worker experience and
accuracy provided by each platform. Two of the most popular platforms, which have
been widely used in crowdsourcing business and in research on data evaluation
and acquisition, namely Amazon Mechanical Turk (MTurk) and Figure Eight (F8),
were chosen for this study.
    For both research questions and for each platform, we ran multiple types of
tasks and measured the stability of the performance under variations of the
following factors:
 – The quality of the task interface.
 – The workers’ experience level provided by the platform.
   The evaluation of these factors is based on the completion time of the task and
the accuracy of the results. Moreover, since the same task is repeated every week,
the overall time to complete the batch on each platform is also recorded.
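   As a minimal sketch of how these two measures could be computed, assuming a
hypothetical per-judgment export judgments.csv with platform, run, worker_id,
start_time, end_time and is_correct columns (none of these names come from the
platforms or from this study), the per-run statistics could be derived in Python
as follows:

   import pandas as pd

   # Hypothetical per-judgment export; all column names are assumptions.
   df = pd.read_csv("judgments.csv", parse_dates=["start_time", "end_time"])
   df["duration_s"] = (df["end_time"] - df["start_time"]).dt.total_seconds()

   # Per-worker accuracy and assignment time within each weekly run and platform.
   per_worker = df.groupby(["platform", "run", "worker_id"]).agg(
       accuracy=("is_correct", "mean"),
       time_s=("duration_s", "mean"),
   )

   # Mean assignment time and accuracy (mean and standard deviation across
   # workers) for each platform and weekly run.
   summary = per_worker.groupby(["platform", "run"]).agg(
       avg_time_s=("time_s", "mean"),
       accuracy_mean=("accuracy", "mean"),
       accuracy_std=("accuracy", "std"),
   )
   print(summary)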


4   Experimental Results: Phase 1
The experiments in this phase used a plain interface similar to the one presented
in [2]. We repeated the same experiment five times (once a week), launching it on
the same day of the week and at the same time on each platform. Each task
consisted of 20 tweets to be judged by 150 workers. Workers were rewarded with
$0.15 and could do the task only once: after finishing, they were excluded from
participating in another batch of the task.
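    The paper does not prescribe how this setup is implemented on each platform.
As one hedged illustration only, a batch with these parameters could be configured
on MTurk through the boto3 API roughly as follows; the question XML file and the
qualification id used to exclude repeat participants are placeholders, not
artefacts of this study:

   import boto3

   # Sketch only: region, question form and qualification id are assumptions.
   mturk = boto3.client("mturk", region_name="us-east-1")

   # Custom qualification granted to workers after they finish a batch,
   # used here to exclude them from all later batches.
   EXCLUDE_QUAL_ID = "REPLACE_WITH_QUALIFICATION_ID"

   with open("tweet_judgement_form.xml") as f:  # hypothetical task form
       question_xml = f.read()

   hit = mturk.create_hit(
       Title="Judge 20 tweets",
       Description="Read 20 tweets and judge each of them.",
       Reward="0.15",                       # USD per assignment
       MaxAssignments=150,                  # 150 workers per task
       AssignmentDurationInSeconds=30 * 60,
       LifetimeInSeconds=7 * 24 * 60 * 60,  # keep the batch open for the week
       Question=question_xml,
       QualificationRequirements=[{
           "QualificationTypeId": EXCLUDE_QUAL_ID,
           "Comparator": "DoesNotExist",    # only workers without the qualification
           "ActionsGuarded": "DiscoverPreviewAndAccept",
       }],
   )
   print(hit["HIT"]["HITId"])

    Workers who complete an assignment would then be granted the exclusion
qualification (e.g. via associate_qualification_with_worker) so that they cannot
take part in the following weekly batches.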




                     Fig. 1. Accuracy distribution over time.


   Table 1 presents the results of the baseline-phase experiments on the tweets
dataset, comparing the two selected platforms. The results show some consistency
over the five runs on each platform. Workers finished the task faster on MTurk,
where the average time per assignment was approximately 4 minutes, compared to
approximately 6 minutes on F8. The overall accuracy
for each run on MTurk was above 73%, whereas it was in the range of 60% on F8,
as shown in Figure 1. Although the accuracy on MTurk is significantly better than
on F8, the total completion time for the whole batch averaged about 3 days on
MTurk versus 4 to 7 hours on F8.

                  Table 1. Results of five runs in MTurk and F8

                                     Run       MTurk             F8
    Average Time per Assignment       1      4 m, 16 s        6 m, 09 s
                                      2      4 m, 49 s        6 m, 33 s
                                      3      4 m, 24 s        6 m, 18 s
                                      4      4 m, 25 s        5 m, 30 s
                                      5      4 m, 37 s        5 m, 49 s
    Avg. Accuracy ± Std. Deviation    1     0.73 ± 0.20      0.63 ± 0.28
                                      2     0.76 ± 0.17      0.66 ± 0.25
                                      3     0.76 ± 0.14      0.67 ± 0.25
                                      4     0.74 ± 0.19      0.66 ± 0.27
                                      5     0.76 ± 0.14      0.64 ± 0.28
    Batch Completion Time             1     3 d, 00 h, 14 m  05 h, 11 m
                                      2     3 d, 01 h, 29 m  04 h, 45 m
                                      3     2 d, 08 h, 36 m  07 h, 10 m
                                      4     3 d, 13 h, 54 m  04 h, 43 m
                                      5     3 d, 03 h, 28 m  04 h, 04 m


    A two-way ANOVA was conducted to examine the effect of the platform and of
repeating the same task several times on the accuracy of the results (Table 2).
The platform had a statistically significant effect on accuracy (p < 0.05), whereas
neither the repetition of the experiment nor its interaction with the platform had
a significant effect, which indicates that the outcome of each platform is consistent
over time. We will investigate the reasons for this difference in accuracy between
the platforms.

                   Table 2. Results of the two-way ANOVA test.

                                      sum_sq      df      F    PR(>F)
                C(Platform)             3.65     1.0   71.4   0.68e-16
                C(Time)                 0.17     4.0    0.84  0.49
                C(Platform):C(Time)     0.08     4.0    0.42  0.80
                Residual               76.17  1490.0    NaN   NaN
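
    The column labels in Table 2 (sum_sq, df, F, PR(>F)) are consistent with the
output of the anova_lm function in statsmodels. As a sketch of how such a test
could be run on per-assignment accuracy data, assuming hypothetical column names
accuracy, Platform and Time that are not specified in the paper:

   import pandas as pd
   import statsmodels.api as sm
   from statsmodels.formula.api import ols

   # One row per assignment; 'Platform' and 'Time' (weekly run) are assumed names.
   df = pd.read_csv("accuracy_per_assignment.csv")

   # Two-way ANOVA with interaction, matching the terms listed in Table 2.
   model = ols("accuracy ~ C(Platform) + C(Time) + C(Platform):C(Time)",
               data=df).fit()
   anova_table = sm.stats.anova_lm(model, typ=2)  # columns: sum_sq, df, F, PR(>F)
   print(anova_table)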




5    Future Directions
In the more advanced phases of this study, we will investigate the reasons behind
the results of the first phase. One of the reasons could be the workers' diversity and
their level of experience. Another reason could be the variation in payment across
the different worker channels in F8. With this study, we hope to reach a reasonable
level of understanding of the best strategies and to advise crowdsourcing users on
how to obtain better service from these systems.


6    Acknowledgements
This project is supported by the European Union’s Horizon 2020 research and
innovation programme under grant agreement No. 732328.


References
1. Bentley, F.R., Daskalova, N., White, B.: Comparing the Reliability of Ama-
   zon Mechanical Turk and Survey Monkey to Traditional Market Research Sur-
   veys. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Hu-
   man Factors in Computing Systems - CHI EA ’17. pp. 1092–1099 (2017).
   https://doi.org/10.1145/3027063.3053335
2. Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H.S.: Repeat-
   able and Reliable Search System Evaluation using Crowdsourcing. Journal of Web
   Semantics 21, 923–932 (2011). https://doi.org/10.1016/j.websem.2013.05.005
3. Crump, M.J.C., McDonnell, J.V., Gureckis, T.M.: Evaluating Amazon’s Mechanical
   Turk as a Tool for Experimental Behavioral Research. PLoS ONE 8(3) (2013).
   https://doi.org/10.1371/journal.pone.0057410
4. Mourelatos, E., Frarakis, N., Tzagarakis, M.: A Study on the Evolution of Crowd-
   sourcing Websites. European Journal of Social Sciences Education and Research
   11(1) (2017)
5. Mourelatos, E., Tzagarakis, M., Dimara, E.: A Review of Online Crowdsourcing
   Platforms. South-Eastern Europe Journal of Economics 14(1), 59–74 (2016)
6. Peer, E., Samat, S., Brandimarte, L., Acquisti, A.: Beyond the Turk: Alternative
   platforms for crowdsourcing behavioral research. Journal of Experimental Social
   Psychology 70, 153–163 (2017). https://doi.org/10.1016/j.jesp.2017.01.006
7. Tonon, A., Demartini, G., Cudré-Mauroux, P.: Combining inverted indices
   and structured search for ad-hoc object retrieval. In: SIGIR. p. 125 (2012).
   https://doi.org/10.1145/2348283.2348304
8. Tonon, A., Demartini, G., Cudré-Mauroux, P.: Pooling-based continuous evalua-
   tion of information retrieval systems. Information Retrieval 18(5), 445–472 (2015).
   https://doi.org/10.1007/s10791-015-9266-y