Work Like a Bee - Taking Advantage of Diligent Crowdsourcing Workers

Michael Riegler, Preben N. Olsen, Pål Halvorsen
Simula Research Laboratory AS
michael@simula.no, preben@simula.no, paalh@simula.no

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This paper presents our approach for the Crowdsourcing Task of the MediaEval 2014 Benchmark. The proposed solution is based on the assumption that the number of Human Intelligence Tasks (HITs) completed by a worker is representative of his diligence, making workers who complete high volumes of work more reliable than low-performing workers. Our approach gives a baseline evaluation indicating the usefulness of looking at the number of tasks completed by a worker.

1. INTRODUCTION
Crowdsourcing creates many opportunities and is gaining momentum as an area of interest within the multimedia community. Moreover, current web-based services like Amazon Mechanical Turk, Microworkers, and Crowdflower have simplified the task of leveraging the power of human computation.

The biggest problem in crowdsourcing is still the reliability of the workers. The information we receive using crowdsourcing is unreliable because of workers who try to trick the system, spam, or simply do not understand the task properly. The law of large numbers (LLN) describes how noise is averaged out and its effects are removed with a large number of experiments, but increasing the number of experiments directly affects costs. This is why the crowdsourcing exercise for the Crowdsorting Timed Comments Task this year focuses on computing correct labels based on noisy crowdsourced metadata or content information.

Related work in this area can, for example, be found in [4, 3]. These approaches try to calculate the correctness of the workers or use features of the media files, such as global image features, for a classification.

In contrast, the solution proposed here is based on the assumption that workers who complete a high number of tasks are high performers, either because they enjoy the task or because they understand the task well enough to do it efficiently. We believe that both of these circumstances lead to reliable results with respect to HITs. As a secondary approach, we also used labels collected from additional crowdsourcing workers, which means that we asked new workers to redo HITs where the original workers could not come to an agreement.

2. APPROACHES
This section describes our two approaches. As mentioned, our main approach is to find the most diligent workers, while the second approach is based on the idea of collecting additional crowdsourcing votes. Quality control is a prerequisite of a well-designed crowdsourcing HIT, and to increase the quality of votes for this work, the task organizers included a qualification HIT to make sure that workers understood the task at hand. As the main task was to classify drops in music tracks, the workers had to prove that they could classify a drop correctly. Only the workers who passed the qualification HIT were allowed to continue. Because of this pre-quality control, we did not perform any additional quality control.

2.1 Diligent Workers
The idea of diligent workers is based on the work presented by Kazai et al. [2], which describes five different types of workers: (1) diligent, (2) competent, (3) sloppy, (4) incompetent, and (5) spammers. Diligent workers are identified by the number of completed HITs they produce for a particular task. The authors also state that most of the HITs are done by the same group of workers: the distribution of HITs over workers follows a power law, and around 54% of the workers on a crowdsourcing task complete only a single HIT. An important insight from this work is that diligent workers can be identified by the number of HITs per task. After comparing the number of completed HITs per worker, we chose a subset of diligent workers. The size of this subset is chosen based on the overall distribution of performed HITs between all workers. Experiments on a development set showed that keeping the top 30% of workers leads to a good result. This subset then represents diligent workers who can be trusted, and their votes can be used in different ways, e.g., given a higher weight or considered exclusively.
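To make the selection step concrete, the following minimal Python sketch ranks workers by the number of HITs they completed and keeps the top share of them. It assumes the votes are available as (worker_id, segment_id, label) tuples; the function name and the 30% default are illustrative, not part of any released task code.

    from collections import Counter

    def diligent_subset(votes, fraction=0.30):
        """Return the IDs of the most diligent workers.

        votes: iterable of (worker_id, segment_id, label) tuples.
        fraction: share of workers, ranked by completed HITs, to keep;
                  0.30 follows the development-set experiments above.
        """
        hits_per_worker = Counter(worker for worker, _, _ in votes)
        ranked = [worker for worker, _ in hits_per_worker.most_common()]
        cutoff = max(1, int(len(ranked) * fraction))
        return set(ranked[:cutoff])

In practice, the cut-off fraction would be tuned on a development set against the overall distribution of HITs per worker, as described above.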
2.2 Additional Crowdsourcing
For the HITs without a clear result through the majority vote between the three provided workers, or through the weighted subset of the best-performing workers, we used additional crowdsourcing. We developed an HTML- and SQL-based platform that gave us the opportunity to perform these tests in our lab. The requirement for this additional test was that the participants had to try their best to find the right answer for the HIT.

3. EXPERIMENTAL SETUP
The provided dataset contains 591 songs, metadata, and labels generated by human computation, but because some of the songs are duplicates, only 537 of them were used in the evaluation. The task's main goal was to classify a drop in music within a limited timespan. A drop can be seen as an event that builds up to a change of the beat or melody in the song, i.e., a characteristic also found in electronic dance music, and is more than just a simple change. Workers could give three different labels to each song segment: (1) the segment contains a complete drop, (2) the segment only contains a partial drop, and (3) the segment contains no drop [1].

We assessed four different methods executed in four runs. A summarized overview and a short description of each method are provided in Table 1. The first method (R1) considers the majority vote (MJV) between the three votes provided by the original dataset, with additional crowdsourcing for unclear answers. In the second run (R2), we only consider the votes provided by our diligent workers subset. Our third method (R3) takes the majority vote into account, but adds a higher weight to votes provided by diligent workers. The fourth and last method (R4) uses the results provided by R3, but with additional crowdsourcing for ambiguous answers (where MJV could not clearly lead to a label).

Table 1: Configuration of the four different methods evaluated.
Run   Description
R1    MJV with additional crowdsourcing
R2    Diligent workers vote only
R3    MJV with weighted diligent workers
R4    R3 with additional crowdsourcing
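As an illustration of how the aggregation schemes in Table 1 can be realized, the sketch below computes, per segment, a plain majority vote (R1 before additional crowdsourcing), a diligent-workers-only vote (R2), and a weighted majority vote (R3). Ties are returned as None and would be passed on to additional crowdsourcing (R1/R4). The data layout and the weight value are assumptions made for the example; the paper does not prescribe them.

    from collections import Counter, defaultdict

    def aggregate_labels(votes, diligent_workers, weight=2.0):
        """Per-segment label aggregation illustrating R1-R3.

        votes: iterable of (worker_id, segment_id, label) tuples.
        diligent_workers: set of worker IDs treated as diligent.
        weight: hypothetical extra weight for diligent votes (R3).
        Returns: segment_id -> (mjv, diligent_only, weighted) labels,
        where None marks a tie to be resolved by extra crowdsourcing.
        """
        per_segment = defaultdict(list)
        for worker, segment, label in votes:
            per_segment[segment].append((worker, label))

        def winner(counts):
            # Single most voted label, or None on a tie / no votes.
            top = counts.most_common(2)
            if not top or (len(top) > 1 and top[0][1] == top[1][1]):
                return None
            return top[0][0]

        results = {}
        for segment, pairs in per_segment.items():
            mjv = winner(Counter(label for _, label in pairs))
            diligent_only = winner(Counter(label for w, label in pairs
                                           if w in diligent_workers))
            weighted = Counter()
            for w, label in pairs:
                weighted[label] += weight if w in diligent_workers else 1.0
            results[segment] = (mjv, diligent_only, winner(weighted))
        return results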
4. RESULTS
Table 2 shows our benchmark results, while Table 3 shows the results for the most frequent class (MFC) baseline, in which all segments are assigned the most frequent class label in the dataset. Performance is measured by the weighted harmonic mean of precision and recall (WF1-score). This is done to avoid unreliable results caused by the imbalance of the classes.

Table 2: MediaEval 2014 Benchmark results.
Run   WF1-score   True Labels     Predicted Labels
R1    0.7207      183, 63, 291    192, 68, 279
R2    0.6919      183, 63, 291    208, 95, 234
R3    0.6912      183, 63, 291    208, 87, 242
R4    0.6919      183, 63, 291    208, 95, 234

Table 3: Most frequent class baseline for the given dataset.
Baseline   WF1-score   True Labels     Predicted Labels
MFC        0.3809      183, 63, 291    0, 0, 537

We see from Table 2 that every evaluated method outperforms the most frequent class baseline by at least 30%. The best-performing method is R1, with a WF1-score of 0.7207. Compared to R1, the three other methods show a performance drop of around 3%. These methods are not distinguishable with respect to the results they produce, which might be because each of them relies on the votes provided by the subset of diligent workers. We find it interesting that R3 and R4, which complement the diligent workers with MJV and additional crowdsourcing, do not significantly increase performance compared to R2.

Moreover, the performance difference between R1 and R2 is low, which strongly indicates that the assumption that workers who complete the majority of crowdsourcing tasks also perform better is valid. This is a promising insight that can cut costs and yield more accurate crowdsourcing results. For example, by identifying diligent workers early in task execution, one can annotate their votes and only consider them as in R2, or weight their votes differently as in R3. That said, we also want to point out that there is a chance our results are dataset specific, and further investigations on multiple and larger datasets are needed.

At last, we want to highlight that additional crowdsourcing does not increase accuracy when diligent workers are considered. This indicates that the quality of work and worker motivation are more important than the number of workers used or votes gathered.
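Closing this section, and as a reproducibility aid, the WF1-scores reported above can be approximated with the weighted F1 implementation in scikit-learn. This is a sketch under that assumption, not the official scoring script of the task.

    from collections import Counter
    from sklearn.metrics import f1_score

    def wf1_and_baseline(true_labels, predicted_labels):
        """Weighted F1 (WF1) of the predictions plus the score of the
        most-frequent-class (MFC) baseline, as in Tables 2 and 3."""
        wf1 = f1_score(true_labels, predicted_labels, average="weighted")
        mfc_label = Counter(true_labels).most_common(1)[0][0]
        mfc_wf1 = f1_score(true_labels, [mfc_label] * len(true_labels),
                           average="weighted")
        return wf1, mfc_wf1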
5. CONCLUSION
This paper presents two approaches for classifying drops in electronic dance music segments by utilizing human computation and crowdsourcing. The results and insights gained by evaluating four different methods indicate that the proposed approach, which assumes that diligent workers also provide better work quality, is promising. Our investigation also indicates that additional crowdsourcing does not improve results when diligent workers are considered.

For assurance and increased certainty, we recognize the need to extend the work to multiple and larger datasets. Additional future work includes pairing crowdsourcing results with computer-generated content analysis and further classification of diligent workers.

6. ACKNOWLEDGMENT
This work has been funded by the NFR-funded FRINATEK project "Efficient Execution of Large Workloads on Elastic Heterogeneous Resources" (EONS) (project number 231687) and the iAD center for Research-based Innovation (project number 174867) funded by the Norwegian Research Council.

7. REFERENCES
[1] K. Yadati, P. S. N. Chandrasekaran Ayyanathan, and M. Larson. Crowdsorting timed comments about music: Foundations for a new crowdsourcing task. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] G. Kazai, J. Kamps, and N. Milic-Frayling. Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1941-1944. ACM, 2011.
[3] B. Loni, J. Hare, M. Georgescu, M. Riegler, X. Zhu, M. Morchid, R. Dufour, and M. Larson. Getting by with a little help from the crowd: Practical approaches to social image labeling. In CrowdMM '14, November 3-7, 2014, Orlando, FL, USA. ACM, 2014.
[4] M. Riegler, M. Lux, and C. Kofler. Frame the crowd: Global visual features labeling boosted with crowdsourcing information. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.