     An investigation of techniques that aim to improve the
            quality of labels provided by the crowd

                           Jonathon Hare, Anna Weston, Elena Simperl
              jsh2@ecs.soton.ac.uk, aw3g10@ecs.soton.ac.uk, E.Simperl@soton.ac.uk
                          Sina Samangooei, David Dupplaw, Paul Lewis
                  ss@ecs.soton.ac.uk, dpd@ecs.soton.ac.uk, phl@ecs.soton.ac.uk
                   Electronics and Computer Science, University of Southampton, United Kingdom
                                                     Maribel Acosta
                                                 maribel.acosta@kit.edu
                                Institute AIFB, Karlsruhe Institute of Technology, Germany

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
The 2013 MediaEval Crowdsourcing task looked at the problem of
working with noisy crowdsourced annotations of image data. The aim
of the task was to investigate possible techniques for estimating
the true labels of an image by using the set of noisy crowdsourced
labels, and possibly any content and metadata from the image itself.
For the runs in this paper, we've applied a shotgun approach and
tried a number of existing techniques, which include generative
probabilistic models and further crowdsourcing.

[Figure 1: Generative model of crowdworkers: (a) incorporating
per-item difficulty and per-worker reliability; (b) incorporating
per-item difficulty, per-worker reliability and features describing
the image.]

1.   INTRODUCTION
   Crowdsourcing is increasingly becoming a popular way of extracting
information. One problem with crowdsourcing is that workers can have
a number of traits that affect the quality of the work they perform.
One standard way of dealing with the problem of noisy data is to ask
multiple workers to perform the same task and then combine their
labels in order to obtain a final estimate.
   Perhaps the most intuitive way of combining the labels of multiple
workers is through majority voting; however, other possibilities
exist. The aim of the 2013 MediaEval Crowdsourcing task [1] was to
explore techniques with which better estimates of the true labels can
be created. Our run submissions for this task explore a number of
techniques to achieve this: probabilistic models of workers (i.e.
estimating which workers are bad, and discounting their votes),
additional crowdsourcing of images without a clear majority vote, and
joint probabilistic models that take into account both the
crowdsourced votes and extracted features.

2.   METHODOLOGY
   As described previously, the overall methodology for our run
submissions was to take a shotgun approach and try three
fundamentally different approaches to the problem: a generative
probabilistic model of workers; extra crowdsourcing; and joint
modelling. The techniques and data we used for each run are
summarised in Table 1. Specific details on each run are given below.

2.1     Run 1
   The first run was required to make use of only the provided
crowdsourced labels. For this run, we applied the generative model
developed by Paul Mineiro [3], illustrated in Figure 1a. This model
extends the one by Whitehill et al. [5] by incorporating a
hierarchical Gaussian prior on the elements of the confusion matrix
(i.e. the γ hyper-parameter in the figure). Briefly, the model
assumes that an unobserved ground-truth label z combines with a
per-worker model parametrized by a vector α and a scalar item
difficulty β to generate an observed worker label l for an image. The
hyper-parameter γ moderates the worker reliability as a function of
the label class. The model parameters are learnt using a 'Bayesian'
Expectation-Maximisation algorithm. For our experiments with this
model, we used the nominallabelextract implementation published by
Paul Mineiro¹ with uniform class priors. Note that the software was
applied to the data from each of the two questions asked of the
workers separately, and "NotSure" answers were treated as unknowns
(not included in the input data).

¹ http://code.google.com/p/nincompoop/downloads/
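
To make the flavour of this family of models concrete, the following
is a minimal sketch of EM-based label aggregation with a single
reliability parameter per worker. It is our simplified illustration,
not Mineiro's nominallabelextract, which additionally models per-item
difficulty and the hierarchical prior on the confusion matrices:

import numpy as np

def em_aggregate(labels, n_items, n_workers, n_classes=2, iters=50):
    # labels: list of (item, worker, label) triples with integer labels.
    # Returns a posterior distribution over the true label of each item.
    post = np.full((n_items, n_classes), 1.0 / n_classes)
    counts = np.zeros((n_items, n_classes))
    for i, w, l in labels:
        counts[i, l] += 1
    seen = counts.sum(axis=1) > 0
    post[seen] = counts[seen] / counts[seen].sum(axis=1, keepdims=True)

    reliability = np.full(n_workers, 0.8)  # P(worker label == true label)
    for _ in range(iters):
        # M-step: a worker's reliability is the expected fraction of
        # their labels that agree with the current estimate of the truth.
        num, den = np.zeros(n_workers), np.zeros(n_workers)
        for i, w, l in labels:
            num[w] += post[i, l]
            den[w] += 1.0
        reliability = np.clip(num / np.maximum(den, 1.0), 1e-3, 1.0 - 1e-3)

        # E-step: recompute the item posteriors from the worker
        # reliabilities, assuming uniform class priors (as in run 1).
        logp = np.zeros((n_items, n_classes))
        for i, w, l in labels:
            for c in range(n_classes):
                p = reliability[w] if l == c \
                    else (1.0 - reliability[w]) / (n_classes - 1)
                logp[i, c] += np.log(p)
        post = np.exp(logp - logp.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post

The estimated label for an image is then the argmax of its posterior.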
                      Table 1: Configuration of the submitted runs.

                                  Data                                      Technique
           Provided     Additional Labels         Features       Majority Probabilistic Probabilistic
  Run #     Labels     Crowdsourced  Expert   Metadata  Visual     vote       Worker        Joint
    1          X                                                                 X
    2          X            X           X                            X
    3          X            X           X                                        X
    4          X            X           X         X                                           X
    5          X            X           X         X        X                                  X


2.2     Run 2
   For the second run, we gathered additional data in two ways.
Firstly, we randomly selected 1000 images from the test set and had
them annotated by two reliable experts. The two experts first
annotated the data independently, and agreed on 671 of these images
(across both questions). For the images on which they disagreed on
either question, they collaboratively came to a decision about the
true label for both questions. The relatively low level of initial
agreement between the experts is an indication of the subjectiveness
of the labelling task being performed (especially with respect to
question 1, "is this a fashion image"). Secondly, for the images in
the test set that had at least two "NotSure" answers, we gathered
more responses through additional crowdsourcing using the
CrowdFlower² platform. In total, we gathered an additional 824
responses over 421 images from this extra crowdsourcing. In order to
produce the estimates we performed majority voting.

² http://crowdflower.com
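
As a concrete illustration of the selection and aggregation logic for
this run, consider the following sketch (the function names are ours,
and the tie-breaking behaviour is an assumption the paper does not
specify):

from collections import Counter

def needs_more_labels(answers, min_not_sure=2):
    # Flag an image for extra crowdsourcing, as in run 2: it received
    # at least two "NotSure" responses.
    return sum(a == "NotSure" for a in answers) >= min_not_sure

def majority_vote(answers):
    # Majority vote over the definite answers, ignoring "NotSure";
    # Counter.most_common breaks ties arbitrarily.
    votes = Counter(a for a in answers if a != "NotSure")
    return votes.most_common(1)[0][0] if votes else None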
2.3     Run 3
   In the third run, we applied the model used in run 1 to the data
in run 2. The original worker labels and additional crowdsourced
labels were combined and used as the primary input. The expert labels
were used to clamp the model at the respective images in order to
obtain a better fit.
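
In terms of the em_aggregate sketch above, clamping can be
implemented by resetting the posterior of each expert-labelled item
to a point mass after every E-step (expert_labels being a
hypothetical mapping from item index to expert label):

def clamp(post, expert_labels):
    # Fix the posteriors at the expert-labelled items so that the
    # worker model is fitted against these known answers.
    for item, label in expert_labels.items():
        post[item, :] = 0.0
        post[item, label] = 1.0
    return post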
2.4     Run 4
   In the fourth run, we chose to explore the use of another
generative model developed by Paul Mineiro [2]. This model is
inspired by the work of Raykar et al. [4], and incorporates the
notion that the hidden unknown true label also generates a set of
observed features (ψ). This is illustrated in the plate diagram shown
in Figure 1b.
   Mineiro developed an online procedure that jointly learns the
model parameters and a logistic regressor for creating
classifications (estimates of the true label) from the features. A
nice feature of this approach is that in each iteration of
learning/fitting, the worker model informs the classifier and the
classifier informs the worker model.
   For this run, the features used were bag-of-words features
extracted from the tags, titles, descriptions, contexts and notes
metadata of each image.
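
A minimal batch sketch of this mutual information flow (our
simplification of the Raykar-style idea, not Mineiro's online
algorithm), assuming binary labels, scikit-learn's
LogisticRegression, and the single-reliability worker model from the
run 1 sketch:

import numpy as np
from sklearn.linear_model import LogisticRegression

def joint_em(features, labels, n_items, n_workers, iters=10):
    # features: (n_items, d) matrix; labels: (item, worker, label)
    # triples with binary labels. Returns P(z_i = 1) and the classifier.
    post = np.full(n_items, 0.5)
    reliability = np.full(n_workers, 0.8)
    clf = LogisticRegression(max_iter=1000)

    for _ in range(iters):
        # The classifier is fitted to the current label estimates and
        # then supplies a per-item prior (a weighted fit on the soft
        # posteriors would be more faithful than this thresholding).
        clf.fit(features, (post > 0.5).astype(int))
        prior = np.clip(clf.predict_proba(features)[:, 1], 1e-6, 1 - 1e-6)

        # E-step: combine the classifier prior with worker likelihoods.
        log_odds = np.log(prior / (1.0 - prior))
        for i, w, l in labels:
            r = reliability[w]
            log_odds[i] += np.log(r / (1.0 - r)) * (1.0 if l == 1 else -1.0)
        post = 1.0 / (1.0 + np.exp(-log_odds))

        # M-step: re-estimate worker reliabilities from the posteriors.
        num, den = np.zeros(n_workers), np.zeros(n_workers)
        for i, w, l in labels:
            num[w] += post[i] if l == 1 else 1.0 - post[i]
            den[w] += 1.0
        reliability = np.clip(num / np.maximum(den, 1.0), 1e-3, 1.0 - 1e-3)
    return post, clf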
2.5     Run 5
   Finally, for the fifth run, we applied the same technique as used
in run 4, but also incorporated a Pyramid Histogram of Words (PHOW)
feature extracted from the images themselves, on top of the metadata
features. The PHOW features were created from dense SIFT features
quantised into 300 visual words and aggregated into a pyramid with
2×2 and 4×4 blocks.
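
The following sketch illustrates such a feature construction (ours,
not necessarily the implementation used for the run; the grid step,
descriptor size and normalisation are assumptions): dense SIFT is
computed on a regular grid with OpenCV, quantised against a 300-word
vocabulary (a scikit-learn KMeans with n_clusters=300, fitted on
descriptors from the training images), and pooled over 2×2 and 4×4
blocks:

import cv2
import numpy as np

def dense_sift(gray, step=8, size=8):
    # SIFT descriptors computed at a regular grid of keypoints
    # (gray: single-channel uint8 image).
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(step // 2, gray.shape[0], step)
           for x in range(step // 2, gray.shape[1], step)]
    kps, descs = sift.compute(gray, kps)
    return kps, descs

def phow_feature(gray, kmeans, levels=(2, 4)):
    # Quantise dense SIFT into visual words and pool the word
    # histograms over a spatial pyramid of 2x2 and 4x4 blocks,
    # as described for run 5.
    kps, descs = dense_sift(gray)
    words = kmeans.predict(descs)
    k = kmeans.n_clusters
    h, w = gray.shape
    hists = []
    for n in levels:
        hist = np.zeros((n, n, k))
        for kp, word in zip(kps, words):
            bx = min(int(kp.pt[0] * n / w), n - 1)
            by = min(int(kp.pt[1] * n / h), n - 1)
            hist[by, bx, word] += 1
        hists.append(hist.ravel())
    feat = np.concatenate(hists)
    return feat / max(feat.sum(), 1.0)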
3.   RESULTS AND DISCUSSION
   The results of the five runs are shown in Table 2. Whilst we can't
currently make global comments as to how well these runs performed
compared to naïve majority voting, we can note a few points. Firstly,
looking at runs 2 and 3, which used the same data, we can see that
the generative model used in run 3 gave a minor improvement for the
second label, but had a large negative effect for the first label. It
is also clear that the more advanced models (runs 4 and 5), which
took features into account, performed less well on this data than
hoped. Interestingly, when we applied both generative models to the
smaller MMSys dataset we saw a slight improvement. One possible
reason for the relatively low performance of the generative models on
the first label could well be the subjectiveness of the question
being asked, which would lead to errors when fitting the models. This
would also help explain why additional crowdsourcing seems to improve
results.

              Table 2: Results for each run and label.

              Run #   Label 1 F1 Score   Label 2 F1 Score
                1          0.7352             0.7636
                2          0.8377             0.7621
                3          0.7198             0.7710
                4          0.7097             0.7528
                5          0.6427             0.6026

4.   ACKNOWLEDGMENTS
   The described work was funded by the Engineering and Physical
Sciences Research Council under the SOCIAM platform grant, and by the
European Union Seventh Framework Programme (FP7/2007-2013) under
grant agreements 270239 (ARCOMEM) and 287863 (TrendMiner).

5.   REFERENCES
[1] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for
    Social Multimedia at MediaEval 2013: Challenges, Data set, and
    Evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October
    18-19 2013.
[2] P. Mineiro. Logistic Regression for Crowdsourced Data.
    http://www.machinedlearnings.com/2011/11/logistic-regression-for-crowdsourced.html.
[3] P. Mineiro. Modeling Mechanical Turk Part II.
    http://www.machinedlearnings.com/2011/01/modeling-mechanical-turk-part-ii.html.
[4] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin,
    L. Bogoni, and L. Moy. Learning from crowds. J. Mach. Learn.
    Res., 11:1297–1322, Aug. 2010.
[5] J. Whitehill, P. Ruvolo, T.-f. Wu, J. Bergsma, and J. Movellan.
    Whose vote should count more: Optimal integration of labels from
    labelers of unknown expertise. In Advances in Neural Information
    Processing Systems 22, pages 2035–2043, December 2009.