=Paper=
{{Paper
|id=None
|storemode=property
|title=An Investigation of Techniques that Aim to Improve the Quality of Labels provided by the Crowd
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_44.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/HareAWSSDL13
}}
==An Investigation of Techniques that Aim to Improve the Quality of Labels provided by the Crowd==
Jonathon Hare, Anna Weston, Elena Simperl, Sina Samangooei, David Dupplaw, Paul Lewis
Electronics and Computer Science, University of Southampton, United Kingdom
jsh2@ecs.soton.ac.uk, aw3g10@ecs.soton.ac.uk, E.Simperl@soton.ac.uk, ss@ecs.soton.ac.uk, dpd@ecs.soton.ac.uk, phl@ecs.soton.ac.uk

Maribel Acosta
Institute AIFB, Karlsruhe Institute of Technology, Germany
maribel.acosta@kit.edu

ABSTRACT

The 2013 MediaEval Crowdsourcing task looked at the problem of working with noisy crowdsourced annotations of image data. The aim of the task was to investigate possible techniques for estimating the true labels of an image by using the set of noisy crowdsourced labels, and possibly any content and metadata from the image itself. For the runs in this paper, we've applied a shotgun approach and tried a number of existing techniques, which include generative probabilistic models and further crowdsourcing.

[Figure 1: Generative model of crowdworkers: (a) incorporating per-item difficulty and per-worker reliability; (b) incorporating per-item difficulty, per-worker reliability and features describing the image.]

1. INTRODUCTION

Crowdsourcing is increasingly becoming a popular way of extracting information. One problem with crowdsourcing is that the workers can have a number of traits that affect the quality of the work they are performing. One standard way of dealing with the problem of noisy data is to ask multiple workers to perform the same task and then combine the labels of the workers in order to obtain a final estimate.

Perhaps the most intuitive way of combining labels of multiple workers is through majority voting, however other possibilities exist. The aim of the 2013 MediaEval Crowdsourcing task [1] was to explore techniques in which better estimates of the true labels can be created. Our run submissions for this task explore a number of techniques to achieve this: probabilistic models of workers (i.e. estimating which workers are bad, and discounting their votes), additional crowdsourcing of images without a clear majority vote, and joint probabilistic models that take into account both the crowdsourced votes as well as extracted features.

2. METHODOLOGY

As described previously, the overall methodology for our run submissions was to take a shotgun approach and try three fundamentally different approaches (generative probabilistic model of workers; extra crowdsourcing; and joint modelling) to the problem. The techniques and data we used for each run are summarised in Table 1. Specific details on each run are given below.

Table 1: Configuration of the submitted runs.

  Run   Provided   Crowdsourced   Expert   Metadata   Visual     Majority   Probabilistic   Probabilistic
        labels     labels         labels   features   features   vote       worker          joint
  1     X          -              -        -          -          -          X               -
  2     X          X              X        -          -          X          -               -
  3     X          X              X        -          -          -          X               -
  4     X          X              X        X          -          -          -               X
  5     X          X              X        X          X          -          -               X

2.1 Run 1

The first run was required to only make use of the provided crowdsourced labels. For this run, we applied the generative model developed by Paul Mineiro [3], illustrated in Figure 1a. This model extends the one by Whitehill et al. [5] by incorporating a hierarchical Gaussian prior on the elements of the confusion matrix (i.e. the γ hyper-parameter in the figure). Briefly, the model assumes an unobserved ground truth label z combines with a per-worker model parametrized by vector α and scalar item difficulty β to generate an observed worker label l for an image. The hyper-parameter γ moderates the worker reliability as a function of the label class. The model parameters are learnt using a 'Bayesian' Expectation-Maximisation algorithm. For our experiments with this model, we used the nominallabelextract implementation published by Paul Mineiro (http://code.google.com/p/nincompoop/downloads/) with uniform class priors. Note that the software was applied to data from each of the two questions asked of the workers separately, and "NotSure" answers were treated as unknowns (not included in the input data).
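The estimate-and-reweight loop behind this family of models can be illustrated with a small amount of code. The sketch below is not the nominallabelextract implementation used for this run: it assumes binary labels and replaces the per-worker confusion matrix, item difficulty and hierarchical prior with a single smoothed per-worker accuracy, and all function and variable names are illustrative. It alternates between estimating the posterior over each image's true label from the current worker reliabilities and re-estimating each worker's reliability from the current soft labels.

<pre>
import math
from collections import defaultdict

def em_label_aggregation(votes, n_iter=50, prior=0.5):
    """Estimate true binary labels from noisy worker votes.

    votes : iterable of (worker_id, item_id, label) tuples, label in {0, 1}.
    Returns a dict of P(label=1) per item and a dict of accuracy per worker.
    """
    votes = list(votes)
    by_item = defaultdict(list)                  # item_id -> [(worker_id, label), ...]
    for w, i, l in votes:
        by_item[i].append((w, l))
    workers = {w for w, _, _ in votes}

    # Initialise the item posteriors from the (smoothed) fraction of positive votes.
    p_item = {i: (sum(l for _, l in ws) + 1.0) / (len(ws) + 2.0)
              for i, ws in by_item.items()}
    accuracy = {w: 0.8 for w in workers}         # start by assuming workers are fairly good

    for _ in range(n_iter):
        # M-step: a worker's accuracy is how often they agree with the current
        # soft estimates of the true labels (Beta(1,1) smoothing keeps it in (0,1)).
        agree = defaultdict(float)
        total = defaultdict(float)
        for i, ws in by_item.items():
            for w, l in ws:
                agree[w] += p_item[i] if l == 1 else 1.0 - p_item[i]
                total[w] += 1.0
        accuracy = {w: (agree[w] + 1.0) / (total[w] + 2.0) for w in workers}

        # E-step: posterior over each item's true label given the worker votes.
        for i, ws in by_item.items():
            log_p1, log_p0 = math.log(prior), math.log(1.0 - prior)
            for w, l in ws:
                a = accuracy[w]
                log_p1 += math.log(a) if l == 1 else math.log(1.0 - a)
                log_p0 += math.log(a) if l == 0 else math.log(1.0 - a)
            m = max(log_p1, log_p0)
            p1, p0 = math.exp(log_p1 - m), math.exp(log_p0 - m)
            p_item[i] = p1 / (p1 + p0)

    return p_item, accuracy
</pre>

A hard label for an image can then be taken as positive when its posterior exceeds 0.5; workers whose estimated accuracy stays near 0.5 contribute little to the posterior, which is the discounting effect the full model achieves with its richer worker parametrisation.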
2.2 Run 2

For the second run, we gathered additional data in two ways. Firstly, we randomly selected 1000 images from the test set and had them annotated by two reliable experts. The two experts first annotated the data independently and came to agreement on 671 of these (across both questions). For the images on which they didn't agree for either question, they collaboratively came to a decision about the true label for both questions. The relatively low level of initial agreement between the experts is an indication of the subjectiveness of the labelling task being performed (especially with respect to question 1, "is this a fashion image"). Secondly, for the images in the test set that had at least two "NotSure" answers, we gathered more responses through additional crowdsourcing using the CrowdFlower platform (http://crowdflower.com). In total we gathered an additional 824 responses over 421 images from this extra crowdsourcing. In order to produce the estimates we performed majority voting.

2.3 Run 3

In the third run, we applied the model used in run 1 to the data in run 2. The original worker labels and additional crowdsourced labels were combined and used as the primary input. The expert labels were used to clamp the model at the respective images in order to obtain a better fit.

2.4 Run 4

In the fourth run, we chose to explore the use of another generative model developed by Paul Mineiro [2]. This model is inspired by the work of Raykar et al. [4], and incorporates the notion of the hidden unknown true label also generating a set of observed features (ψ). This is illustrated in the plate diagram shown in Figure 1b.

Mineiro developed an online procedure for learning the model parameters that jointly learns a logistic regressor for producing classifications (estimates of the true label) from the features. A nice feature of this approach is that in each iteration of learning/fitting, the worker model informs the classifier and the classifier informs the worker model. For this run, the features used were bag-of-words features extracted from the tags, titles, descriptions, contexts and notes metadata of each image.

2.5 Run 5

Finally, for the fifth run, we applied the same technique as used in run 4, but also incorporated a Pyramid Histogram of Words (PHOW) feature extracted from the images themselves on top of the metadata features. The PHOW features were created from dense SIFT features quantised into 300 visual words and aggregated into a pyramid with 2×2 and 4×4 blocks.
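The alternating scheme behind runs 4 and 5 can be sketched in a similar way. The code below is not Mineiro's online implementation: it is a simplified batch illustration that assumes binary labels, replaces the full worker model with a single smoothed per-worker accuracy, and uses scikit-learn's LogisticRegression in place of the jointly learnt regressor; all names are illustrative. Each iteration fits the classifier against the current soft label estimates and then feeds its predictions back in as the per-item prior when the worker votes are re-aggregated, so the classifier and the worker model inform one another as described above.

<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression

def joint_label_and_classifier(votes, X, n_iter=20):
    """Jointly estimate binary labels, worker accuracies and a feature classifier.

    votes : list of (worker_id, item_index, label) tuples, label in {0, 1}.
    X     : (n_items, n_features) array of per-item features (e.g. bag-of-words).
    Returns P(label=1) per item, a dict of worker accuracies and the fitted classifier.
    """
    n_items = X.shape[0]
    workers = sorted({w for w, _, _ in votes})
    acc = {w: 0.8 for w in workers}              # initial worker reliabilities
    p = np.full(n_items, 0.5)                    # soft estimates of the true labels
    clf = LogisticRegression(max_iter=1000)

    for _ in range(n_iter):
        # Fit the classifier against the soft labels by presenting every item
        # twice (once as positive, once as negative) with probability weights.
        X_dup = np.vstack([X, X])
        y_dup = np.concatenate([np.ones(n_items), np.zeros(n_items)])
        w_dup = np.concatenate([p, 1.0 - p]) + 1e-6
        clf.fit(X_dup, y_dup, sample_weight=w_dup)
        prior1 = clf.predict_proba(X)[:, 1]      # classifier output acts as the label prior

        # Re-estimate worker accuracies as (smoothed) agreement with the soft labels.
        agree = {w: 1.0 for w in workers}
        total = {w: 2.0 for w in workers}
        for w, i, l in votes:
            agree[w] += p[i] if l == 1 else 1.0 - p[i]
            total[w] += 1.0
        acc = {w: agree[w] / total[w] for w in workers}

        # Combine the classifier prior with the worker votes to update the posteriors.
        log1 = np.log(np.clip(prior1, 1e-9, 1.0 - 1e-9))
        log0 = np.log(np.clip(1.0 - prior1, 1e-9, 1.0 - 1e-9))
        for w, i, l in votes:
            a = acc[w]
            log1[i] += np.log(a if l == 1 else 1.0 - a)
            log0[i] += np.log(a if l == 0 else 1.0 - a)
        m = np.maximum(log1, log0)
        p = np.exp(log1 - m) / (np.exp(log1 - m) + np.exp(log0 - m))

    return p, acc, clf
</pre>

In this sketch, run 5 would correspond to appending the PHOW histogram of each image to its row of X alongside the metadata bag-of-words.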
3. RESULTS AND DISCUSSION

The results of the five runs are shown in Table 2.

Table 2: Results for each run and label.

  Run #   Label 1 F1 score   Label 2 F1 score
  1       0.7352             0.7636
  2       0.8377             0.7621
  3       0.7198             0.7710
  4       0.7097             0.7528
  5       0.6427             0.6026

Whilst we can't currently make global comments as to how well these runs performed compared to naïve majority voting, we can note a few points. Firstly, looking at runs 2 and 3, which used the same data, we can see that the generative model used in run 3 gave a minor improvement for the second label, but had a big negative effect for the first label. It's also clear that the more advanced models (runs 4 and 5), which took features into account, performed less well on this data than hoped. Interestingly, when we applied both generative models to the smaller MMSys dataset we saw a slight improvement. One possible reason for the relatively low performance of the generative models on the first label could well be the subjectiveness of the question being asked, which would lead to errors when fitting the models. This would also help explain why additional crowdsourcing seems to improve results.

4. ACKNOWLEDGMENTS

The described work was funded by the Engineering and Physical Sciences Research Council under the SOCIAM platform grant, and by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements 270239 (ARCOMEM) and 287863 (TrendMiner).

5. REFERENCES

[1] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[2] P. Mineiro. Logistic Regression for Crowdsourced Data. http://www.machinedlearnings.com/2011/11/logistic-regression-for-crowdsourced.html.
[3] P. Mineiro. Modeling Mechanical Turk Part II. http://www.machinedlearnings.com/2011/01/modeling-mechanical-turk-part-ii.html.
[4] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. J. Mach. Learn. Res., 11:1297–1322, Aug. 2010.
[5] J. Whitehill, P. Ruvolo, T.-f. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043, December 2009.