SQUARE: Benchmarking Crowd Consensus at MediaEval

Aashish Sheshadri, Department of Computer Science, The University of Texas at Austin (aashishs@cs.utexas.edu)
Matthew Lease, School of Information, The University of Texas at Austin (ml@ischool.utexas.edu)

ABSTRACT
We extend the SQUARE benchmark for statistical consensus methods to include additional evaluation on two datasets from the MediaEval 2013 Crowdsourcing in Multimedia shared task. In addition to reporting shared task results, we analyze, qualitatively and quantitatively, the performance of consensus algorithms under varying supervision.

1. ALGORITHMS
We extend SQUARE [5] (ir.ischool.utexas.edu/square), a benchmark for evaluating statistical consensus algorithms, to include additional evaluation on two datasets from the MediaEval 2013 Crowdsourcing in Multimedia shared task (www.multimediaeval.org/mediaeval2013/crowd2013). The algorithms are briefly summarized below; the SQUARE paper [5] provides more detailed discussion and comparative analysis. Because crowdsourcing allows rapid generation of datasets for new tasks, for which no feature representation of inputs or automatic labeling algorithm may yet exist, we intentionally exclude hybrid methods that require automatic label generation from features.

Majority Voting (MV) uses random tie-breaking to avoid bias in the absence of an informative prior. ZenCrowd (ZC) [2] implements unsupervised EM to jointly estimate worker accuracies and labels. GLAD [8] implements unsupervised EM to jointly estimate labels and worker expertise while also modeling example difficulty. Dawid-Skene (DS) [1] and Raykar et al. (RY) [4] implement unsupervised EM to jointly estimate labels and worker confusion matrices, with RY differing from DS in modeling individual worker priors. Naive Bayes (NB) [6] implements fully supervised estimation of worker confusion. CUBAM [7] performs unsupervised MAP estimation that jointly models example difficulty and annotator-specific noise, expertise, and bias.
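To make the flavor of these estimators concrete, the sketch below gives a minimal Dawid-Skene-style EM loop for binary labels: the M-step re-estimates the class prior and per-worker confusion matrices from the current label posteriors, and the E-step recomputes the posteriors from worker responses. This is our own illustration rather than code from SQUARE or from any of the cited implementations; the function name, data layout, smoothing, and fixed iteration count are all assumptions.

```python
import numpy as np

def dawid_skene_binary(labels, n_iter=50):
    """Minimal Dawid-Skene-style EM for binary labels (illustrative sketch).

    labels: list of (worker_id, example_id, label) with label in {0, 1}.
    Returns per-example posteriors P(y = 1) and a 2x2 confusion matrix per worker.
    """
    workers = sorted({w for w, _, _ in labels})
    examples = sorted({e for _, e, _ in labels})
    w_idx = {w: i for i, w in enumerate(workers)}
    e_idx = {e: i for i, e in enumerate(examples)}
    W, E = len(workers), len(examples)

    # Initialize label posteriors with a smoothed per-example vote share.
    votes = np.zeros((E, 2))
    for _, e, l in labels:
        votes[e_idx[e], l] += 1
    post = (votes[:, 1] + 0.5) / (votes.sum(axis=1) + 1.0)

    for _ in range(n_iter):
        # M-step: class prior and worker confusion matrices (add-one smoothing).
        prior = post.mean()
        conf = np.ones((W, 2, 2))  # conf[worker, true_label, observed_label]
        for w, e, l in labels:
            p1 = post[e_idx[e]]
            conf[w_idx[w], 1, l] += p1
            conf[w_idx[w], 0, l] += 1.0 - p1
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each example's true label given its responses.
        log_p = np.zeros((E, 2))
        log_p[:, 1] += np.log(prior)
        log_p[:, 0] += np.log(1.0 - prior)
        for w, e, l in labels:
            log_p[e_idx[e], 1] += np.log(conf[w_idx[w], 1, l])
            log_p[e_idx[e], 0] += np.log(conf[w_idx[w], 0, l])
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        post = p[:, 1] / p.sum(axis=1)

    return post, conf
```

Under the light-supervision setting described in Section 2, only the class prior would be replaced by the empirical training-set distribution; under full supervision, the posteriors of gold-labeled training examples would additionally be clamped to their gold values.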
2. DATA AND EXPERIMENTAL SETUP
Data. Consensus algorithms are evaluated on the MMSys 2013 dataset and on the test fraction (20%) of the Fashion 10000 dataset. Both datasets elicit binary judgments on the same pair of tasks from Amazon Mechanical Turk (AMT) workers. Task 1 asks workers to identify whether an image is fashion-related, and Task 2 asks workers to indicate whether a desired fashion object is present. For further details of the tasks, see [3]. "Gold" data is defined for these tasks by majority voting of trusted annotators over examples with a clear majority label.

[Figure 1: Left histograms: distribution of worker accuracies (accuracy vs. % of workers). Right histograms: number of examples labeled per worker (% of questions per worker). Panels shown for MMSys 2013 Task 1 and Task 2.]

Experiment. Following the same benchmarking procedure as in SQUARE [5], we consider five degrees of supervision: 10%, 20%, 50%, 80%, and 90%. In each case we use cross-fold validation: for the 10% supervision setting, estimation uses 10% of the data for training and is evaluated on the remaining 90%; this procedure is repeated across the other nine folds, and the average performance across folds is reported. We also report unsupervised performance on 100% of the data, with evaluation limited to examples that have gold labels.

In the unsupervised setting, uninformed, task-independent hyper-parameters and class priors are unlikely to be optimal. While one might optimize these parameters by maximizing likelihood over random restarts or a grid search, we do not attempt to do so. Instead, with light supervision, we assume no examples are labeled but informative priors are provided (matching the empirical distribution of the training set). Full supervision assumes gold-labeled examples are provided. To evaluate the ZC, RY, DS, and GLAD methods under full supervision, labels are predicted for all examples (without supervision) but replaced by gold labels on the training data.
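The supervision protocol can be outlined as follows. The sketch assumes a hypothetical run_consensus callable standing in for any of the methods above, which accepts the worker labels, an optional dictionary of gold-labeled (supervised) examples, and a class prior; the rotation-based fold construction is likewise an illustrative assumption rather than the exact SQUARE driver.

```python
import numpy as np

def supervised_folds(gold, frac, n_folds=10, seed=0):
    """Yield (train_ids, eval_ids): `frac` of the gold-labeled examples supply
    supervision, and the remainder are held out for evaluation."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(gold))
    rng.shuffle(ids)
    fold_size = max(1, int(round(frac * len(ids))))
    for k in range(n_folds):
        train = set(np.roll(ids, -k * fold_size)[:fold_size].tolist())
        yield train, set(ids.tolist()) - train

def evaluate_supervision(labels, gold, frac, run_consensus):
    """Average accuracy at one supervision level `frac` under the cross-fold
    protocol described above; `run_consensus` is a hypothetical stand-in."""
    accs = []
    for train, eval_ids in supervised_folds(gold, frac):
        prior = float(np.mean([gold[e] for e in train]))  # empirical P(y = 1)
        preds = run_consensus(labels,
                              supervised={e: gold[e] for e in train},
                              class_prior=prior)
        accs.append(np.mean([preds[e] == gold[e] for e in eval_ids]))
    return float(np.mean(accs))
```

For light supervision, the same loop would pass an empty supervised dictionary and only the empirical class prior; the unsupervised setting skips the split entirely and evaluates on all gold-labeled examples.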
3. RESULTS AND DISCUSSION
MMSys 2013. Average performance over Tasks 1 and 2 is reported in Table 1. We highlight the best-scoring methods, but note that differences between methods are not necessarily statistically significant.

Table 1: Accuracy and F1 results averaged over both tasks on MMSys 2013 for varying supervision type (none, light, and full) and amount (10%, 20%, 50%, 80%, and 90%). As a simple summary measure, the final column counts the number of result columns (out of 11) in which a given method achieves the maximum value for that metric.

                        Light Supervision                   Full Supervision
Method  Metric  No Sup. 10%    20%    50%    80%    90%    10%    20%    50%    80%    90%    Count
MV      Acc     91.10   91.05  91.05  91.10  91.00  91.10  91.05  91.05  91.10  91.00  91.10  0
        F1      91.25   90.95  90.95  91.00  90.90  91.05  90.95  90.95  91.00  90.90  91.05  1
ZC      Acc     91.35   91.35  91.35  91.35  91.20  91.25  91.40  91.40  91.50  91.40  91.40  5
        F1      91.35   91.20  91.25  91.25  91.05  91.05  91.20  91.25  91.35  91.25  91.25  5
GLAD    Acc     91.25   91.25  91.25  91.35  91.30  91.10  91.35  91.30  91.45  91.45  91.40  2
        F1      91.35   91.10  91.05  91.20  91.10  90.90  91.10  91.10  91.25  91.25  91.20  1
NB      Acc     -       -      -      -      -      -      90.95  91.15  91.40  91.30  91.40  0
        F1      -       -      -      -      -      -      90.80  91.00  91.25  91.15  91.25  0
DS      Acc     91.10   90.95  90.95  90.85  90.55  89.90  91.05  91.10  91.45  91.85  91.90  2
        F1      90.95   90.75  90.70  90.60  90.25  89.60  90.85  90.90  91.20  91.65  91.70  2
RY      Acc     91.55   91.05  91.05  91.20  91.15  91.15  91.25  91.50  91.60  91.75  91.75  3
        F1      91.45   90.85  90.90  91.10  91.00  91.05  91.05  91.35  91.40  91.60  91.65  4
CUBAM   Acc     91.10   -      -      -      -      -      -      -      -      -      -      0
        F1      91.35   -      -      -      -      -      -      -      -      -      -      0

Results show close to constant performance (roughly 91-92%) across methods, metrics, and degrees of supervision. An average accuracy of 91.17% across methods, with minimal variance, is perhaps not surprising given the abundance of high-quality workers for these tasks (see Figure 1). We hypothesize three key factors: (1) response redundancy is limited to at most three, with many instances receiving only two responses (since workers could select "not sure" rather than provide a binary judgment); (2) a high percentage of the work was completed by a relatively small percentage of the workers (Figure 1); and (3) consistently high worker accuracies limit the value of weighted worker voting over simple MV.

We further observe that Tasks 1 and 2 are highly correlated. Table 2 shows the confusion matrix of gold labels across the two tasks. Evaluating Task 2 using Task 1 gold labels for supervision achieved 82.89% average accuracy and 84.98% average F1 across methods. This suggests a possibility of joint estimation across tasks that we do not investigate here.

Table 2: Confusion matrix over gold labels assigned to Task 1 (T1) and Task 2 (T2) of MMSys 2013.

          T2-Yes    T2-No
T1-Yes    36.86%     0.34%
T1-No     15.43%    47.37%

Fashion 10000. Table 3 reports F1 scores on the test fraction of Fashion 10000; since gold data was not available for the blind shared task, we present only results estimated with no supervision. MV scores best on Task 1 and CUBAM on Task 2. This contrasts with the findings on MMSys 2013, though we still observe a spread of only about 1% across methods on Task 1. Task 2, however, shows markedly superior performance of CUBAM, which remains to be investigated.

Table 3: F1 scores on the test fraction of Fashion 10000.

Method   Task 1   Task 2
MV       73.25    75.43
ZC       72.67    74.88
GLAD     72.89    75.67
DS       72.82    75.20
RY       72.87    75.01
CUBAM    73.03    77.46

Acknowledgments. This work is supported by National Science Foundation grant IIS-1253413 and by a Temple Fellowship. Opinions expressed in this work are those of the authors and do not reflect the views of the sponsors.

4. REFERENCES
[1] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[2] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proc. WWW, pages 469-478, 2012.
[3] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for Multimedia at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[4] V. C. Raykar, S. Yu, L. H. Zhao, and G. H. Valadez. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.
[5] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), 2013.
[6] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 254-263, 2008.
[7] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, pages 2424-2432, 2010.
[8] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, pages 2035-2043, 2009.