SQUARE: Benchmarking Crowd Consensus at MediaEval

Aashish Sheshadri, Department of Computer Science, The University of Texas at Austin (aashishs@cs.utexas.edu)
Matthew Lease, School of Information, The University of Texas at Austin (ml@ischool.utexas.edu)

ABSTRACT
We extend the SQUARE benchmark for statistical consensus methods to include additional evaluation on two datasets from the MediaEval 2013 Crowdsourcing in Multimedia shared task. In addition to reporting shared task results, we analyze, qualitatively and quantitatively, the performance of consensus algorithms under varying supervision.

1. ALGORITHMS
We extend SQUARE [5] (ir.ischool.utexas.edu/square), a benchmark for evaluating statistical consensus algorithms, to include additional evaluation on two datasets from the MediaEval 2013 Crowdsourcing in Multimedia shared task (www.multimediaeval.org/mediaeval2013/crowd2013). The algorithms are briefly summarized below; the SQUARE paper [5] provides more detailed discussion and comparative analysis. Because crowdsourcing allows rapid generation of datasets for new tasks, for which no feature representation of inputs or automatic labeling algorithm may yet exist, we intentionally exclude hybrid methods that require automatic label generation from features.

Majority Voting (MV) uses random tie-breaking to avoid bias in the absence of an informative prior. ZenCrowd (ZC) [2] implements unsupervised EM to jointly estimate worker accuracies and labels. GLAD [8] implements unsupervised EM to jointly estimate labels and worker expertise while also modeling example difficulty. Dawid-Skene (DS) [1] and Raykar et al. (RY) [4] implement unsupervised EM to jointly estimate labels and worker confusion matrices, with RY differing from DS in modeling individual worker priors. Naive Bayes (NB) [6] implements fully supervised estimation of worker confusion. CUBAM [7] performs unsupervised MAP estimation that jointly models example difficulty and annotator-specific noise, expertise, and bias.
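To make the flavor of these estimators concrete, the sketch below gives a minimal Dawid-Skene-style EM loop for binary labels: the M-step re-estimates the class prior and per-worker confusion matrices from the current label posteriors, and the E-step recomputes the posteriors from worker responses. This is our own illustration rather than code from SQUARE or from any of the cited implementations; the function name, data layout, smoothing, and fixed iteration count are all assumptions.

```python
import numpy as np

def dawid_skene_binary(labels, n_iter=50):
    """Minimal Dawid-Skene-style EM for binary labels (illustrative sketch).

    labels: list of (worker_id, example_id, label) with label in {0, 1}.
    Returns per-example posteriors P(y = 1) and a 2x2 confusion matrix per worker.
    """
    workers = sorted({w for w, _, _ in labels})
    examples = sorted({e for _, e, _ in labels})
    w_idx = {w: i for i, w in enumerate(workers)}
    e_idx = {e: i for i, e in enumerate(examples)}
    W, E = len(workers), len(examples)

    # Initialize label posteriors with a smoothed per-example vote share.
    votes = np.zeros((E, 2))
    for _, e, l in labels:
        votes[e_idx[e], l] += 1
    post = (votes[:, 1] + 0.5) / (votes.sum(axis=1) + 1.0)

    for _ in range(n_iter):
        # M-step: class prior and worker confusion matrices (add-one smoothing).
        prior = post.mean()
        conf = np.ones((W, 2, 2))  # conf[worker, true_label, observed_label]
        for w, e, l in labels:
            p1 = post[e_idx[e]]
            conf[w_idx[w], 1, l] += p1
            conf[w_idx[w], 0, l] += 1.0 - p1
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each example's true label given its responses.
        log_p = np.zeros((E, 2))
        log_p[:, 1] += np.log(prior)
        log_p[:, 0] += np.log(1.0 - prior)
        for w, e, l in labels:
            log_p[e_idx[e], 1] += np.log(conf[w_idx[w], 1, l])
            log_p[e_idx[e], 0] += np.log(conf[w_idx[w], 0, l])
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        post = p[:, 1] / p.sum(axis=1)

    return post, conf
```

Under the light-supervision setting described in Section 2, only the class prior would be replaced by the empirical training-set distribution; under full supervision, the posteriors of gold-labeled training examples would additionally be clamped to their gold values.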
2. DATA AND EXPERIMENTAL SETUP
Data. Consensus algorithms are evaluated on the MMSys 2013 dataset and on the test fraction (20%) of the Fashion 10000 dataset. Both datasets elicit binary judgments on the same pair of tasks from Amazon Mechanical Turk (AMT) workers. Task 1 asks workers to identify whether an image is fashion-related, and Task 2 asks workers to indicate whether a desired fashion object is present. For further details of the tasks, see [3]. "Gold" data is defined for these tasks by majority voting of trusted annotators over examples with a clear majority label.

[Figure 1: Left histograms: distribution of worker accuracies (accuracy vs. % of workers). Right histograms: number of examples labeled per worker (% of questions per worker). Panels shown for MMSys 2013 Task 1 and Task 2.]

Experiment. Following the same benchmarking procedure as in SQUARE [5], we consider five degrees of supervision: 10%, 20%, 50%, 80%, and 90%. In each case we use cross-fold validation: for the 10% supervision setting, estimation uses 10% of the data for training and is evaluated on the remaining 90%; this procedure is repeated across the other nine folds, and the average performance across folds is reported. We also report unsupervised performance on 100% of the data, with evaluation limited to examples that have gold labels.

In the unsupervised setting, uninformed, task-independent hyper-parameters and class priors are unlikely to be optimal. While one might optimize these parameters by maximizing likelihood over random restarts or a grid search, we do not attempt to do so. Instead, with light supervision, we assume no examples are labeled but informative priors are provided (matching the empirical distribution of the training set). Full supervision assumes gold-labeled examples are provided. To evaluate the ZC, RY, DS, and GLAD methods under full supervision, labels are predicted for all examples (without supervision) but replaced by gold labels on the training data.
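The supervision protocol can be outlined as follows. The sketch assumes a hypothetical run_consensus callable standing in for any of the methods above, which accepts the worker labels, an optional dictionary of gold-labeled (supervised) examples, and a class prior; the rotation-based fold construction is likewise an illustrative assumption rather than the exact SQUARE driver.

```python
import numpy as np

def supervised_folds(gold, frac, n_folds=10, seed=0):
    """Yield (train_ids, eval_ids): `frac` of the gold-labeled examples supply
    supervision, and the remainder are held out for evaluation."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(gold))
    rng.shuffle(ids)
    fold_size = max(1, int(round(frac * len(ids))))
    for k in range(n_folds):
        train = set(np.roll(ids, -k * fold_size)[:fold_size].tolist())
        yield train, set(ids.tolist()) - train

def evaluate_supervision(labels, gold, frac, run_consensus):
    """Average accuracy at one supervision level `frac` under the cross-fold
    protocol described above; `run_consensus` is a hypothetical stand-in."""
    accs = []
    for train, eval_ids in supervised_folds(gold, frac):
        prior = float(np.mean([gold[e] for e in train]))  # empirical P(y = 1)
        preds = run_consensus(labels,
                              supervised={e: gold[e] for e in train},
                              class_prior=prior)
        accs.append(np.mean([preds[e] == gold[e] for e in eval_ids]))
    return float(np.mean(accs))
```

For light supervision, the same loop would pass an empty supervised dictionary and only the empirical class prior; the unsupervised setting skips the split entirely and evaluates on all gold-labeled examples.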
3. RESULTS AND DISCUSSION
MMSys 2013. Average performance over Tasks 1 and 2 is reported in Table 1. We highlight the best-scoring methods, but note that differences between methods are not necessarily statistically significant.

Table 1: Accuracy and F1 results averaged over both tasks on MMSys 2013 for varying supervision type (none, light, and full) and amount (10%, 20%, 50%, 80%, and 90%). As a simple summary measure, the final column counts the number of result columns (out of 11) in which a given method achieves the maximum value for that metric.

                        Light Supervision                   Full Supervision
Method  Metric  No Sup. 10%    20%    50%    80%    90%    10%    20%    50%    80%    90%    Count
MV      Acc     91.10   91.05  91.05  91.10  91.00  91.10  91.05  91.05  91.10  91.00  91.10  0
        F1      91.25   90.95  90.95  91.00  90.90  91.05  90.95  90.95  91.00  90.90  91.05  1
ZC      Acc     91.35   91.35  91.35  91.35  91.20  91.25  91.40  91.40  91.50  91.40  91.40  5
        F1      91.35   91.20  91.25  91.25  91.05  91.05  91.20  91.25  91.35  91.25  91.25  5
GLAD    Acc     91.25   91.25  91.25  91.35  91.30  91.10  91.35  91.30  91.45  91.45  91.40  2
        F1      91.35   91.10  91.05  91.20  91.10  90.90  91.10  91.10  91.25  91.25  91.20  1
NB      Acc     -       -      -      -      -      -      90.95  91.15  91.40  91.30  91.40  0
        F1      -       -      -      -      -      -      90.80  91.00  91.25  91.15  91.25  0
DS      Acc     91.10   90.95  90.95  90.85  90.55  89.90  91.05  91.10  91.45  91.85  91.90  2
        F1      90.95   90.75  90.70  90.60  90.25  89.60  90.85  90.90  91.20  91.65  91.70  2
RY      Acc     91.55   91.05  91.05  91.20  91.15  91.15  91.25  91.50  91.60  91.75  91.75  3
        F1      91.45   90.85  90.90  91.10  91.00  91.05  91.05  91.35  91.40  91.60  91.65  4
CUBAM   Acc     91.10   -      -      -      -      -      -      -      -      -      -      0
        F1      91.35   -      -      -      -      -      -      -      -      -      -      0

Results show close to constant performance (roughly 91-92%) across methods, metrics, and degrees of supervision. An average accuracy of 91.17% across methods, with minimal variance, is perhaps not surprising given the abundance of high-quality workers for these tasks (see Figure 1). We hypothesize three key factors: (1) response redundancy is limited to at most three, with many instances receiving only two responses (since workers could select "not sure" rather than provide a binary judgment); (2) a high percentage of the work was completed by a relatively small percentage of the workers (Figure 1); and (3) consistently high worker accuracies limit the value of weighted worker voting over simple MV.

We further observe that Tasks 1 and 2 are highly correlated. Table 2 shows the confusion matrix of gold labels across the two tasks. Evaluating Task 2 using Task 1 gold labels for supervision achieved 82.89% average accuracy and 84.98% average F1 across methods. This suggests a possibility of joint estimation across tasks that we do not investigate here.

Table 2: Confusion matrix over gold labels assigned to Task 1 (T1) and Task 2 (T2) of MMSys 2013.

          T2-Yes    T2-No
T1-Yes    36.86%     0.34%
T1-No     15.43%    47.37%

Fashion 10000. Table 3 reports F1 scores on the test fraction of Fashion 10000; since gold data was not available for the blind shared task, we present only results estimated with no supervision. MV scores best on Task 1 and CUBAM on Task 2. This contrasts with the findings on MMSys 2013, though we still observe a spread of only about 1% across methods on Task 1. Task 2, however, shows markedly superior performance of CUBAM, which remains to be investigated.

Table 3: F1 scores on the test fraction of Fashion 10000.

Method   Task 1   Task 2
MV       73.25    75.43
ZC       72.67    74.88
GLAD     72.89    75.67
DS       72.82    75.20
RY       72.87    75.01
CUBAM    73.03    77.46

Acknowledgments. This work is supported by National Science Foundation grant IIS-1253413 and by a Temple Fellowship. Opinions expressed in this work are those of the authors and do not reflect the views of the sponsors.

4. REFERENCES
[1] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[2] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proc. WWW, pages 469-478, 2012.
[3] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for Multimedia at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[4] V. C. Raykar, S. Yu, L. H. Zhao, and G. H. Valadez. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.
[5] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), 2013.
[6] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 254-263, 2008.
[7] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, pages 2424-2432, 2010.
[8] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, pages 2035-2043, 2009.