L3S at MediaEval 2013 Crowdsourcing for Social Multimedia Task

Mihai Georgescu, Xiaofei Zhu
L3S Research Center, Leibniz Universität Hannover
Appelstr. 9a, 30167 Hanover, Germany
{georgescu,zhu}@l3s.de

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

In this paper we present the results of our initial research on aggregating noisy crowdsourced labels using a modified version of the EM algorithm introduced in [1]. We propose different methods of estimating the worker confidence, a measure that indicates how well a worker performs the task, and of integrating it into the computation of the aggregated label. Furthermore, we introduce a novel method of computing the worker confidence that uses the soft aggregated labels. To assess the effectiveness of the proposed methods, we experiment on the MediaEval 2013 Crowdsourcing for Social Multimedia Task dataset.

1. INTRODUCTION

In this paper we detail the methods proposed for the MediaEval 2013 Crowdsourcing for Social Multimedia Task [5]. The methods apply the EM approach from [1] to infer labels from multiple, possibly noisy labels, assuming that no authoritative ground truth is available, and estimate both the accuracy of the workers and the actual labels from the crowdsourced assessments.

A similar approach was used for building probabilistic models [4] to label images using crowdsourcing, for identifying systematic errors made by crowd workers [3], and for crowdsourcing document relevance judgements [2].

In our methods the error rate is replaced by the worker confidence, which is used as the weight of a worker's contribution in the computation of the aggregated label. We attempt to improve the standard EM method by boosting the worker confidence in different ways, and we also propose a novel method for computing it: the soft evaluation of the worker confidence, in which the soft aggregated crowd decision is taken into account instead of the hard aggregated label.

2. APPROACH

In this section we detail the computation of the aggregated decision of a crowd for the label of an instance i, L^i_crowd (i.e., Yes or No), and of the worker confidence. We distinguish between two types of worker confidence, depending on whether or not we discriminate between the quality of positive and negative answers. With such a discrimination, each worker u is characterized by a positive confidence C^+_u and a negative confidence C^-_u; otherwise we use a single value for the worker confidence, C^*_u. Majority voting corresponds to C_u = 1. L^i_crowd is computed by aggregating the individual worker labels L^i_u ∈ {Yes, No}, ignoring Not Sure labels.

In the E step we compute the aggregated crowd labels using Eq. 2 when discriminating between positive and negative answer quality and Eq. 1 otherwise; in the M step we update the worker confidences as defined in Eq. 3 or Eq. 4.

2.1 Aggregated Crowd Labels

If we do not discriminate between positive and negative answer quality, the probability of an instance being labeled as positive is:

    p^+_i = Σ_u C^*_u · I(L^i_u = Yes) / ( Σ_u C^*_u · I(L^i_u = Yes) + Σ_u C^*_u · I(L^i_u = No) )    (1)

If we differentiate between positive and negative answer quality, this becomes:

    p^+_i = Σ_u C^+_u · I(L^i_u = Yes) / ( Σ_u C^+_u · I(L^i_u = Yes) + Σ_u C^-_u · I(L^i_u = No) )    (2)

The probability of an instance being labeled as negative is then p^-_i = 1 - p^+_i. We refer to p^+_i and p^-_i, as computed by either method, as aggregated soft labels. The final aggregated hard label assigned by the crowd is obtained by comparing the positive probability with the negative one:

    L^i_crowd = Yes if p^+_i - p^-_i ≥ 0;  No if p^+_i - p^-_i < 0

2.2 Worker Confidence Computation

The undiscriminative confidence in a worker is defined as:

    C^*_u = (tp_u + tn_u) / (tp_u + tn_u + fp_u + fn_u)    (3)

If we discriminate between the quality of positive and negative answers, we use two types of confidence:

    C^+_u = tp_u / (tp_u + fp_u);  C^-_u = tn_u / (tn_u + fn_u)    (4)

We distinguish between two types of evaluation of the worker confidence: hard evaluation, which uses only the final aggregated hard labels, and soft evaluation, which uses the aggregated soft labels.

For a hard evaluation of the performance of a worker we use the following definitions:

    tp_u = Σ_i I(L^i_u = Yes) · I(L^i_crowd = Yes)
    tn_u = Σ_i I(L^i_u = No) · I(L^i_crowd = No)
    fp_u = Σ_i I(L^i_u = Yes) · I(L^i_crowd = No)
    fn_u = Σ_i I(L^i_u = No) · I(L^i_crowd = Yes)

For a soft evaluation of the worker confidence we use the following definitions:

    tp_u = Σ_i I(L^i_u = Yes) · p^+_i ;  tn_u = Σ_i I(L^i_u = No) · p^-_i
    fp_u = Σ_i I(L^i_u = Yes) · p^-_i ;  fn_u = Σ_i I(L^i_u = No) · p^+_i

2.3 Worker Confidence Correction

Furthermore, we can apply the following corrections to the confidence when aggregating the multiple votes: boosting the confidence (Ĉ_u = boost(Č_u)), or involving the worker's self-declared familiarity with the category for which Label 2 is assigned to the image (fami_u) in the computation of the confidence (Č_u = C_u · norm(fami_u)). Based on an observed correlation between familiarity, the type of answers, and their accuracy, we can also use a familiarity correction strategy:

    Ĉ_u = 0.6 if fami_u < 3 and L^i_u = Yes
    Ĉ_u = 0.9 if fami_u < 3 and L^i_u = No
    Ĉ_u = 0.8 if fami_u > 3 and L^i_u = Yes
    Ĉ_u = 0.8 if fami_u > 3 and L^i_u = No

The boosting function boost(x) can be e^x or x^p, p ∈ R. The norm(x) function maps the familiarity, an integer between 1 and 7 or missing, to a positive real number in the unit interval: norm(x) = (x - 1)/6 if x ∈ N, and 0.5 if missing.

2.4 Method Settings

The computation of the labels in the EM algorithm, as well as of the final decisions after the iterations are finished, depends on the following settings:

• the use of positive/negative answer discrimination
• the evaluation of worker confidences using soft labels
• the boosting type
• the use of familiarity in the computation
• the use of the familiarity correction

To pick candidates for the submitted runs and to find the best settings, we evaluated the performance of our methods on the MMSys 2013 dataset. The settings selected for the submitted runs are detailed in Table 1. The first two runs use the discrimination between the positive and negative worker confidence. Run1 uses the EM algorithm with hard iterations for both labels. Run2 represents the EM algorithm using soft iterations for both labels, without any special boosting strategy or use of familiarity.

    R  L |     EM decision      | Final decision |  F1
         | S  PN  B     F   FC  | B    F   FC    |
    1  1 | -  X   x1    -   -   | x1   -   -     | 0.895
    1  2 | -  X   x1    -   -   | x1   -   -     | 0.909
    2  1 | X  X   x1    -   -   | x1   -   -     | 0.894
    2  2 | X  X   x1    -   -   | x1   -   -     | 0.911
    3  1 | -  -   x0.5  X   -   | x20  -   -     | 0.900
    3  2 | -  -   x2    X   X   | x2   -   -     | 0.913
    4  1 | X  X   x3    X   X   | x1   -   -     | 0.898
    4  2 | -  -   x2    X   -   | x2   -   -     | 0.913
    5  1 | X  X   e^x   -   -   | e^x  -   -     | 0.894
    5  2 | X  X   x2    X   X   | x2   -   -     | 0.913

Table 1: Settings for each submission run (R) and label (L): use of soft labels in the worker confidence calculation (S), discrimination between positive and negative answer quality (PN), boosting type (B), use of familiarity in the computation (F), and familiarity correction (FC), both for the decision during the EM iterations and for the final decision, along with the F1 measure achieved on the MMSys 2013 dataset.

3. RESULTS

The performance of each submission in terms of the F1 measure is presented in Table 2.

    Submission  Label 1  Label 2
    Run1        0.7328   0.7533
    Run2        0.7340   0.7412
    Run3        0.7264   0.7592
    Run4        0.7263   0.7391
    Run5        0.7346   0.7371

Table 2: Performance of each submission

As in the experiments carried out on the MMSys 2013 dataset, we notice a better performance in the case of the second label. For the first label the best performance is achieved by Run5, and for the second label by Run3. We notice that in the case of Label 1, discriminating between positive and negative label quality provides a performance increase, while in the case of Label 2 the effect is the opposite.

Acknowledgments

This work was partially funded by the European Commission FP7 under grant agreement No. 287704 for the CUbRIK project.

4. REFERENCES

[1] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[2] M. Hosseini, I. J. Cox, N. Milić-Frayling, G. Kazai, and V. Vinay. On aggregating labels from multiple crowd workers to infer relevance of documents. In Advances in Information Retrieval, pages 182-194. Springer, 2012.
[3] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64-67. ACM, 2010.
[4] G. Kasneci, J. Van Gael, D. Stern, and T. Graepel. CoBayes: Bayesian knowledge corroboration with assessors of unknown areas of expertise. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 465-474. ACM, 2011.
[5] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for social multimedia at MediaEval 2013: Challenges, data set, and evaluation. In MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain, 2013.
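To make the aggregation procedure of Section 2 concrete, the following minimal Python sketch implements the EM loop with the undiscriminative confidence (Eq. 1 and Eq. 3) and hard evaluation. All function and variable names (aggregate, labels, conf) are ours, not from the paper, and Not Sure votes are assumed to have been filtered out beforehand.

```python
# Minimal sketch of the EM aggregation loop (Section 2), using the
# undiscriminative confidence (Eq. 1 / Eq. 3) and hard evaluation.
# "labels" maps instance -> {worker: "Yes" / "No"}; "Not Sure" votes
# are assumed to have been dropped already.

def aggregate(labels, iterations=10):
    workers = {u for votes in labels.values() for u in votes}
    conf = {u: 1.0 for u in workers}  # C_u = 1 reproduces majority voting

    for _ in range(iterations):
        # E step: soft labels p+_i via Eq. 1
        p_pos = {}
        for i, votes in labels.items():
            yes = sum(conf[u] for u, l in votes.items() if l == "Yes")
            no = sum(conf[u] for u, l in votes.items() if l == "No")
            p_pos[i] = yes / (yes + no) if yes + no > 0 else 0.5

        # Hard crowd label: Yes iff p+ - p- >= 0, i.e. p+ >= 0.5
        crowd = {i: ("Yes" if p >= 0.5 else "No") for i, p in p_pos.items()}

        # M step: worker confidence via Eq. 3 (hard evaluation)
        for u in workers:
            tp = tn = fp = fn = 0
            for i, votes in labels.items():
                if u not in votes:
                    continue
                if votes[u] == "Yes":
                    tp += crowd[i] == "Yes"
                    fp += crowd[i] == "No"
                else:
                    tn += crowd[i] == "No"
                    fn += crowd[i] == "Yes"
            total = tp + tn + fp + fn
            conf[u] = (tp + tn) / total if total else 1.0

    return crowd, conf
```

On a toy input where two workers agree and a third always dissents, the loop drives the dissenting worker's confidence to zero while keeping the majority's crowd labels.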
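The confidence corrections of Section 2.3 can be sketched in the same way. The function names (norm, boost, corrected_confidence) and the string encoding of the boosting type (e.g. "x2" for x^2, "ex" for e^x) are our own conventions; the paper fixes only the formulas.

```python
import math

# Sketch of the familiarity normalization and confidence boosting
# from Section 2.3. Names and the boosting-type strings are ours.

def norm(fami):
    """Map familiarity (integer 1..7, or None if missing) into [0, 1]:
    norm(x) = (x - 1) / 6, and 0.5 when the value is missing."""
    return 0.5 if fami is None else (fami - 1) / 6

def boost(x, kind="x1"):
    """Boosting function: 'ex' -> e^x; 'x<p>' -> x^p (e.g. 'x2', 'x0.5')."""
    if kind == "ex":
        return math.exp(x)
    return x ** float(kind[1:])

def corrected_confidence(c_u, fami, kind="x1"):
    # C'_u = C_u * norm(fami_u), then boosted: C_hat = boost(C'_u)
    return boost(c_u * norm(fami), kind)
```

With the identity boosting "x1" and full familiarity (7), the corrected confidence equals the raw confidence; lower or missing familiarity scales it down before boosting.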