L3S at MediaEval 2013 Crowdsourcing for Social Multimedia Task

Mihai Georgescu, Xiaofei Zhu
L3S Research Center, Leibniz Universität Hannover
Appelstr. 9a, 30167 Hanover, Germany
{georgescu,zhu}@l3s.de

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

In this paper we present the results of our initial research on aggregating noisy crowdsourced labels using a modified version of the EM algorithm introduced in [1]. We propose different methods of estimating the worker confidence, a measure that indicates how well a worker performs the task, and of integrating it into the computation of the aggregated label. Furthermore, we introduce a novel method of computing the worker confidence that uses the soft aggregated labels. To assess the effectiveness of the proposed methods, we experiment on the MediaEval 2013 Crowdsourcing for Social Multimedia Task dataset.

1. INTRODUCTION

In this paper we detail the methods proposed for the MediaEval 2013 Crowdsourcing for Social Multimedia Task [5]. The methods apply the EM approach from [1] to infer labels from multiple, possibly noisy labels, assuming that no authoritative ground truth is available, and estimate both the accuracy of the workers and the actual labels from the crowdsourced assessments.

A similar approach was used for building probabilistic models [4] to label images using crowdsourcing, for identifying systematic errors made by crowd workers [3], and for crowdsourcing document relevance judgements [2].

In our methods the error rate is replaced by the worker confidence, which is used as the weight of a worker's contribution in the computation of the aggregated label. We attempt to improve the standard EM method by boosting the worker confidence in different ways, and we also propose a novel method for computing it: the soft evaluation of the worker confidence, in which the soft aggregated crowd decision is taken into account instead of the hard aggregated label.

2. APPROACH

In this section we detail the computation of the aggregated decision of a crowd for the label of an instance i, L^i_crowd (i.e., Yes or No), and of the worker confidence. We distinguish between two types of worker confidence, depending on whether or not we discriminate between the quality of positive and negative answers. With such a discrimination, each worker u is characterized by a positive confidence C^+_u and a negative confidence C^-_u; otherwise we use a single value for the worker confidence, C^*_u. Majority voting corresponds to C_u = 1. L^i_crowd is computed by aggregating the individual worker labels L^i_u ∈ {Yes, No}, ignoring Not Sure labels.

In the E step we compute the aggregated crowd labels using Eq. 2 when discriminating between positive and negative answer quality and Eq. 1 otherwise; in the M step we update the worker confidences as defined in Eq. 3 or Eq. 4.

2.1 Aggregated Crowd Labels

If we do not discriminate between positive and negative answer quality, the probability of an instance being labeled as positive is:

    p^+_i = Σ_u C^*_u · I(L^i_u = Yes) / ( Σ_u C^*_u · I(L^i_u = Yes) + Σ_u C^*_u · I(L^i_u = No) )    (1)

If we differentiate between positive and negative answer quality, this becomes:

    p^+_i = Σ_u C^+_u · I(L^i_u = Yes) / ( Σ_u C^+_u · I(L^i_u = Yes) + Σ_u C^-_u · I(L^i_u = No) )    (2)

The probability of an instance being labeled as negative is then p^-_i = 1 - p^+_i. We refer to p^+_i and p^-_i, as computed by either method, as aggregated soft labels. The final aggregated hard label assigned by the crowd is obtained by comparing the positive probability with the negative one:

    L^i_crowd = Yes if p^+_i - p^-_i ≥ 0;  No if p^+_i - p^-_i < 0

2.2 Worker Confidence Computation

The undiscriminative confidence in a worker is defined as:

    C^*_u = (tp_u + tn_u) / (tp_u + tn_u + fp_u + fn_u)    (3)

If we discriminate between the quality of positive and negative answers, we use two types of confidence:

    C^+_u = tp_u / (tp_u + fp_u);  C^-_u = tn_u / (tn_u + fn_u)    (4)

We distinguish between two types of evaluation of the worker confidence: hard evaluation, which uses only the final aggregated hard labels, and soft evaluation, which uses the aggregated soft labels.

For a hard evaluation of the performance of a worker we use the following definitions:

    tp_u = Σ_i I(L^i_u = Yes) · I(L^i_crowd = Yes)
    tn_u = Σ_i I(L^i_u = No) · I(L^i_crowd = No)
    fp_u = Σ_i I(L^i_u = Yes) · I(L^i_crowd = No)
    fn_u = Σ_i I(L^i_u = No) · I(L^i_crowd = Yes)

For a soft evaluation of the worker confidence we use the following definitions:

    tp_u = Σ_i I(L^i_u = Yes) · p^+_i ;  tn_u = Σ_i I(L^i_u = No) · p^-_i
    fp_u = Σ_i I(L^i_u = Yes) · p^-_i ;  fn_u = Σ_i I(L^i_u = No) · p^+_i

2.3 Worker Confidence Correction

Furthermore, we can apply the following corrections to the confidence when aggregating the multiple votes: boosting the confidence (Ĉ_u = boost(Č_u)), or involving the worker's self-declared familiarity with the category for which Label 2 is assigned to the image (fami_u) in the computation of the confidence (Č_u = C_u · norm(fami_u)). Based on an observed correlation between familiarity, the type of answers, and their accuracy, we can also use a familiarity correction strategy:

    Ĉ_u = 0.6 if fami_u < 3 and L^i_u = Yes
    Ĉ_u = 0.9 if fami_u < 3 and L^i_u = No
    Ĉ_u = 0.8 if fami_u > 3 and L^i_u = Yes
    Ĉ_u = 0.8 if fami_u > 3 and L^i_u = No

The boosting function boost(x) can be e^x or x^p, p ∈ R. The norm(x) function maps the familiarity, an integer between 1 and 7 or missing, to a positive real number in the unit interval: norm(x) = (x - 1)/6 if x ∈ N, and 0.5 if missing.

2.4 Method Settings

The computation of the labels in the EM algorithm, as well as of the final decisions after the iterations are finished, depends on the following settings:

• the use of positive/negative answer discrimination
• the evaluation of worker confidences using soft labels
• the boosting type
• the use of familiarity in the computation
• the use of the familiarity correction

To pick candidates for the submitted runs and to find the best settings, we evaluated the performance of our methods on the MMSys 2013 dataset. The settings selected for the submitted runs are detailed in Table 1. The first two runs use the discrimination between the positive and negative worker confidence. Run1 uses the EM algorithm with hard iterations for both labels. Run2 represents the EM algorithm using soft iterations for both labels, without any special boosting strategy or use of familiarity.

    R  L |     EM decision      | Final decision |  F1
         | S  PN  B     F   FC  | B    F   FC    |
    1  1 | -  X   x1    -   -   | x1   -   -     | 0.895
    1  2 | -  X   x1    -   -   | x1   -   -     | 0.909
    2  1 | X  X   x1    -   -   | x1   -   -     | 0.894
    2  2 | X  X   x1    -   -   | x1   -   -     | 0.911
    3  1 | -  -   x0.5  X   -   | x20  -   -     | 0.900
    3  2 | -  -   x2    X   X   | x2   -   -     | 0.913
    4  1 | X  X   x3    X   X   | x1   -   -     | 0.898
    4  2 | -  -   x2    X   -   | x2   -   -     | 0.913
    5  1 | X  X   e^x   -   -   | e^x  -   -     | 0.894
    5  2 | X  X   x2    X   X   | x2   -   -     | 0.913

Table 1: Settings for each submission run (R) and label (L): use of soft labels in the worker confidence calculation (S), discrimination between positive and negative answer quality (PN), boosting type (B), use of familiarity in the computation (F), and familiarity correction (FC), both for the decision during the EM iterations and for the final decision, along with the F1 measure achieved on the MMSys 2013 dataset.

3. RESULTS

The performance of each submission in terms of the F1 measure is presented in Table 2.

    Submission  Label 1  Label 2
    Run1        0.7328   0.7533
    Run2        0.7340   0.7412
    Run3        0.7264   0.7592
    Run4        0.7263   0.7391
    Run5        0.7346   0.7371

Table 2: Performance of each submission

As in the experiments carried out on the MMSys 2013 dataset, we notice a better performance in the case of the second label. For the first label the best performance is achieved by Run5, and for the second label by Run3. We notice that in the case of Label 1, discriminating between positive and negative label quality provides a performance increase, while in the case of Label 2 the effect is the opposite.

Acknowledgments

This work was partially funded by the European Commission FP7 under grant agreement No. 287704 for the CUbRIK project.

4. REFERENCES

[1] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[2] M. Hosseini, I. J. Cox, N. Milić-Frayling, G. Kazai, and V. Vinay. On aggregating labels from multiple crowd workers to infer relevance of documents. In Advances in Information Retrieval, pages 182-194. Springer, 2012.
[3] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64-67. ACM, 2010.
[4] G. Kasneci, J. Van Gael, D. Stern, and T. Graepel. CoBayes: Bayesian knowledge corroboration with assessors of unknown areas of expertise. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 465-474. ACM, 2011.
[5] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. Crowdsourcing for social multimedia at MediaEval 2013: Challenges, data set, and evaluation. In MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain, 2013.
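To make the aggregation procedure of Section 2 concrete, the following minimal Python sketch implements the EM loop with the undiscriminative confidence (Eq. 1 and Eq. 3) and hard evaluation. All function and variable names (aggregate, labels, conf) are ours, not from the paper, and Not Sure votes are assumed to have been filtered out beforehand.

```python
# Minimal sketch of the EM aggregation loop (Section 2), using the
# undiscriminative confidence (Eq. 1 / Eq. 3) and hard evaluation.
# "labels" maps instance -> {worker: "Yes" / "No"}; "Not Sure" votes
# are assumed to have been dropped already.

def aggregate(labels, iterations=10):
    workers = {u for votes in labels.values() for u in votes}
    conf = {u: 1.0 for u in workers}  # C_u = 1 reproduces majority voting

    for _ in range(iterations):
        # E step: soft labels p+_i via Eq. 1
        p_pos = {}
        for i, votes in labels.items():
            yes = sum(conf[u] for u, l in votes.items() if l == "Yes")
            no = sum(conf[u] for u, l in votes.items() if l == "No")
            p_pos[i] = yes / (yes + no) if yes + no > 0 else 0.5

        # Hard crowd label: Yes iff p+ - p- >= 0, i.e. p+ >= 0.5
        crowd = {i: ("Yes" if p >= 0.5 else "No") for i, p in p_pos.items()}

        # M step: worker confidence via Eq. 3 (hard evaluation)
        for u in workers:
            tp = tn = fp = fn = 0
            for i, votes in labels.items():
                if u not in votes:
                    continue
                if votes[u] == "Yes":
                    tp += crowd[i] == "Yes"
                    fp += crowd[i] == "No"
                else:
                    tn += crowd[i] == "No"
                    fn += crowd[i] == "Yes"
            total = tp + tn + fp + fn
            conf[u] = (tp + tn) / total if total else 1.0

    return crowd, conf
```

On a toy input where two workers agree and a third always dissents, the loop drives the dissenting worker's confidence to zero while keeping the majority's crowd labels.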
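The confidence corrections of Section 2.3 can be sketched in the same way. The function names (norm, boost, corrected_confidence) and the string encoding of the boosting type (e.g. "x2" for x^2, "ex" for e^x) are our own conventions; the paper fixes only the formulas.

```python
import math

# Sketch of the familiarity normalization and confidence boosting
# from Section 2.3. Names and the boosting-type strings are ours.

def norm(fami):
    """Map familiarity (integer 1..7, or None if missing) into [0, 1]:
    norm(x) = (x - 1) / 6, and 0.5 when the value is missing."""
    return 0.5 if fami is None else (fami - 1) / 6

def boost(x, kind="x1"):
    """Boosting function: 'ex' -> e^x; 'x<p>' -> x^p (e.g. 'x2', 'x0.5')."""
    if kind == "ex":
        return math.exp(x)
    return x ** float(kind[1:])

def corrected_confidence(c_u, fami, kind="x1"):
    # C'_u = C_u * norm(fami_u), then boosted: C_hat = boost(C'_u)
    return boost(c_u * norm(fami), kind)
```

With the identity boosting "x1" and full familiarity (7), the corrected confidence equals the raw confidence; lower or missing familiarity scales it down before boosting.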