Work Like a Bee - Taking Advantage of Diligent Crowdsourcing Workers

Michael Riegler, Preben N. Olsen, Pål Halvorsen
Simula Research Laboratory AS
michael@simula.no, preben@simula.no, paalh@simula.no

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This paper presents our approach for the Crowdsourcing Task of the MediaEval 2014 Benchmark. The proposed solution is based on the assumption that the number of Human Intelligence Tasks (HITs) completed by a worker is representative of his diligence, making workers who complete high volumes of work more reliable than low-performing workers. Our approach gives a baseline evaluation indicating the usefulness of looking at the number of tasks completed by a worker.

1. INTRODUCTION
Crowdsourcing creates many opportunities and is gaining momentum as an area of interest within the multimedia community. Moreover, current web-based services like Amazon Mechanical Turk, Microworkers, and Crowdflower have simplified the task of leveraging the power of human computation.

The biggest problem in crowdsourcing is still the reliability of the workers. The information we receive using crowdsourcing is unreliable because of workers who try to trick the system, spam, or simply do not understand the task properly. The law of large numbers (LLN) describes how noise is averaged out and its effects are removed with a large number of experiments, but increasing the number of experiments directly affects costs. This is why the crowdsourcing exercise for the Crowdsorting Timed Comments Task this year focuses on computing correct labels based on noisy crowdsourced metadata or content information.

Related work in this area can, for example, be found in [4, 3]. These approaches try to calculate the correctness of the workers or use features of the media files, such as global image features, for a classification.

In contrast, the solution proposed here is based on the assumption that workers who complete a high number of tasks are high performers, either because they enjoy the task or because they understand the task well enough to do it efficiently. We believe that both of these circumstances lead to reliable results with respect to HITs. As a secondary approach, we also used labels collected from additional crowdsourcing workers, which means that we asked new workers to redo HITs where the original workers could not come to an agreement.

2. APPROACHES
This section describes our two approaches. As mentioned, our main approach is to find the most diligent workers, while the second approach is based on the idea of collecting additional crowdsourcing votes. Quality control is a prerequisite of a well-designed crowdsourcing HIT, and to increase the quality of votes for this work, the task organizers included a qualification HIT to make sure that workers understood the task at hand. As the main task was to classify drops in music tracks, the workers had to prove that they could classify a drop correctly. Only the workers who passed the qualification HIT were allowed to continue. Because of this pre-quality control, we did not perform any additional quality control.

2.1 Diligent Workers
The idea of diligent workers is based on the work presented by Kazai et al. [2], which describes five different types of workers: (1) diligent, (2) competent, (3) sloppy, (4) incompetent, and (5) spammers. Diligent workers are identified by the number of completed HITs they produce for a particular task. The authors also state that most of the HITs are done by the same group of workers: the distribution of HITs over workers follows a power law, and around 54% of the workers on a crowdsourcing task complete only a single HIT. An important insight from this work is that diligent workers can be identified by the number of HITs per task. After comparing the number of completed HITs per worker, we chose a subset of diligent workers. The size of this subset is chosen based on the overall distribution of performed HITs between all workers. Experiments on a development set showed that keeping the top 30% of workers leads to a good result. This subset then represents diligent workers who can be trusted, and their votes can be used in different ways, e.g., given a higher weight or considered exclusively.
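To make the selection step concrete, the following minimal Python sketch ranks workers by the number of HITs they completed and keeps the top share of them. It assumes the votes are available as (worker_id, segment_id, label) tuples; the function name and the 30% default are illustrative, not part of any released task code.

    from collections import Counter

    def diligent_subset(votes, fraction=0.30):
        """Return the IDs of the most diligent workers.

        votes: iterable of (worker_id, segment_id, label) tuples.
        fraction: share of workers, ranked by completed HITs, to keep;
                  0.30 follows the development-set experiments above.
        """
        hits_per_worker = Counter(worker for worker, _, _ in votes)
        ranked = [worker for worker, _ in hits_per_worker.most_common()]
        cutoff = max(1, int(len(ranked) * fraction))
        return set(ranked[:cutoff])

In practice, the cut-off fraction would be tuned on a development set against the overall distribution of HITs per worker, as described above.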
2.2 Additional Crowdsourcing
For the HITs without a clear result through the majority vote between the three provided workers, or through the weighted subset of the best-performing workers, we used additional crowdsourcing. We developed an HTML- and SQL-based platform that gave us the opportunity to perform these tests in our lab. The requirement for this additional test was that the participants had to try their best to find the right answer for the HIT.

3. EXPERIMENTAL SETUP
The provided dataset contains 591 songs, metadata, and labels generated by human computation, but because some of the songs are duplicates, only 537 of them were used in the evaluation. The task's main goal was to classify a drop in music within a limited timespan. A drop can be seen as an event that builds up to a change of the beat or melody in the song, i.e., a characteristic also found in electronic dance music, and is more than just a simple change. Workers could give three different labels to each song segment: (1) the segment contains a complete drop, (2) the segment only contains a partial drop, and (3) the segment contains no drop [1].

We assessed four different methods executed in four runs. A summarized overview and a short description of each method are provided in Table 1. The first method (R1) considers the majority vote (MJV) between the three votes provided by the original dataset, with additional crowdsourcing for unclear answers. In the second run (R2), we only consider the votes provided by our diligent workers subset. Our third method (R3) takes the majority vote into account, but adds a higher weight to votes provided by diligent workers. The fourth and last method (R4) uses the results provided by R3, but with additional crowdsourcing for ambiguous answers (where MJV could not clearly lead to a label).

Table 1: Configuration of the four different methods evaluated.
Run   Description
R1    MJV with additional crowdsourcing
R2    Diligent workers vote only
R3    MJV with weighted diligent workers
R4    R3 with additional crowdsourcing
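As an illustration of how the aggregation schemes in Table 1 can be realized, the sketch below computes, per segment, a plain majority vote (R1 before additional crowdsourcing), a diligent-workers-only vote (R2), and a weighted majority vote (R3). Ties are returned as None and would be passed on to additional crowdsourcing (R1/R4). The data layout and the weight value are assumptions made for the example; the paper does not prescribe them.

    from collections import Counter, defaultdict

    def aggregate_labels(votes, diligent_workers, weight=2.0):
        """Per-segment label aggregation illustrating R1-R3.

        votes: iterable of (worker_id, segment_id, label) tuples.
        diligent_workers: set of worker IDs treated as diligent.
        weight: hypothetical extra weight for diligent votes (R3).
        Returns: segment_id -> (mjv, diligent_only, weighted) labels,
        where None marks a tie to be resolved by extra crowdsourcing.
        """
        per_segment = defaultdict(list)
        for worker, segment, label in votes:
            per_segment[segment].append((worker, label))

        def winner(counts):
            # Single most voted label, or None on a tie / no votes.
            top = counts.most_common(2)
            if not top or (len(top) > 1 and top[0][1] == top[1][1]):
                return None
            return top[0][0]

        results = {}
        for segment, pairs in per_segment.items():
            mjv = winner(Counter(label for _, label in pairs))
            diligent_only = winner(Counter(label for w, label in pairs
                                           if w in diligent_workers))
            weighted = Counter()
            for w, label in pairs:
                weighted[label] += weight if w in diligent_workers else 1.0
            results[segment] = (mjv, diligent_only, winner(weighted))
        return results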
4. RESULTS
Table 2 shows our benchmark results, while Table 3 shows the results for the most frequent class (MFC) baseline, in which all segments are assigned the most frequent class label in the dataset. Performance is measured by the weighted harmonic mean of precision and recall (WF1-score). This is done to avoid unreliable results caused by the imbalance of the classes.

Table 2: MediaEval 2014 Benchmark results.
Run   WF1-score   True Labels     Predicted Labels
R1    0.7207      183, 63, 291    192, 68, 279
R2    0.6919      183, 63, 291    208, 95, 234
R3    0.6912      183, 63, 291    208, 87, 242
R4    0.6919      183, 63, 291    208, 95, 234

Table 3: Most frequent class baseline for the given dataset.
Baseline   WF1-score   True Labels     Predicted Labels
MFC        0.3809      183, 63, 291    0, 0, 537

We see from Table 2 that every evaluated method outperforms the most frequent class baseline by at least 30%. The best-performing method is R1, with a WF1-score of 0.7207. Compared to R1, the three other methods show a performance drop of around 3%. These methods are not distinguishable with respect to the results they produce, which might be because each of them relies on the votes provided by the subset of diligent workers. We find it interesting that R3 and R4, which complement the diligent workers with MJV and additional crowdsourcing, do not significantly increase performance compared to R2.

Moreover, the performance difference between R1 and R2 is low, which strongly indicates that the assumption that workers who complete the majority of crowdsourcing tasks also perform better is valid. This is a promising insight that can cut costs and yield more accurate crowdsourcing results. For example, by identifying diligent workers early in task execution, one can annotate their votes and only consider them as in R2, or weight their votes differently as in R3. That said, we also want to point out that there is a chance our results are dataset specific, and further investigations on multiple and larger datasets are needed.

At last, we want to highlight that additional crowdsourcing does not increase accuracy when diligent workers are considered. This indicates that the quality of work and worker motivation are more important than the number of workers used or votes gathered.
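Closing this section, and as a reproducibility aid, the WF1-scores reported above can be approximated with the weighted F1 implementation in scikit-learn. This is a sketch under that assumption, not the official scoring script of the task.

    from collections import Counter
    from sklearn.metrics import f1_score

    def wf1_and_baseline(true_labels, predicted_labels):
        """Weighted F1 (WF1) of the predictions plus the score of the
        most-frequent-class (MFC) baseline, as in Tables 2 and 3."""
        wf1 = f1_score(true_labels, predicted_labels, average="weighted")
        mfc_label = Counter(true_labels).most_common(1)[0][0]
        mfc_wf1 = f1_score(true_labels, [mfc_label] * len(true_labels),
                           average="weighted")
        return wf1, mfc_wf1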
5. CONCLUSION
This paper presents two approaches for classifying drops in electronic dance music segments by utilizing human computation and crowdsourcing. The results and insights gained by evaluating four different methods indicate that the proposed approach, which assumes that diligent workers also provide better work quality, is promising. Our investigation also indicates that additional crowdsourcing does not improve results when diligent workers are considered.

For assurance and increased certainty, we recognize the need to extend the work to multiple and larger datasets. Additional future work includes pairing crowdsourcing results with computer-generated content analysis and further classification of diligent workers.

6. ACKNOWLEDGMENT
This work has been funded by the NFR-funded FRINATEK project "Efficient Execution of Large Workloads on Elastic Heterogeneous Resources" (EONS) (project number 231687) and the iAD center for Research-based Innovation (project number 174867) funded by the Norwegian Research Council.

7. REFERENCES
[1] K. Yadati, P. S. N. Chandrasekaran Ayyanathan, and M. Larson. Crowdsorting timed comments about music: Foundations for a new crowdsourcing task. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] G. Kazai, J. Kamps, and N. Milic-Frayling. Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1941-1944. ACM, 2011.
[3] B. Loni, J. Hare, M. Georgescu, M. Riegler, X. Zhu, M. Morchid, R. Dufour, and M. Larson. Getting by with a little help from the crowd: Practical approaches to social image labeling. In CrowdMM '14, November 3-7, 2014, Orlando, FL, USA. ACM, 2014.
[4] M. Riegler, M. Lux, and C. Kofler. Frame the crowd: Global visual features labeling boosted with crowdsourcing information. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.