Aggregating Crowdsourced Image Segmentations

Doris Jung-Lin Lee, University of Illinois, Urbana-Champaign, jlee782@illinois.edu
Akash Das Sarma, Facebook, Inc., akashds@fb.com
Aditya Parameswaran, University of Illinois, Urbana-Champaign, adityagp@illinois.edu

Abstract

Instance-level image segmentation provides rich information crucial for scene understanding in a variety of real-world applications. In this paper, we evaluate multiple crowdsourced algorithms for the image segmentation problem, including novel worker-aggregation-based methods and retrieval-based methods from prior work. We characterize the different types of worker errors observed in crowdsourced segmentation, and present a clustering algorithm as a preprocessing step that is able to capture and eliminate errors arising due to workers having different semantic perspectives. We demonstrate that aggregation-based algorithms attain higher accuracies than existing retrieval-based approaches, while scaling better with increasing numbers of worker segmentations.

Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.

Figure 1: Taxonomy of quality evaluation algorithms for crowdsourced segmentation, including existing methods (blue) and our novel algorithms (yellow).

1 Introduction

Precise, instance-level object segmentation is crucial for identifying and tracking objects in a variety of emerging real-world applications, including robotics (Natonek 1998), image organization and retrieval (Yamaguchi 2012), and medicine (Irshad et al. 2014). To this end, there has been a lot of work on employing crowdsourcing to generate training data for segmentation, including PASCAL VOC (Everingham et al. 2015), LabelMe (Torralba et al. 2010), OpenSurfaces (Bell et al. 2015), and MS-COCO (Lin et al. 2014). Unfortunately, raw data collected from the crowd is known to be noisy due to varying degrees of worker skill, attention, and motivation (Bell et al. 2014; Welinder et al. 2010).

To deal with these challenges, many have employed heuristics indicative of crowdsourced segmentation quality to pick the best worker-provided segmentation (Sorokin and Forsyth 2008; Vittayakorn and Hays 2011). However, this approach ends up discarding the majority of the worker segmentations and is limited by what the best worker can do.

In this paper, we make two contributions. First, we introduce a novel class of aggregation-based methods, described in Section 4, that incorporate portions of segmentations from multiple workers into a combined result. To our surprise, despite its intuitive simplicity, we have not seen this class of algorithms described or evaluated in prior work. We evaluate this class of algorithms against existing methods in Section 6. Second, our analysis of common worker errors in crowdsourced segmentation shows that workers often segment the wrong objects, or erroneously include or exclude large semantically ambiguous portions of an object in the resulting segmentation. We discuss such errors in Section 3 and propose a clustering-based preprocessing technique that resolves them in Section 5.

2 Related Work

As shown in Figure 1, quality evaluation methods for crowdsourced segmentation can be classified into two categories.

Retrieval-based methods pick the "best" worker segmentation based on scoring criteria that evaluate the quality of each segmentation, drawing on vision information (Vittayakorn and Hays 2011; Russakovsky et al. 2015) and click-stream behavior (Cabezas et al. 2015; Sameki et al. 2015; Sorokin and Forsyth 2008).

Aggregation-based methods combine multiple worker segmentations to produce a final segmentation that is not restricted to any single worker segmentation. An aggregation-based majority-vote approach was employed by Sameki et al. (2015), but to create an expert-established gold standard for characterizing their dataset and algorithmic accuracies, rather than for segmentation quality evaluation as described here.
3 Error Analysis

On collecting and analyzing a number of crowdsourced segmentations (described in Section 6), we found that common worker segmentation errors can be classified into three types: (1) Semantic Ambiguity: workers have differing opinions on whether particular regions belong to an object (Figure 2 left: annotations around 'flower and vase' when 'vase' is requested); (2) Semantic Error: workers annotate the wrong object entirely (Figure 2 right: annotations around 'turtle' and 'monitor' when 'computer' is requested); and (3) Boundary Imperfection: workers make unintentional mistakes while drawing the boundaries, whether due to low image resolution, the small area of the object, or lack of drawing skill (Figure 3 left: imprecision around the 'dog' object).

Figure 2: Examples of common worker errors: semantic ambiguity ('vase'), semantic error ('computer'), and boundary imprecision ('dog'). Individual worker segmentations are shown alongside the ground truth; each task displays a pointer to the semantic object to be segmented.

Quality evaluation methods in prior work have largely focused on minimizing boundary imperfections. Accordingly, we first describe our novel aggregation-based algorithms designed to reduce boundary imperfections in Section 4. Next, in Section 5, we discuss a preprocessing method that eliminates semantic ambiguities and errors. We present our experimental evaluation in Section 6.

4 Fixing Boundary Imperfections

At the heart of our aggregation techniques is the tile data representation. A tile is the smallest non-overlapping discrete unit created by overlaying all of the workers' segmentations on top of each other. The tile representation allows us to aggregate segmentations from multiple workers, rather than being restricted to a single worker's segmentation, allowing us to fix one worker's errors with help from another. In Figure 3 (left), we display three worker segmentations for a toy example with six resulting tiles (t1 through t6). Any subset of these tiles can contribute towards the final segmentation.

Figure 3: Left: Toy example demonstrating the tiles created by three workers' segmentations around an object delineated by the black dotted line. Right: Segmentation boundaries drawn by five workers, shown in red; the overlaid segmentations create a mask whose color indicates the number of workers who voted for each tile region.

This simple but powerful idea of tiles also allows us to reformulate our problem from one of "generating a segmentation" to a setting that is much more familiar to crowdsourcing researchers. Since tiles are the lowest-granularity units created by overlaying all workers' segmentations on top of each other, each tile is either completely contained within or completely outside a given worker segmentation. Specifically, we can regard a worker segmentation as a set of boolean responses in which the worker votes 'yes' or 'no' on every tile independently: a worker votes 'yes' for every tile that is contained in their segmentation, and 'no' for every tile that is not. As shown in Figure 3 (left), tile t2 is voted 'yes' by workers 1, 2, and 3, while tile t3 is voted 'yes' by workers 2 and 3. The goal of our aggregation algorithms is to pick an appropriate set of tiles that effectively trades off precision versus recall.
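To make the tile representation concrete, below is a minimal sketch of how tiles and their vote matrix could be computed from per-worker boolean masks. This is our illustration rather than the authors' implementation: the function and variable names are ours, and pixels are grouped purely by their vote pattern, ignoring spatial connectivity, which is harmless here because the vote-based algorithms that follow depend only on the patterns.

```python
import numpy as np

def compute_tiles(worker_masks):
    """Group pixels into tiles and build the tile-level vote matrix.

    worker_masks: list of (H, W) boolean arrays, one per worker, where
    True marks pixels inside that worker's segmentation.
    Returns (tile_ids, votes): tile_ids is an (H, W) integer array giving
    each pixel's tile index, and votes is a (num_tiles, num_workers)
    boolean matrix with votes[t, w] True iff worker w's segmentation
    contains tile t.
    """
    stacked = np.stack(worker_masks, axis=-1)          # (H, W, num_workers)
    patterns = stacked.reshape(-1, len(worker_masks))  # one vote pattern per pixel
    # Pixels with identical yes/no patterns across all workers form one tile.
    votes, tile_ids = np.unique(patterns, axis=0, return_inverse=True)
    return tile_ids.reshape(stacked.shape[:2]), votes
```

Tile areas in pixels, needed by the area-sensitive algorithms later, can then be read off with np.bincount(tile_ids.ravel()).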
Now that we have modeled a segmentation as a collection of worker votes over tiles, we can develop familiar variants of standard quality evaluation algorithms for this setting.

Aggregation: Majority Vote Aggregation (MV)
This simple algorithm includes a tile in the output segmentation if and only if the tile has 'yes' votes from at least 50% of all workers.

Aggregation: Expectation-Maximization (EM)
Unlike MV, which assumes that all workers perform uniformly well, EM approaches infer the likelihood that a tile is part of the ground truth segmentation while simultaneously estimating hidden worker qualities. In Section 6 we evaluate an EM variant which assumes that each worker has a (different) fixed probability of casting a correct vote. Details of this variant, and of more fine-grained ones, can be found in our technical report (Lee et al. 2018).
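The following sketch implements both aggregators over the votes matrix computed above. It is a simplified reading, not the authors' code: the single per-worker accuracy parameter, the MV initialization, and the uniform 0.5 prior on tiles are our assumptions (the paper's exact models are in its technical report).

```python
import numpy as np

def majority_vote(votes):
    """MV: keep a tile iff at least 50% of workers voted 'yes' on it."""
    return votes.mean(axis=1) >= 0.5

def em_aggregate(votes, n_iters=20, prior=0.5):
    """EM sketch: each worker w is assumed to vote correctly on any tile
    with a fixed probability q[w], estimated jointly with the tiles."""
    v = votes.astype(float)                 # (num_tiles, num_workers)
    p = majority_vote(votes).astype(float)  # initialize tile posteriors with MV
    for _ in range(n_iters):
        # M-step: a worker's quality is their expected rate of agreement
        # with the current tile posteriors.
        agree = p[:, None] * v + (1 - p[:, None]) * (1 - v)
        q = np.clip(agree.mean(axis=0), 1e-3, 1 - 1e-3)
        # E-step: posterior log-odds that each tile is in the ground truth.
        log_in = (v * np.log(q) + (1 - v) * np.log(1 - q)).sum(axis=1)
        log_out = (v * np.log(1 - q) + (1 - v) * np.log(q)).sum(axis=1)
        logit = np.clip(log_in - log_out + np.log(prior / (1 - prior)), -500, 500)
        p = 1.0 / (1.0 + np.exp(-logit))
    return p >= 0.5, q
```

Thresholding the posteriors p at 0.5 yields the output tile set; the posteriors themselves are reused by the greedy sketch below.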
Aggregation: Greedy Tile Picking (greedy)
The greedy algorithm picks tiles in descending order of the ratio of their (estimated) overlap area with the ground truth to their (estimated) non-overlap area with the ground truth, for as long as the (estimated) Jaccard similarity of the resulting segmentation continues to increase; a sketch appears at the end of this section. Intuitively, tiles with a high overlap area and low non-overlap area contribute to high recall with limited loss of precision. Since the tile overlap and non-overlap areas, and the Jaccard similarity of segmentations with the ground truth, are unknown, we use different heuristics to estimate these values. We discuss the details of this algorithm and its theoretical guarantees in our technical report.

Retrieval: Number of Control Points (num pts)
This algorithm picks the worker segmentation with the largest number of control points along the segmentation boundary (i.e., the most precise drawing) as the output segmentation (Vittayakorn and Hays 2011; Sorokin and Forsyth 2008).
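Below is a minimal sketch of the greedy procedure. Since the paper leaves its estimation heuristics to the technical report, we substitute one plausible choice, which is our assumption: expected areas derived from the EM posteriors p and the tile areas stand in for the unknown overlap and non-overlap areas.

```python
import numpy as np

def greedy_tiles(areas, p):
    """Greedy tile picking over estimated areas.

    areas[t] is tile t's area in pixels; p[t] estimates the probability
    that tile t lies inside the ground truth (e.g., the EM posterior).
    """
    overlap = p * areas             # estimated area shared with ground truth
    nonoverlap = (1 - p) * areas    # estimated area outside ground truth
    gt_area = overlap.sum()         # estimated total ground-truth area
    # Visit tiles in descending order of the overlap/non-overlap ratio.
    order = np.argsort(-overlap / np.maximum(nonoverlap, 1e-9))
    chosen = np.zeros(len(areas), dtype=bool)
    ia, extra, best_j = 0.0, 0.0, 0.0
    for t in order:
        # Estimated Jaccard if tile t were added to the current selection.
        j = (ia + overlap[t]) / (gt_area + extra + nonoverlap[t])
        if j <= best_j:
            break  # stop once the estimated Jaccard no longer increases
        chosen[t] = True
        ia, extra, best_j = ia + overlap[t], extra + nonoverlap[t], j
    return chosen
```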
nificantly better than retrieval-based methods Retrieval-based Aggregation-based Algorithm num pts worker* MV EM greedy greedy* Worker Scaling -6.30 2.58 2.12 1.78 2.07 5.38 Clustering Effect 5.92 -0.02 2.05 0.03 5.73 0.283 Table 1: Jaccard percentage change due to worker scaling and clustering. Algorithms with * use ground truth information. 7 Conclusion and Future Work We identified three different types of errors for crowdsourced image segmentation, developed a clustering-based method to capture the semantic diversity caused by differing worker Figure 5: Performance of the original algorithms that do not perspectives, and introduced novel aggregation-based meth- make use of ground truth information (Left) and ones that ods that produce more accurate segmentations than existing do (Right). Here, the EM result overlaps with MV as they retrieval-based methods. exhibit similar performance. Other diverging variants of EM Our preliminary studies show that our worker quality mod- is described in our technical report. els are good indicators of the actual accuracy of worker seg- mentations. We also observe that the greedy algorithm is capa- In Figure 5, we vary the number of worker segmentations ble of achieving close-to-perfect segmentation accuracy with along the x-axis and plot the average Jaccard score on the ground truth information. Given the success of aggregation- y-axis across different worker samples of a given size across based methods, including the simple majority vote algorithm, different algorithms. Figure 5 (left) shows that the perfor- we plan to use our worker quality insights to improve our mance of aggregation-based algorithms (greedy, EM) ex- EM and greedy algorithms. We are also working on using ceeds the best achievable through existing retrieval-based computer vision signals to further improve our algorithms. References mon objects in context. European Conference on Computer [Bell et al. 2014] Sean Bell, Kavita Bala, and Noah Snavely. Vision (ECCV), 8693 LNCS(PART 5):740–755, 2014. Intrinsic images in the wild. ACM Trans. on Graphics (SIG- [Natonek 1998] E. Natonek. Fast range image segmenta- GRAPH), 33(4), 2014. tion for servicing robots. In Proceedings. 1998 IEEE In- [Bell et al. 2015] Sean Bell, Paul Upchurch, Noah Snavely, ternational Conference on Robotics and Automation (Cat. and Kavita Bala. Material recognition in the wild with the No.98CH36146), volume 1, pages 406–411 vol.1, May 1998. materials in context database. Computer Vision and Pattern [Russakovsky et al. 2015] Olga Russakovsky, Li-Jia Li, and Recognition (CVPR), 2015. Li Fei-Fei. Best of Both Worlds: Human-Machine Collabora- [Cabezas et al. 2015] Ferran Cabezas, Axel Carlier, Vincent tion for Object Annotation. pages 2121–2131, 2015. Charvillat, Amaia Salvador, and Xavier Giro-I-Nieto. Quality [Sameki et al. 2015] Mehrnoosh Sameki, Danna Gurari, and control in crowdsourced object segmentation. Proceedings of Margrit Betke. Characterizing Image Segmentation Behavior International Conference on Image Processing, ICIP, 2015- of the Crowd. pages 1–4, 2015. Decem:4243–4247, 2015. [Sorokin and Forsyth 2008] Alexander Sorokin and David [Everingham et al. 2015] M. Everingham, S. M. A. Eslami, Forsyth. Utility data annotaton with Amazon Mechanical L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. Turk. Proceedings of the 1st IEEE Workshop on Internet The pascal visual object classes challenge: A retrospective. Vision at CVPR 08, (c):1–8, 2008. 
6 Experimental Evaluation

Dataset Description
We collected crowdsourced segmentations from Amazon Mechanical Turk; each HIT consisted of one segmentation task for a specific pre-labeled object in an image, and workers were compensated $0.05 per task. There were a total of 46 objects in 9 images from the MS-COCO dataset (Lin et al. 2014), each segmented by 40 different workers, resulting in a total of 1,840 segmentations. Each task contained a keyword for the object and a pointer indicating the object to be segmented. Two of the authors generated the ground truth segmentations by carefully segmenting the objects using the same interface.

Evaluation Metrics
Our evaluation metrics measure how well the final segmentation (S) produced by an algorithm compares against the ground truth (GT). We use the Jaccard score J = IA(S)/UA(S), where IA(S) = area(S ∩ GT) is the intersection area and UA(S) = area(S ∪ GT) is the union area between the final and ground truth segmentations.

Experiment 1: Aggregation-based methods perform significantly better than retrieval-based methods.
In Figure 5, we vary the number of worker segmentations along the x-axis and plot the average Jaccard score on the y-axis, averaged over worker samples of a given size, for each algorithm. Figure 5 (left) shows that the performance of the aggregation-based algorithms (greedy, EM) exceeds the best achievable by existing retrieval-based methods (Retrieval). In Figure 5 (right), we estimate the upper-bound performance of each algorithm by assuming that 'full information' based on the ground truth is given to the algorithm. For greedy, the algorithm knows the actual tile overlap and non-overlap areas against the ground truth; for EM, the true worker quality parameter values (under our worker quality model) are known; for retrieval, the full-information version directly picks the worker with the highest Jaccard similarity with respect to the ground truth. By making use of ground truth information (Figure 5 right), the best aggregation-based algorithm achieves a close-to-perfect average Jaccard score of 0.98 as an upper bound, far exceeding the results achievable by any single 'best' worker (J = 0.91). This result demonstrates that aggregation-based methods achieve better performance by performing inference at the tile granularity, which is guaranteed to be finer grained than any individual worker segmentation.

Figure 5: Performance of the algorithms that do not make use of ground truth information (left) and of those that do (right). The EM result overlaps with MV, as they exhibit similar performance; other, diverging variants of EM are described in our technical report.

The performance of aggregation-based methods scales well as more worker segmentations are added.
Intuitively, larger numbers of worker segmentations result in finer-granularity tiles for the aggregation-based methods. The first row of Table 1 shows the average percentage change in performance between 5-worker and 30-worker samples. We observe that aggregation-based methods typically improve with an increasing number of workers, while this is not generally true for retrieval-based methods.

Experiment 2: Clustering as preprocessing improves algorithmic performance.
The second row of Table 1 shows the average percentage change in Jaccard score when clustering preprocessing is used. Clustering generally results in an accuracy increase; however, since the 'full information' variants are already free of semantic errors, we see no further improvement for those variants.

                    Retrieval-based      Aggregation-based
Algorithm           num pts   worker*    MV      EM      greedy   greedy*
Worker Scaling      -6.30     2.58       2.12    1.78    2.07     5.38
Clustering Effect    5.92    -0.02       2.05    0.03    5.73     0.283

Table 1: Jaccard percentage change due to worker scaling and clustering. Algorithms marked with * use ground truth information.

7 Conclusion and Future Work

We identified three types of errors in crowdsourced image segmentation, developed a clustering-based method to capture the semantic diversity caused by differing worker perspectives, and introduced novel aggregation-based methods that produce more accurate segmentations than existing retrieval-based methods.

Our preliminary studies show that our worker quality models are good indicators of the actual accuracy of worker segmentations. We also observe that the greedy algorithm is capable of achieving close-to-perfect segmentation accuracy when given ground truth information. Given the success of aggregation-based methods, including the simple majority vote algorithm, we plan to use our worker-quality insights to improve our EM and greedy algorithms. We are also working on using computer vision signals to further improve our algorithms.

References

[Bell et al. 2014] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Transactions on Graphics (SIGGRAPH), 33(4), 2014.
[Bell et al. 2015] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context database. Computer Vision and Pattern Recognition (CVPR), 2015.
[Cabezas et al. 2015] Ferran Cabezas, Axel Carlier, Vincent Charvillat, Amaia Salvador, and Xavier Giro-i-Nieto. Quality control in crowdsourced object segmentation. Proceedings of the International Conference on Image Processing (ICIP), pages 4243–4247, 2015.
[Everingham et al. 2015] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
[Irshad et al. 2014] H. Irshad et al. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: Evaluating experts, automated methods, and the crowd. Biocomputing 2015, pages 294–305, 2014.
[Lee et al. 2018] Doris Jung-Lin Lee, Akash Das Sarma, and Aditya Parameswaran. Aggregating crowdsourced image segmentations. Technical report, Stanford InfoLab (ilpubs.stanford.edu:8090/1161/), 2018.
[Lin et al. 2012] Christopher H. Lin, Mausam, and Daniel S. Weld. Crowdsourcing control: Moving beyond multiple choice. AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 491–500, 2012.
[Lin et al. 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. European Conference on Computer Vision (ECCV), 8693 LNCS(PART 5):740–755, 2014.
[Natonek 1998] E. Natonek. Fast range image segmentation for servicing robots. Proceedings of the 1998 IEEE International Conference on Robotics and Automation, volume 1, pages 406–411, May 1998.
[Russakovsky et al. 2015] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. Computer Vision and Pattern Recognition (CVPR), pages 2121–2131, 2015.
[Sameki et al. 2015] Mehrnoosh Sameki, Danna Gurari, and Margrit Betke. Characterizing image segmentation behavior of the crowd. pages 1–4, 2015.
[Sorokin and Forsyth 2008] Alexander Sorokin and David Forsyth. Utility data annotation with Amazon Mechanical Turk. Proceedings of the 1st IEEE Workshop on Internet Vision at CVPR '08, pages 1–8, 2008.
[Torralba et al. 2010] Antonio Torralba, Bryan C. Russell, and Jenny Yuen. LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98(8):1467–1484, 2010.
[Vittayakorn and Hays 2011] Sirion Vittayakorn and James Hays. Quality assessment for crowdsourced object annotations. Proceedings of the British Machine Vision Conference (BMVC), pages 109.1–109.11, 2011.
[Welinder et al. 2010] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. Conference on Neural Information Processing Systems (NIPS), 6:1–9, 2010.
[Yamaguchi 2012] Kota Yamaguchi. Parsing clothing in fashion photographs. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3577, Washington, DC, USA, 2012. IEEE Computer Society.