<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Aggregating Crowdsourced Image Segmentations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Doris</forename><forename type="middle">Jung-Lin</forename><surname>Lee</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Akash</forename><surname>Das Sarma</surname></persName>
							<email>akashds@fb.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Facebook, Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
							<email>adityagp@illinois.edu</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Aggregating Crowdsourced Image Segmentations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E9238142A255E8C57818750A11D41AAA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Instance-level image segmentation provides rich information crucial for scene understanding in a variety of real-world applications. In this paper, we evaluate multiple crowdsourced algorithms for the image segmentation problem, including novel worker-aggregation-based methods and retrieval-based methods from prior work. We characterize the different types of worker errors observed in crowdsourced segmentation, and present a clustering algorithm as a preprocessing step that is able to capture and eliminate errors arising due to workers having different semantic perspectives. We demonstrate that aggregation-based algorithms attain higher accuracies than existing retrieval-based approaches, while scaling better with increasing numbers of worker segmentations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Precise, instance-level object segmentation is crucial for identifying and tracking objects in a variety of real-world emergent applications of autonomy, including robotics (Natonek 1998), image organization and retrieval (Yamaguchi 2012), and medicine <ref type="bibr" target="#b1">(Irshad and et. al. 2014)</ref>. To this end, there has been a lot of work on employing crowdsourcing to generate training data for segmentation, including Pascal-VOC <ref type="bibr" target="#b0">(Everingham et al. 2015)</ref>, LabelMe <ref type="bibr" target="#b2">(Torralba et al. 2010)</ref>, OpenSurfaces <ref type="bibr" target="#b0">(Bell et al. 2015)</ref>, and MS-COCO <ref type="bibr" target="#b2">(Lin et al. 2012)</ref>. Unfortunately, raw data collected from the crowd is known to be noisy due to varying degrees of worker skills, attention, and motivation <ref type="bibr" target="#b0">(Bell et al. 2014;</ref><ref type="bibr" target="#b2">Welinder et al. 2010)</ref>.</p><p>To deal with these challenges, many have employed heuristics indicative of crowdsourced segmentation quality to pick the best worker-provided segmentation <ref type="bibr" target="#b2">(Sorokin and Forsyth 2008;</ref><ref type="bibr">Vittayakorn and Hays 2011)</ref>. However, this approach ends up discarding the majority of the worker segmentations and is limited by what the best worker can do. In this paper, we make two contributions: First, we introduce a novel class of aggregation-based methods that incorporates portions of segmentations from multiple workers into a combined one described in Section 4. To our surprise, despite its intuitive simplicity, we have not seen this class of algorithms described or evaluated in prior work. We evaluate this class of algorithms against existing methods in Section 6. Second, our analysis of common worker errors in crowdsourced segmentation shows that workers often segment the wrong objects or erroneously include or exclude large semantically-ambiguous portions of an object in the resulting segmentation. We dis- </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>As shown in Figure <ref type="figure" target="#fig_0">1</ref>, quality evaluation methods for crowdsourced segmentation can be classified into two categories: Retrieval-based methods pick the "best" worker segmentation based on some scoring criteria that evaluates the quality of each segmentation, including vision information <ref type="bibr">(Vittayakorn and Hays 2011;</ref><ref type="bibr" target="#b2">Russakovsky et al. 2015)</ref>, and clickstream behavior <ref type="bibr" target="#b0">(Cabezas et al. 2015;</ref><ref type="bibr" target="#b2">Sameki et al. 2015;</ref><ref type="bibr" target="#b2">Sorokin and Forsyth 2008)</ref>. Aggregation-based methods combine multiple worker segmentations to produce a final segmentation that is not restricted to any single worker segmentation. An aggregationbased majority vote approach was employed in <ref type="bibr" target="#b2">Sameki et al. (2015)</ref> to create an expert-established gold standard for characterizing their dataset and algorithmic accuracies, rather than for segmentation quality evaluation as described here.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Error Analysis</head><p>On collecting and analyzing a number of crowdsourced segmentations (described in Section 6), we found that common worker segmentation errors can be classified into three types:</p><p>(1) Semantic Ambiguity: workers have differing opinions on whether particular regions belong to an object (Figure <ref type="figure" target="#fig_1">2</ref> left: annotations around 'flower and vase' when 'vase' is requested); (2) Semantic Error: workers annotate the wrong object entirely (Figure <ref type="figure" target="#fig_1">2</ref> right: annotations around 'turtle <ref type="bibr">' and 'monitor' when 'computer' is requested.)</ref>; and (3) Boundary Imperfection: workers make unintentional mistakes while drawing the boundaries, either due to low image resolution, small area of the object, or lack of drawing skills (Figure <ref type="figure" target="#fig_3">3</ref> left: imprecision around the 'dog' object).</p><p>Quality evaluation methods in prior work have largely focused on minimizing boundary imperfection issues. So, we first describe our novel aggregation-based algorithms designed to reduce boundary imperfections in Section 4. Next, in Section 5, we discuss a preprocessing method that eliminates semantic ambiguities and errors. We present our experimental evaluation in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Semantic Ambiguity [vase] Semantic Error [computer] Boundary Imprecision [dog]</head><p>ground truth individual workers pointer to semantic object to be segmented </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Fixing Boundary Imperfections</head><p>At the heart of our aggregation techniques is the tile data representation. A tile is the smallest non-overlapping discrete unit created by overlaying all of the workers' segmentations on top of each other. The tile representation allows us to aggregate segmentations from multiple workers, rather than being restricted to a single worker's segmentation, allowing us to fix one worker's errors with help from another. In Figure <ref type="figure" target="#fig_3">3</ref> (left), we display three worker segmentations for a toy example with 6 resulting tiles. Any subset of these tiles can contribute towards the final segmentation. This simple but powerful idea of tiles also allows us to reformulate our problem from one of "generating a segmentation" to a setting that is much more familiar to crowdsourcing researchers. Since tiles are the lowest granularity units created by overlaying all workers' segmentations on top of each other, each tile is either completely contained within or outside a given worker segmentation. Specifically, we can regard a worker segmentation as multiple boolean responses where the worker has voted 'yes' or 'no' to every tile independently. Intuitively, a worker votes 'yes' for every tile that is contained in their segmentation, and 'no' for every tile that is not. As shown in Figure <ref type="figure" target="#fig_3">3</ref> (right), tile t 2 is voted 'yes' by worker 1, 2, and 3; tile t 3 is voted 'yes' by worker 2 and 3. The goal of our aggregation algorithms is to pick an appropriate set of tiles that effectively trades off precision versus recall. Now that we have modeled segmentation as a collection of worker votes for tiles, we can now develop familiar variants of standard quality evaluation algorithms for this setting. Aggregation: Majority Vote Aggregation (MV) This simple algorithm includes a tile in the output segmentation if and only if the tile has 'yes' votes from at least 50% of all workers. Aggregation: Expectation-Maximization (EM) Unlike MV, which assumes that all workers perform uniformly, EM approaches infer the likelihood that a tile is part of the ground truth segmentation, while simultaneously estimating hidden worker qualities. In Section 6 we evaluate an EM variant which assumes that each worker has a (different) fixed probability for a correct vote. Details of this, and more fine-grained variants can be found in our technical report <ref type="bibr" target="#b1">(Lee et al. 2018)</ref>. Aggregation: Greedy Tile Picking (greedy) The greedy algorithm picks tiles in descending order of the tiles' ratios of (estimated) overlap area with the ground truth to (estimated) non-overlap area with ground truth, for as long as the (estimated) Jaccard similarity of the resulting segmentation continues to increase. Intuitively, tiles that have a high overlap area and low non-overlap area contribute to high recall with limited loss of precision. Since tile overlap and non-overlap areas, and Jaccard similarity of segmentations with ground truth are unknown, we use different heuristics to estimate these values. We discuss details of this algorithm and its theoretical guarantees in our technical report.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Perspective Resolution</head><p>As discussed in Section 3, disagreements often arise in segmentation due to differing worker perspectives on large tile regions. We developed a clustering-based preprocessing approach to resolve this issue. Based on the intuition that workers with similar perspectives will have segmentations that are close to each other, we compute the Jaccard similarity between each pair of segmentations and perform spectral clustering to separate the segmentations into clusters. <ref type="bibr">Figure 2 (bottom)</ref> illustrates how spectral clustering divides the worker segmentations into clusters with meaningful semantic associations, reflecting the diversity of perspectives for the same task. Clustering results can be used as a preprocessing step for any quality evaluation algorithm by keeping only the segmentations that belong to the largest cluster, which is typically free of semantic errors.</p><p>In addition, clustering offers the additional benefit of preserving a worker's semantic intentions. For example, while the green cluster in Figure <ref type="figure" target="#fig_1">2</ref> (bottom right) would be considered bad segmentations for the particular task ('computer'), this cluster can provide more data for another segmentation task corresponding to 'monitor'. A potential future work direction would be to crowdsource the semantic labels for the computed clusters to enable the reuse of segmentations across multiple objects to lower costs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>After Clustering</head><p>Colors denote clusters with different worker perspectives. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experimental Evaluation Dataset Description</head><p>We collected crowdsourced segmentations from Amazon Mechanical Turk; each HIT consisted of one segmentation task for a specific pre-labeled object in an image. Workers were compensated $0.05 per task. There were a total of 46 objects in 9 images from the MSCOCO dataset (Lin et al. 2014) segmented by 40 different workers each, resulting in a total of 1840 segmentations. Each task contained a keyword for the object and a pointer indicating the object to be segmented. Two of the authors generated the ground truth segmentations by carefully segmenting the objects using the same interface.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation Metrics</head><p>Evaluation metrics used in our experiments measure how well the final segmentation (S) produced by these algorithms compare against ground truth (GT). We use the Jaccard score Jaccard (J) = U A(S) IA(S) , which accounts for the intersection area, IA = area(S ∩ GT ) and union area, U A = area(S ∪ GT ) between the worker and ground truth segmentations. Experiment 1: Aggregation-based methods perform significantly better than retrieval-based methods Figure <ref type="figure">5</ref>: Performance of the original algorithms that do not make use of ground truth information (Left) and ones that do (Right). Here, the EM result overlaps with MV as they exhibit similar performance. Other diverging variants of EM is described in our technical report.</p><p>In Figure <ref type="figure">5</ref>, we vary the number of worker segmentations along the x-axis and plot the average Jaccard score on the y-axis across different worker samples of a given size across different algorithms. Figure <ref type="figure">5</ref> (left) shows that the performance of aggregation-based algorithms (greedy, EM) exceeds the best achievable through existing retrieval-based methods (Retrieval). Then, in Figure <ref type="figure">5</ref> (right), we estimate the upper-bound performance of each algorithm by assuming that 'full information' based on ground truth is given to the algorithm. For greedy, the algorithm is aware of all the actual tile overlap and non-overlap areas against ground truth. For EM, the true worker quality parameter values (under our worker quality model) are known. For retrieval, the full information version directly picks the worker with the highest Jaccard similarity with respect to the ground truth. By making use of ground truth information (Figure <ref type="figure">5</ref> right), the best aggregation-based algorithm can achieve a close-toperfect average Jaccard score of 0.98 as an upper bound, far exceeding the results achievable by any single 'best' worker (J=0.91). This result demonstrates that aggregation-based methods are able to achieve better performance by performing inference at the tile granularity, which is guaranteed to be finer grained than any individual worker segmentation. The performance of aggregation-based methods scale well as more worker segmentations are added. Intuitively, larger numbers of worker segmentations result in finer granularity tiles for the aggregation-based methods. The first row in Table <ref type="table" target="#tab_0">1</ref> shows the average percentage change in performance between 5-workers and 30-workers samples. We observe that aggregation based methods typically improve in performance with an increase in number of workers, while this is not generally true for retrieval-based methods. Experiment 2: Clustering as preprocessing improves algorithmic performance.</p><p>The second row in Table <ref type="table" target="#tab_0">1</ref> shows the average percentage Jaccard change when clustering preprocessing is used. While clustering generally results in an accuracy increase, since the 'full information' variants are already free of semantic errors, we do not see further improvement for these variants. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and Future Work</head><p>We identified three different types of errors for crowdsourced image segmentation, developed a clustering-based method to capture the semantic diversity caused by differing worker perspectives, and introduced novel aggregation-based methods that produce more accurate segmentations than existing retrieval-based methods.</p><p>Our preliminary studies show that our worker quality models are good indicators of the actual accuracy of worker segmentations. We also observe that the greedy algorithm is capable of achieving close-to-perfect segmentation accuracy with ground truth information. Given the success of aggregationbased methods, including the simple majority vote algorithm, we plan to use our worker quality insights to improve our EM and greedy algorithms. We are also working on using computer vision signals to further improve our algorithms.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Taxonomy of quality evaluation algorithms for crowdsourced segmentation, including existing methods (blue) and our novel algorithms (yellow).</figDesc><graphic coords="1,355.28,162.00,166.95,55.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Examples of common worker errors.</figDesc><graphic coords="2,106.99,74.48,133.71,100.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Retrieval: Number of Control Points (num pts) This algorithm picks the worker segmentation with the largest number of control points around the segmentation boundary (i.e., the most precise drawing) as the output segmentation (Vittayakorn and Hays 2011; Sorokin and Forsyth 2008).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Left: Toy example demonstrating tiles created by three workers' segmentations around an object delineated by the black dotted line. Right: Segmentation boundaries drawn by five workers shown in red. Overlaid segmentation creates a mask where the color indicates the number of workers who voted for the tile region.</figDesc><graphic coords="2,424.87,226.11,117.66,103.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Example image showing clustering performed on the same object from Figure 2 left and middle.</figDesc><graphic coords="3,106.63,61.29,114.87,66.26" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Jaccard percentage change due to worker scaling and clustering. Algorithms with * use ground truth information.</figDesc><table><row><cell></cell><cell cols="3">Retrieval-based Aggregation-based</cell></row><row><cell>Algorithm</cell><cell cols="3">num pts worker* MV EM greedy greedy*</cell></row><row><cell cols="3">Worker Scaling -6.30 2.58</cell><cell>2.12 1.78 2.07 5.38</cell></row><row><cell cols="2">Clustering Effect 5.92</cell><cell>-0.02</cell><cell>2.05 0.03 5.73 0.283</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Material recognition in the wild with the materials in context database</title>
		<author>
			<persName><surname>Bell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Image Processing, ICIP</title>
				<meeting>International Conference on Image Processing, ICIP</meeting>
		<imprint>
			<date type="published" when="2014">2014. 2014. 2015. 2015-Decem. 2015. January 2015</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="98" to="136" />
		</imprint>
	</monogr>
	<note>ACM Trans. on Graphics (SIG-GRAPH)</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Crowdsourcing Image Annotation for Nucleus Detection and Segmentation in Computational Pathology: Evaluating Experts, Automated Methods, and the Crowd</title>
		<author>
			<persName><surname>Irshad</surname></persName>
		</author>
		<idno>stanford.edu:8090/1161/</idno>
	</analytic>
	<monogr>
		<title level="j">Biocomputing</title>
		<imprint>
			<biblScope unit="page" from="294" to="305" />
			<date type="published" when="2014">2014. 2015. 2014. 2018. 2018</date>
		</imprint>
		<respStmt>
			<orgName>Stanford InfoLab</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
	<note>Aggregating crowdsourced image segmentations</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Crowdsourcing control : Moving beyond multiple choice</title>
		<author>
			<persName><forename type="first">Lin</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Quality Assessment for Crowdsourced Object Annotations. Procedings of the British Machine Vision Conference</title>
				<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="1998">2012. 2012. 2014. May 1998. 2015. 2015. 2015. 2008. 2010. 2011. 2010. 2010. 2012</date>
			<biblScope unit="volume">8693</biblScope>
			<biblScope unit="page" from="3570" to="3577" />
		</imprint>
	</monogr>
	<note>CVPR &apos;12</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
