Aggregating Crowdsourced Image Segmentations
          Doris Jung-Lin Lee                             Akash Das Sarma                      Aditya Parameswaran
 University of Illinois, Urbana-Champaign                    Facebook, Inc.            University of Illinois, Urbana-Champaign
           jlee782@illinois.edu                             akashds@fb.com                       adityagp@illinois.edu

                           Abstract
  Instance-level image segmentation provides rich information
  crucial for scene understanding in a variety of real-world ap-
  plications. In this paper, we evaluate multiple crowdsourced
  algorithms for the image segmentation problem, including
  novel worker-aggregation-based methods and retrieval-based
                                                                      Figure 1: Taxonomy of quality evaluation algorithms for
  methods from prior work. We characterize the different types
  of worker errors observed in crowdsourced segmentation, and         crowdsourced segmentation, including existing methods
  present a clustering algorithm as a preprocessing step that is      (blue) and our novel algorithms (yellow).
  able to capture and eliminate errors arising due to workers
  having different semantic perspectives. We demonstrate that
  aggregation-based algorithms attain higher accuracies than          cuss such errors in Section 3 and propose a clustering-based
  existing retrieval-based approaches, while scaling better with      preprocessing technique that resolves them in Section 5.
  increasing numbers of worker segmentations.
                                                                                          2    Related Work
                     1    Introduction                                As shown in Figure 1, quality evaluation methods for crowd-
                                                                      sourced segmentation can be classified into two categories:
Precise, instance-level object segmentation is crucial for            Retrieval-based methods pick the “best” worker segmenta-
identifying and tracking objects in a variety of real-world           tion based on some scoring criteria that evaluates the qual-
emergent applications of autonomy, including robotics (Na-            ity of each segmentation, including vision information (Vit-
tonek 1998), image organization and retrieval (Yamaguchi              tayakorn and Hays 2011; Russakovsky et al. 2015), and click-
2012), and medicine (Irshad and et. al. 2014). To this end,           stream behavior (Cabezas et al. 2015; Sameki et al. 2015;
there has been a lot of work on employing crowdsourcing               Sorokin and Forsyth 2008).
to generate training data for segmentation, including Pascal-         Aggregation-based methods combine multiple worker seg-
VOC (Everingham et al. 2015), LabelMe (Torralba et al.                mentations to produce a final segmentation that is not re-
2010), OpenSurfaces (Bell et al. 2015), and MS-COCO (Lin              stricted to any single worker segmentation. An aggregation-
et al. 2012). Unfortunately, raw data collected from the              based majority vote approach was employed in Sameki et
crowd is known to be noisy due to varying degrees of                  al. (2015) to create an expert-established gold standard for
worker skills, attention, and motivation (Bell et al. 2014;           characterizing their dataset and algorithmic accuracies, rather
Welinder et al. 2010).                                                than for segmentation quality evaluation as described here.
   To deal with these challenges, many have employed heuris-
tics indicative of crowdsourced segmentation quality to pick                             3    Error Analysis
the best worker-provided segmentation (Sorokin and Forsyth
2008; Vittayakorn and Hays 2011). However, this approach               On collecting and analyzing a number of crowdsourced seg-
ends up discarding the majority of the worker segmentations            mentations (described in Section 6), we found that common
and is limited by what the best worker can do. In this paper,         worker segmentation errors can be classified into three types:
we make two contributions: First, we introduce a novel class          (1) Semantic Ambiguity: workers have differing opinions
of aggregation-based methods that incorporates portions of             on whether particular regions belong to an object (Figure 2
segmentations from multiple workers into a combined one                left: annotations around ‘flower and vase’ when ‘vase’ is re-
described in Section 4. To our surprise, despite its intuitive         quested); (2) Semantic Error: workers annotate the wrong
simplicity, we have not seen this class of algorithms described        object entirely (Figure 2 right: annotations around ‘turtle’ and
or evaluated in prior work. We evaluate this class of algo-           ‘monitor’ when ‘computer’ is requested.); and (3) Boundary
rithms against existing methods in Section 6. Second, our              Imperfection: workers make unintentional mistakes while
analysis of common worker errors in crowdsourced segmen-               drawing the boundaries, either due to low image resolution,
tation shows that workers often segment the wrong objects or           small area of the object, or lack of drawing skills (Figure 3
erroneously include or exclude large semantically-ambiguous            left: imprecision around the ‘dog’ object).
portions of an object in the resulting segmentation. We dis-              Quality evaluation methods in prior work have largely
                                                                       focused on minimizing boundary imperfection issues. So,
Copyright c 2018for this paper by its authors. Copying permitted      we first describe our novel aggregation-based algorithms de-
for private and academic purposes.                                     signed to reduce boundary imperfections in Section 4. Next,
in Section 5, we discuss a preprocessing method that elimi-                                       to (estimated) non-overlap area with ground truth, for as long
nates semantic ambiguities and errors. We present our exper-                                      as the (estimated) Jaccard similarity of the resulting segmen-
imental evaluation in Section 6.                                                                  tation continues to increase. Intuitively, tiles that have a high
  Semantic
                                                                                                  overlap area and low non-overlap area contribute to high
                    Semantic Error [computer]   Boundary Imprecision [dog]
Ambiguity [vase]                                                                                  recall with limited loss of precision. Since tile overlap and
                                                                                                  non-overlap areas, and Jaccard similarity of segmentations
                                                                             individual workers

                                                                             ground truth
                                                                                                  with ground truth are unknown, we use different heuristics
                                                                             pointer to           to estimate these values. We discuss details of this algorithm
                                                                             semantic object
                                                                             to be segmented      and its theoretical guarantees in our technical report.
                                                                                                  Retrieval: Number of Control Points (num pts)
                                                                                                  This algorithm picks the worker segmentation with the largest
           Figure 2: Examples of common worker errors.                                            number of control points around the segmentation boundary
                                                                                                  (i.e., the most precise drawing) as the output segmentation
                                                                                                  (Vittayakorn and Hays 2011; Sorokin and Forsyth 2008).
            4      Fixing Boundary Imperfections
At the heart of our aggregation techniques is the tile data                                                                                         Tile-based Inference
                                                                                                                worker 1          worker 3          (5 worker example)
representation. A tile is the smallest non-overlapping discrete                                                 worker 2          object boundary
unit created by overlaying all of the workers’ segmentations
on top of each other. The tile representation allows us to                                                                   t6
aggregate segmentations from multiple workers, rather than
being restricted to a single worker’s segmentation, allowing                                               t1           t2           t3      t4
us to fix one worker’s errors with help from another. In Figure
3 (left), we display three worker segmentations for a toy                                                                    t5
example with 6 resulting tiles. Any subset of these tiles can
contribute towards the final segmentation.
   This simple but powerful idea of tiles also allows us to                                       Figure 3: Left: Toy example demonstrating tiles created by
reformulate our problem from one of “generating a segmenta-                                       three workers’ segmentations around an object delineated by
tion” to a setting that is much more familiar to crowdsourcing                                    the black dotted line. Right: Segmentation boundaries drawn
researchers. Since tiles are the lowest granularity units cre-                                    by five workers shown in red. Overlaid segmentation creates
ated by overlaying all workers’ segmentations on top of each                                      a mask where the color indicates the number of workers who
other, each tile is either completely contained within or out-                                    voted for the tile region.
side a given worker segmentation. Specifically, we can regard
a worker segmentation as multiple boolean responses where
the worker has voted ‘yes’ or ‘no’ to every tile independently.
Intuitively, a worker votes ‘yes’ for every tile that is contained                                                  5        Perspective Resolution
in their segmentation, and ‘no’ for every tile that is not. As                                    As discussed in Section 3, disagreements often arise in seg-
shown in Figure 3 (right), tile t2 is voted ‘yes’ by worker 1,                                    mentation due to differing worker perspectives on large tile
2, and 3; tile t3 is voted ‘yes’ by worker 2 and 3. The goal                                      regions. We developed a clustering-based preprocessing ap-
of our aggregation algorithms is to pick an appropriate set of                                    proach to resolve this issue. Based on the intuition that work-
tiles that effectively trades off precision versus recall.                                        ers with similar perspectives will have segmentations that
   Now that we have modeled segmentation as a collection of                                       are close to each other, we compute the Jaccard similarity
worker votes for tiles, we can now develop familiar variants                                      between each pair of segmentations and perform spectral
of standard quality evaluation algorithms for this setting.                                       clustering to separate the segmentations into clusters. Fig-
Aggregation: Majority Vote Aggregation (MV)                                                       ure 2 (bottom) illustrates how spectral clustering divides the
This simple algorithm includes a tile in the output segmenta-                                     worker segmentations into clusters with meaningful semantic
tion if and only if the tile has ‘yes’ votes from at least 50%                                    associations, reflecting the diversity of perspectives for the
of all workers.                                                                                   same task. Clustering results can be used as a preprocessing
Aggregation: Expectation-Maximization (EM)                                                        step for any quality evaluation algorithm by keeping only
Unlike MV, which assumes that all workers perform uni-                                            the segmentations that belong to the largest cluster, which is
formly, EM approaches infer the likelihood that a tile is part                                    typically free of semantic errors.
of the ground truth segmentation, while simultaneously es-                                           In addition, clustering offers the additional benefit of pre-
timating hidden worker qualities. In Section 6 we evaluate                                        serving a worker’s semantic intentions. For example, while
an EM variant which assumes that each worker has a (dif-                                          the green cluster in Figure 2 (bottom right) would be consid-
ferent) fixed probability for a correct vote. Details of this,                                    ered bad segmentations for the particular task (‘computer’),
and more fine-grained variants can be found in our technical                                      this cluster can provide more data for another segmentation
report (Lee et al. 2018).                                                                         task corresponding to ‘monitor’. A potential future work di-
Aggregation: Greedy Tile Picking (greedy)                                                         rection would be to crowdsource the semantic labels for the
The greedy algorithm picks tiles in descending order of the                                       computed clusters to enable the reuse of segmentations across
tiles’ ratios of (estimated) overlap area with the ground truth                                   multiple objects to lower costs.
                 After Clustering
                                                                    methods (Retrieval). Then, in Figure 5 (right), we estimate
                                                                    the upper-bound performance of each algorithm by assum-
                                                  Colors denote     ing that ‘full information’ based on ground truth is given
                                                  clusters with
                                                 diﬀerent worker
                                                                    to the algorithm. For greedy, the algorithm is aware of all
                                                  perspectives.     the actual tile overlap and non-overlap areas against ground
                                                                    truth. For EM, the true worker quality parameter values (un-
                                                                    der our worker quality model) are known. For retrieval, the
                                                                    full information version directly picks the worker with the
Figure 4: Example image showing clustering performed on
                                                                    highest Jaccard similarity with respect to the ground truth.
the same object from Figure 2 left and middle.
                                                                    By making use of ground truth information (Figure 5 right),
                                                                    the best aggregation-based algorithm can achieve a close-to-
                                                                    perfect average Jaccard score of 0.98 as an upper bound, far
            6   Experimental Evaluation                             exceeding the results achievable by any single ‘best’ worker
Dataset Description                                                (J=0.91). This result demonstrates that aggregation-based
We collected crowdsourced segmentations from Amazon Me-             methods are able to achieve better performance by perform-
chanical Turk; each HIT consisted of one segmentation task          ing inference at the tile granularity, which is guaranteed to
for a specific pre-labeled object in an image. Workers were         be finer grained than any individual worker segmentation.
compensated $0.05 per task. There were a total of 46 objects       The performance of aggregation-based methods scale
in 9 images from the MSCOCO dataset (Lin et al. 2014)              well as more worker segmentations are added.
segmented by 40 different workers each, resulting in a total        Intuitively, larger numbers of worker segmentations result in
of 1840 segmentations. Each task contained a keyword for            finer granularity tiles for the aggregation-based methods. The
the object and a pointer indicating the object to be segmented.     first row in Table 1 shows the average percentage change in
Two of the authors generated the ground truth segmentations         performance between 5-workers and 30-workers samples. We
by carefully segmenting the objects using the same interface.       observe that aggregation based methods typically improve in
Evaluation Metrics                                                  performance with an increase in number of workers, while
Evaluation metrics used in our experiments measure how              this is not generally true for retrieval-based methods.
well the final segmentation (S) produced by these algorithms        Experiment 2: Clustering as preprocessing improves al-
compare against ground truth (GT). We use the Jaccard score         gorithmic performance.
Jaccard (J) = UIA(S)
                  A(S)
                       , which accounts for the intersection       The second row in Table 1 shows the average percentage
                                                                   Jaccard change when clustering preprocessing is used. While
area, IA = area(S ∩ GT ) and union area, U A = area(S ∪             clustering generally results in an accuracy increase, since the
GT ) between the worker and ground truth segmentations.            ‘full information’ variants are already free of semantic errors,
Experiment 1: Aggregation-based methods perform sig-               we do not see further improvement for these variants.
nificantly better than retrieval-based methods
                                                                                        Retrieval-based Aggregation-based
                                                                     Algorithm         num pts worker* MV EM greedy greedy*
                                                                     Worker Scaling -6.30 2.58          2.12 1.78 2.07 5.38
                                                                     Clustering Effect 5.92     -0.02   2.05 0.03 5.73 0.283

                                                                   Table 1: Jaccard percentage change due to worker scaling and
                                                                   clustering. Algorithms with * use ground truth information.


                                                                            7    Conclusion and Future Work
                                                                   We identified three different types of errors for crowdsourced
                                                                   image segmentation, developed a clustering-based method
                                                                   to capture the semantic diversity caused by differing worker
Figure 5: Performance of the original algorithms that do not       perspectives, and introduced novel aggregation-based meth-
make use of ground truth information (Left) and ones that          ods that produce more accurate segmentations than existing
do (Right). Here, the EM result overlaps with MV as they           retrieval-based methods.
exhibit similar performance. Other diverging variants of EM           Our preliminary studies show that our worker quality mod-
is described in our technical report.                              els are good indicators of the actual accuracy of worker seg-
                                                                   mentations. We also observe that the greedy algorithm is capa-
In Figure 5, we vary the number of worker segmentations            ble of achieving close-to-perfect segmentation accuracy with
along the x-axis and plot the average Jaccard score on the         ground truth information. Given the success of aggregation-
y-axis across different worker samples of a given size across      based methods, including the simple majority vote algorithm,
different algorithms. Figure 5 (left) shows that the perfor-       we plan to use our worker quality insights to improve our
mance of aggregation-based algorithms (greedy, EM) ex-             EM and greedy algorithms. We are also working on using
ceeds the best achievable through existing retrieval-based         computer vision signals to further improve our algorithms.
                        References                                mon objects in context. European Conference on Computer
[Bell et al. 2014] Sean Bell, Kavita Bala, and Noah Snavely.      Vision (ECCV), 8693 LNCS(PART 5):740–755, 2014.
  Intrinsic images in the wild. ACM Trans. on Graphics (SIG-     [Natonek 1998] E. Natonek. Fast range image segmenta-
 GRAPH), 33(4), 2014.                                             tion for servicing robots. In Proceedings. 1998 IEEE In-
[Bell et al. 2015] Sean Bell, Paul Upchurch, Noah Snavely,        ternational Conference on Robotics and Automation (Cat.
  and Kavita Bala. Material recognition in the wild with the      No.98CH36146), volume 1, pages 406–411 vol.1, May 1998.
  materials in context database. Computer Vision and Pattern     [Russakovsky et al. 2015] Olga Russakovsky, Li-Jia Li, and
 Recognition (CVPR), 2015.                                        Li Fei-Fei. Best of Both Worlds: Human-Machine Collabora-
[Cabezas et al. 2015] Ferran Cabezas, Axel Carlier, Vincent       tion for Object Annotation. pages 2121–2131, 2015.
  Charvillat, Amaia Salvador, and Xavier Giro-I-Nieto. Quality   [Sameki et al. 2015] Mehrnoosh Sameki, Danna Gurari, and
  control in crowdsourced object segmentation. Proceedings of     Margrit Betke. Characterizing Image Segmentation Behavior
 International Conference on Image Processing, ICIP, 2015-        of the Crowd. pages 1–4, 2015.
  Decem:4243–4247, 2015.                                         [Sorokin and Forsyth 2008] Alexander Sorokin and David
[Everingham et al. 2015] M. Everingham, S. M. A. Eslami,          Forsyth. Utility data annotaton with Amazon Mechanical
  L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.      Turk. Proceedings of the 1st IEEE Workshop on Internet
 The pascal visual object classes challenge: A retrospective.     Vision at CVPR 08, (c):1–8, 2008.
 International Journal of Computer Vision, 111(1):98–136,        [Torralba et al. 2010] Antonio Torralba, Bryan C. Russell,
 January 2015.                                                    and Jenny Yuen. LabelMe: Online image annotation and
[Irshad and et. al. 2014] H Irshad and Montaser-Kouhsari et.      applications. Proceedings of the IEEE, 98(8):1467–1484,
  al. Crowdsourcing Image Annotation for Nucleus Detection        2010.
  and Segmentation in Computational Pathology: Evaluating        [Vittayakorn and Hays 2011] Sirion Vittayakorn and James
  Experts, Automated Methods, and the Crowd. Biocomputing         Hays. Quality Assessment for Crowdsourced Object Annota-
 2015, pages 294–305, 2014.                                       tions. Procedings of the British Machine Vision Conference,
[Lee et al. 2018] Doris Jung-Lin Lee, Akash Das Sarma,            pages 109.1–109.11, 2011.
  and Aditya Parameswaran. Aggregating crowdsourced im-          [Welinder et al. 2010] Peter Welinder, Steve Branson, Serge
  age segmentations. Technical report, Stanford InfoLab           Belongie, and Pietro Perona. The Multidimensional Wis-
 (ilpubs.stanford.edu:8090/1161/), 2018.                          dom of Crowds. NIPS (Conference on Neural Information
[Lin et al. 2012] Christopher H Lin, Mausam, and Daniel S         Processing Systems), 6:1–9, 2010.
 Weld. Crowdsourcing control : Moving beyond multiple            [Yamaguchi 2012] Kota Yamaguchi. Parsing clothing in fash-
  choice. AAAI Conference on Human Computation and                ion photographs. In Proceedings of the 2012 IEEE Confer-
 Crowdsourcing (HCOMP), pages 491–500, 2012.                      ence on Computer Vision and Pattern Recognition (CVPR),
[Lin et al. 2014] Tsung Yi Lin, Michael Maire, Serge Be-          CVPR ’12, pages 3570–3577, Washington, DC, USA, 2012.
  longie, James Hays, Pietro Perona, Deva Ramanan, Piotr          IEEE Computer Society.
  Dollár, and C. Lawrence Zitnick. Microsoft COCO: Com-