<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Aggregating Crowdsourced Image Segmentations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Doris</forename><forename type="middle">Jung-Lin</forename><surname>Lee</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Akash</forename><surname>Das Sarma</surname></persName>
							<email>akashds@fb.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Facebook, Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
							<email>adityagp@illinois.edu</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Aggregating Crowdsourced Image Segmentations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E9238142A255E8C57818750A11D41AAA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Instance-level image segmentation provides rich information crucial for scene understanding in a variety of real-world applications. In this paper, we evaluate multiple crowdsourced algorithms for the image segmentation problem, including novel worker-aggregation-based methods and retrieval-based methods from prior work. We characterize the different types of worker errors observed in crowdsourced segmentation, and present a clustering algorithm as a preprocessing step that is able to capture and eliminate errors arising due to workers having different semantic perspectives. We demonstrate that aggregation-based algorithms attain higher accuracies than existing retrieval-based approaches, while scaling better with increasing numbers of worker segmentations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Precise, instance-level object segmentation is crucial for identifying and tracking objects in a variety of real-world emergent applications of autonomy, including robotics (Natonek 1998), image organization and retrieval (Yamaguchi 2012), and medicine <ref type="bibr" target="#b1">(Irshad and et. al. 2014)</ref>. To this end, there has been a lot of work on employing crowdsourcing to generate training data for segmentation, including Pascal-VOC <ref type="bibr" target="#b0">(Everingham et al. 2015)</ref>, LabelMe <ref type="bibr" target="#b2">(Torralba et al. 2010)</ref>, OpenSurfaces <ref type="bibr" target="#b0">(Bell et al. 2015)</ref>, and MS-COCO <ref type="bibr" target="#b2">(Lin et al. 2012)</ref>. Unfortunately, raw data collected from the crowd is known to be noisy due to varying degrees of worker skills, attention, and motivation <ref type="bibr" target="#b0">(Bell et al. 2014;</ref><ref type="bibr" target="#b2">Welinder et al. 2010)</ref>.</p><p>To deal with these challenges, many have employed heuristics indicative of crowdsourced segmentation quality to pick the best worker-provided segmentation <ref type="bibr" target="#b2">(Sorokin and Forsyth 2008;</ref><ref type="bibr">Vittayakorn and Hays 2011)</ref>. However, this approach ends up discarding the majority of the worker segmentations and is limited by what the best worker can do. In this paper, we make two contributions: First, we introduce a novel class of aggregation-based methods that incorporates portions of segmentations from multiple workers into a combined one described in Section 4. To our surprise, despite its intuitive simplicity, we have not seen this class of algorithms described or evaluated in prior work. We evaluate this class of algorithms against existing methods in Section 6. Second, our analysis of common worker errors in crowdsourced segmentation shows that workers often segment the wrong objects or erroneously include or exclude large semantically-ambiguous portions of an object in the resulting segmentation. We dis- </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>As shown in Figure <ref type="figure" target="#fig_0">1</ref>, quality evaluation methods for crowdsourced segmentation can be classified into two categories: Retrieval-based methods pick the "best" worker segmentation based on some scoring criteria that evaluates the quality of each segmentation, including vision information <ref type="bibr">(Vittayakorn and Hays 2011;</ref><ref type="bibr" target="#b2">Russakovsky et al. 2015)</ref>, and clickstream behavior <ref type="bibr" target="#b0">(Cabezas et al. 2015;</ref><ref type="bibr" target="#b2">Sameki et al. 2015;</ref><ref type="bibr" target="#b2">Sorokin and Forsyth 2008)</ref>. Aggregation-based methods combine multiple worker segmentations to produce a final segmentation that is not restricted to any single worker segmentation. An aggregationbased majority vote approach was employed in <ref type="bibr" target="#b2">Sameki et al. (2015)</ref> to create an expert-established gold standard for characterizing their dataset and algorithmic accuracies, rather than for segmentation quality evaluation as described here.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Error Analysis</head><p>On collecting and analyzing a number of crowdsourced segmentations (described in Section 6), we found that common worker segmentation errors can be classified into three types:</p><p>(1) Semantic Ambiguity: workers have differing opinions on whether particular regions belong to an object (Figure <ref type="figure" target="#fig_1">2</ref> left: annotations around 'flower and vase' when 'vase' is requested); (2) Semantic Error: workers annotate the wrong object entirely (Figure <ref type="figure" target="#fig_1">2</ref> right: annotations around 'turtle <ref type="bibr">' and 'monitor' when 'computer' is requested.)</ref>; and (3) Boundary Imperfection: workers make unintentional mistakes while drawing the boundaries, either due to low image resolution, small area of the object, or lack of drawing skills (Figure <ref type="figure" target="#fig_3">3</ref> left: imprecision around the 'dog' object).</p><p>Quality evaluation methods in prior work have largely focused on minimizing boundary imperfection issues. So, we first describe our novel aggregation-based algorithms designed to reduce boundary imperfections in Section 4. Next, in Section 5, we discuss a preprocessing method that eliminates semantic ambiguities and errors. We present our experimental evaluation in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Semantic Ambiguity [vase] Semantic Error [computer] Boundary Imprecision [dog]</head><p>ground truth individual workers pointer to semantic object to be segmented </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Fixing Boundary Imperfections</head><p>At the heart of our aggregation techniques is the tile data representation. A tile is the smallest non-overlapping discrete unit created by overlaying all of the workers' segmentations on top of each other. The tile representation allows us to aggregate segmentations from multiple workers, rather than being restricted to a single worker's segmentation, allowing us to fix one worker's errors with help from another. In Figure <ref type="figure" target="#fig_3">3</ref> (left), we display three worker segmentations for a toy example with 6 resulting tiles. Any subset of these tiles can contribute towards the final segmentation. This simple but powerful idea of tiles also allows us to reformulate our problem from one of "generating a segmentation" to a setting that is much more familiar to crowdsourcing researchers. Since tiles are the lowest granularity units created by overlaying all workers' segmentations on top of each other, each tile is either completely contained within or outside a given worker segmentation. Specifically, we can regard a worker segmentation as multiple boolean responses where the worker has voted 'yes' or 'no' to every tile independently. Intuitively, a worker votes 'yes' for every tile that is contained in their segmentation, and 'no' for every tile that is not. As shown in Figure <ref type="figure" target="#fig_3">3</ref> (right), tile t 2 is voted 'yes' by worker 1, 2, and 3; tile t 3 is voted 'yes' by worker 2 and 3. The goal of our aggregation algorithms is to pick an appropriate set of tiles that effectively trades off precision versus recall. Now that we have modeled segmentation as a collection of worker votes for tiles, we can now develop familiar variants of standard quality evaluation algorithms for this setting. Aggregation: Majority Vote Aggregation (MV) This simple algorithm includes a tile in the output segmentation if and only if the tile has 'yes' votes from at least 50% of all workers. Aggregation: Expectation-Maximization (EM) Unlike MV, which assumes that all workers perform uniformly, EM approaches infer the likelihood that a tile is part of the ground truth segmentation, while simultaneously estimating hidden worker qualities. In Section 6 we evaluate an EM variant which assumes that each worker has a (different) fixed probability for a correct vote. Details of this, and more fine-grained variants can be found in our technical report <ref type="bibr" target="#b1">(Lee et al. 2018)</ref>. Aggregation: Greedy Tile Picking (greedy) The greedy algorithm picks tiles in descending order of the tiles' ratios of (estimated) overlap area with the ground truth to (estimated) non-overlap area with ground truth, for as long as the (estimated) Jaccard similarity of the resulting segmentation continues to increase. Intuitively, tiles that have a high overlap area and low non-overlap area contribute to high recall with limited loss of precision. Since tile overlap and non-overlap areas, and Jaccard similarity of segmentations with ground truth are unknown, we use different heuristics to estimate these values. We discuss details of this algorithm and its theoretical guarantees in our technical report.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Perspective Resolution</head><p>As discussed in Section 3, disagreements often arise in segmentation due to differing worker perspectives on large tile regions. We developed a clustering-based preprocessing approach to resolve this issue. Based on the intuition that workers with similar perspectives will have segmentations that are close to each other, we compute the Jaccard similarity between each pair of segmentations and perform spectral clustering to separate the segmentations into clusters. <ref type="bibr">Figure 2 (bottom)</ref> illustrates how spectral clustering divides the worker segmentations into clusters with meaningful semantic associations, reflecting the diversity of perspectives for the same task. Clustering results can be used as a preprocessing step for any quality evaluation algorithm by keeping only the segmentations that belong to the largest cluster, which is typically free of semantic errors.</p><p>In addition, clustering offers the additional benefit of preserving a worker's semantic intentions. For example, while the green cluster in Figure <ref type="figure" target="#fig_1">2</ref> (bottom right) would be considered bad segmentations for the particular task ('computer'), this cluster can provide more data for another segmentation task corresponding to 'monitor'. A potential future work direction would be to crowdsource the semantic labels for the computed clusters to enable the reuse of segmentations across multiple objects to lower costs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>After Clustering</head><p>Colors denote clusters with different worker perspectives. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experimental Evaluation Dataset Description</head><p>We collected crowdsourced segmentations from Amazon Mechanical Turk; each HIT consisted of one segmentation task for a specific pre-labeled object in an image. Workers were compensated $0.05 per task. There were a total of 46 objects in 9 images from the MSCOCO dataset (Lin et al. 2014) segmented by 40 different workers each, resulting in a total of 1840 segmentations. Each task contained a keyword for the object and a pointer indicating the object to be segmented. Two of the authors generated the ground truth segmentations by carefully segmenting the objects using the same interface.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation Metrics</head><p>Evaluation metrics used in our experiments measure how well the final segmentation (S) produced by these algorithms compare against ground truth (GT). We use the Jaccard score Jaccard (J) = U A(S) IA(S) , which accounts for the intersection area, IA = area(S ∩ GT ) and union area, U A = area(S ∪ GT ) between the worker and ground truth segmentations. Experiment 1: Aggregation-based methods perform significantly better than retrieval-based methods Figure <ref type="figure">5</ref>: Performance of the original algorithms that do not make use of ground truth information (Left) and ones that do (Right). Here, the EM result overlaps with MV as they exhibit similar performance. Other diverging variants of EM is described in our technical report.</p><p>In Figure <ref type="figure">5</ref>, we vary the number of worker segmentations along the x-axis and plot the average Jaccard score on the y-axis across different worker samples of a given size across different algorithms. Figure <ref type="figure">5</ref> (left) shows that the performance of aggregation-based algorithms (greedy, EM) exceeds the best achievable through existing retrieval-based methods (Retrieval). Then, in Figure <ref type="figure">5</ref> (right), we estimate the upper-bound performance of each algorithm by assuming that 'full information' based on ground truth is given to the algorithm. For greedy, the algorithm is aware of all the actual tile overlap and non-overlap areas against ground truth. For EM, the true worker quality parameter values (under our worker quality model) are known. For retrieval, the full information version directly picks the worker with the highest Jaccard similarity with respect to the ground truth. By making use of ground truth information (Figure <ref type="figure">5</ref> right), the best aggregation-based algorithm can achieve a close-toperfect average Jaccard score of 0.98 as an upper bound, far exceeding the results achievable by any single 'best' worker (J=0.91). This result demonstrates that aggregation-based methods are able to achieve better performance by performing inference at the tile granularity, which is guaranteed to be finer grained than any individual worker segmentation. The performance of aggregation-based methods scale well as more worker segmentations are added. Intuitively, larger numbers of worker segmentations result in finer granularity tiles for the aggregation-based methods. The first row in Table <ref type="table" target="#tab_0">1</ref> shows the average percentage change in performance between 5-workers and 30-workers samples. We observe that aggregation based methods typically improve in performance with an increase in number of workers, while this is not generally true for retrieval-based methods. Experiment 2: Clustering as preprocessing improves algorithmic performance.</p><p>The second row in Table <ref type="table" target="#tab_0">1</ref> shows the average percentage Jaccard change when clustering preprocessing is used. While clustering generally results in an accuracy increase, since the 'full information' variants are already free of semantic errors, we do not see further improvement for these variants. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and Future Work</head><p>We identified three different types of errors for crowdsourced image segmentation, developed a clustering-based method to capture the semantic diversity caused by differing worker perspectives, and introduced novel aggregation-based methods that produce more accurate segmentations than existing retrieval-based methods.</p><p>Our preliminary studies show that our worker quality models are good indicators of the actual accuracy of worker segmentations. We also observe that the greedy algorithm is capable of achieving close-to-perfect segmentation accuracy with ground truth information. Given the success of aggregationbased methods, including the simple majority vote algorithm, we plan to use our worker quality insights to improve our EM and greedy algorithms. We are also working on using computer vision signals to further improve our algorithms.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Taxonomy of quality evaluation algorithms for crowdsourced segmentation, including existing methods (blue) and our novel algorithms (yellow).</figDesc><graphic coords="1,355.28,162.00,166.95,55.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Examples of common worker errors.</figDesc><graphic coords="2,106.99,74.48,133.71,100.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Retrieval: Number of Control Points (num pts) This algorithm picks the worker segmentation with the largest number of control points around the segmentation boundary (i.e., the most precise drawing) as the output segmentation (Vittayakorn and Hays 2011; Sorokin and Forsyth 2008).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Left: Toy example demonstrating tiles created by three workers' segmentations around an object delineated by the black dotted line. Right: Segmentation boundaries drawn by five workers shown in red. Overlaid segmentation creates a mask where the color indicates the number of workers who voted for the tile region.</figDesc><graphic coords="2,424.87,226.11,117.66,103.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Example image showing clustering performed on the same object from Figure 2 left and middle.</figDesc><graphic coords="3,106.63,61.29,114.87,66.26" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Jaccard percentage change due to worker scaling and clustering. Algorithms with * use ground truth information.</figDesc><table><row><cell></cell><cell cols="3">Retrieval-based Aggregation-based</cell></row><row><cell>Algorithm</cell><cell cols="3">num pts worker* MV EM greedy greedy*</cell></row><row><cell cols="3">Worker Scaling -6.30 2.58</cell><cell>2.12 1.78 2.07 5.38</cell></row><row><cell cols="2">Clustering Effect 5.92</cell><cell>-0.02</cell><cell>2.05 0.03 5.73 0.283</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Material recognition in the wild with the materials in context database</title>
		<author>
			<persName><surname>Bell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Image Processing, ICIP</title>
				<meeting>International Conference on Image Processing, ICIP</meeting>
		<imprint>
			<date type="published" when="2014">2014. 2014. 2015. 2015-Decem. 2015. January 2015</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="98" to="136" />
		</imprint>
	</monogr>
	<note>ACM Trans. on Graphics (SIG-GRAPH)</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Crowdsourcing Image Annotation for Nucleus Detection and Segmentation in Computational Pathology: Evaluating Experts, Automated Methods, and the Crowd</title>
		<author>
			<persName><surname>Irshad</surname></persName>
		</author>
		<idno>stanford.edu:8090/1161/</idno>
	</analytic>
	<monogr>
		<title level="j">Biocomputing</title>
		<imprint>
			<biblScope unit="page" from="294" to="305" />
			<date type="published" when="2014">2014. 2015. 2014. 2018. 2018</date>
		</imprint>
		<respStmt>
			<orgName>Stanford InfoLab</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
	<note>Aggregating crowdsourced image segmentations</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Crowdsourcing control : Moving beyond multiple choice</title>
		<author>
			<persName><forename type="first">Lin</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Quality Assessment for Crowdsourced Object Annotations. Procedings of the British Machine Vision Conference</title>
				<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="1998">2012. 2012. 2014. May 1998. 2015. 2015. 2015. 2008. 2010. 2011. 2010. 2010. 2012</date>
			<biblScope unit="volume">8693</biblScope>
			<biblScope unit="page" from="3570" to="3577" />
		</imprint>
	</monogr>
	<note>CVPR &apos;12</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
