<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Crowd Truth: Disagreement Metrics Combined with Worker Behavior Filters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guillermo Soberon</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>l.m.aroyo@cs.vu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Welty</string-name>
          <email>cawelty@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oana Inel</string-name>
          <email>oana.inel@vu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hui Lin</string-name>
          <email>hui.lin2013@nl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manfred Overmeen</string-name>
          <email>manfred.overmeen@nl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Netherlands</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>VU University</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>When crowdsourcing gold standards for NLP tasks, the workers may not reach a consensus on a single correct solution for each task. The goal of Crowd Truth is to embrace such disagreement between individual annotators and harness it as useful information to signal vague or ambiguous examples. Even though the technique relies on disagreement, we also assume that the differing opinions will cluster around the more plausible alternatives. Therefore it is possible to identify workers who systematically disagree, both with the majority opinion and with the rest of their co-workers, as low quality or spam workers. We present in this paper a more detailed formalization of metrics for Crowd Truth in the context of medical relation extraction, and a set of additional filtering techniques that require the workers to briefly justify their answers. These explanation-based techniques are shown to be particularly useful in conjunction with disagreement-based metrics, and achieve 95% accuracy for identifying low quality and spam submissions in crowdsourcing settings where spam is quite high.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>disagreement</kwd>
        <kwd>quality control</kwd>
        <kwd>relation extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The creation of gold standards by expert annotators can be a very slow and
expensive process. In NLP tasks such as relation extraction, annotators have
to deal with the ambiguity of the expressions in the text at different levels,
frequently leading to disagreement between annotators. To overcome this,
detailed guidelines for annotators are developed, in order to handle the different
cases that have been observed, through practice, to generate disagreement.
However, the process of avoiding disagreement has led in many cases to brittleness
and over-generality in the ground truth, making it difficult to transfer annotated
data between domains or to use the results for anything practical.</p>
      <p>
        In comparison with expert-generated ground truth, crowdsourcing gold
standards can be a cheaper and more scalable solution. Crowdsourced gold standards
typically show lower overall scores [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], especially for complex NLP tasks such
as relation extraction, since the workers perform small, simple (micro) tasks and
cannot be relied on to read a long guideline document. Rather than eliciting
an artificial agreement between workers, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we presented "Crowd Truth", a
crowdsourced gold standard technique that, instead of considering the lack of
agreement something to be avoided, treats it as something informative from
which characteristics and features of the annotated content may be inferred.
For instance, high disagreement for a particular sentence may be a sign of
ambiguity in the sentence.
      </p>
      <p>
        As the final Crowd Truth is a by-product of the different contributions of the
members of the crowd, being able to identify and filter possible low quality
contributors is crucial to reduce their impact on the overall quality of the aggregate
result. Most of the existing approaches for detecting low quality contributions
in crowdsourcing tasks are based on the assumption that for each task there is a
single correct answer, enabling distance and clustering metrics to detect outliers
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or using gold units [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], establishing an equivalence between disagreement
with the majority and low quality contributions.
      </p>
      <p>
        For Crowd Truth the initial premise is that there is not only one right answer,
and the diversity of opinions is to be preserved. However, disagreement with the
majority can still be used as a way to distinguish low quality annotators. For each
task, it may be assumed that the workers' answers will be distributed among the
possible options, with the most plausible answers attracting the highest
number of workers, and the improbable answers being given by none or very few
workers. That way, workers whose opinions differ from those of the
majority are likely to find other workers with similar views on the issue. On the
other hand, the answers of workers who complete the task randomly, or without
understanding the task or its content, tend not to be aligned with those of the
rest. Hence, it is possible to filter by identifying those workers who not
only disagree with the majority opinion of the crowd on a per-task basis, but whose
opinions are systematically not shared by many of their peers. The initial
definition of the content-based disagreement metrics was introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to identify
and filter low quality workers in relation extraction tasks, establishing metrics
for the inter-worker agreement and the agreement with the crowd opinion.
      </p>
      <p>While filtering workers by disagreement has been shown to be an effective way of
detecting low quality contributors, achieving high precision, we demonstrate that
it is not sufficient to filter all the existing ones. We have extended the relation
extraction task by asking the workers to provide a written justification for their
answers, and a manual inspection of the results revealed several instances
of badly formed, incomplete or even random-text explanations, which can be
safely attributed to low quality workers or even automated spam bots.</p>
      <p>In order to complement the disagreement filters, we propose several ways to
use the explanations provided by the contributors to implement new low quality
worker filters that extend and complement the recall of the disagreement filters.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Disagreement</title>
        <p>In the absence of a gold standard, different evaluation schemes can be used
for worker quality evaluation. For instance, the results of different workers can be
compared, and the agreement in their responses can be used as a quality estimator.</p>
        <p>
          As is well known [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the frequency of disagreement can be used to estimate
worker error probabilities. In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] the computation of disagreement-based quality estimators for
workers is proposed as part of a set of techniques to
evaluate workers, along with confidence intervals for each of these schemes,
which makes it possible to estimate the "efficiency" of each one of them.
        </p>
        <p>
          A simpler method is proposed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which assumes a "plurality of answers"
for a task, and estimates the quality of a worker based on the number of tasks for
which the worker agrees with "the plurality answer" (i.e., the one given by the majority
of the workers).
        </p>
        <p>While these disagreement-based schemes do not rely on the assumption that
there is only one single answer per task (thus allowing room for disagreement
between workers' responses), they still assume a correlation between
disagreement and low worker quality. Crowd Truth not only allows but fosters
disagreement between the workers, as it is considered informative.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Filtering by explanations</title>
        <p>
          As stated in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], cheaters tend to avoid tasks that involve creativity and abstract
thinking, and even for simple, straightforward tasks, the addition of
non-repetitive elements discourages low quality contributions and automation of the
task. Apart from the dissuasive effect on spammers of introducing these
non-repetitive elements in the task design, our work additionally tries to use them as
a basis for filtering once the task is completed.
        </p>
        <p>
          Previous experience [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] has shown that workers tend to provide good
answers to open-ended questions when those are concrete, and that response length
can be used as an indicator of the participant's engagement in the task.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Crowd Watson</title>
        <p>
          Watson [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is an artificial intelligence system designed by IBM, capable of answering questions posed
in natural language. To build its knowledge base, Watson was
trained on a series of databases, taxonomies, and ontologies of publicly available
data [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Currently, IBM Research aims at adapting the Watson technology for
question-answering in the medical domain. For this, large amounts of training
and evaluation data (ground truth medical text annotations) are needed, and the
traditional ground-truth annotation approach is slow and expensive, and
constrained by the overly restrictive annotation guidelines that are necessary to achieve
good inter-annotator agreement, which result in the aforementioned
over-generalization.
        </p>
        <p>
          The Crowd Watson project [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] implements the Crowd Truth approach to
generate a crowdsourced gold standard for training and evaluation of IBM
Watson NLP components in the medical domain. Complementary to the Crowd Truth
implementation, and within the general Crowd Watson architecture, a gaming
approach for crowdsourcing has been proposed [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as a way to enhance the
engagement of expert annotators.
        </p>
        <p>
          Also within the context of the Crowd Watson project, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] has shown how the
worker metrics initially set up for the medical domain can be adapted to other
domains and tasks, such as event extraction.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Representation</title>
      <p>
        CrowdFlower workers were presented with sentences in which the argument words were
highlighted, along with 12 relations (manually selected from UMLS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) as shown in
Fig 1; they were asked to choose all the relations from the set of 12 that related
the two arguments in the sentence. They were also given the options to indicate
that the argument words were not related in the sentence (NONE), or that the
argument words were related, but not by one of the 12 relations (OTHER). They
were also asked to justify their choices by selecting the words in the sentence
that they believed "signaled" the chosen relations or, in case they chose NONE
or OTHER, by providing the rationale for that decision.
      </p>
      <p>
        Note that the process and the choices involved in setting up the annotation template
are out of scope for this paper. The relation extraction task is part of the larger
crowdsourcing framework, Crowd-Watson, which defines the input text, the templates
and the overall workflow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this paper we focus only on the representation
and analysis of the collected crowdsourced annotations.
      </p>
      <p>The information gathered from the workers is represented using vectors whose
components are all the relations given to the workers (including the choices
for NONE and OTHER). All metrics are computed from three vector types:
1. worker-sentence vector V_{s,i}: the result of a single worker annotating a single
sentence. For each relation that the worker annotated in the sentence, there
is a 1 in the corresponding component, and a 0 otherwise.
2. sentence vector V_s: the vector sum of the worker-sentence vectors for each
sentence, V_s = Σ_i V_{s,i}.
3. relation vector R_i: a unit vector in which only the component for relation i
is 1, and the rest are 0.</p>
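      <p>The three vector types can be sketched as follows; this is an illustrative implementation, in which the relation names and the 14-component layout (12 relations plus NONE and OTHER) are assumptions rather than the exact code used in Crowd-Watson.</p>

```python
# Illustrative construction of the worker-sentence vector V_{s,i}, the
# sentence vector V_s, and the relation vector R_i; names are assumed.
RELATIONS = ["rel_%d" % k for k in range(12)] + ["NONE", "OTHER"]

def worker_sentence_vector(selected):
    """V_{s,i}: 1 in the component of each relation the worker selected."""
    return [1 if r in selected else 0 for r in RELATIONS]

def sentence_vector(worker_vectors):
    """V_s: the component-wise sum of the worker-sentence vectors."""
    return [sum(col) for col in zip(*worker_vectors)]

def relation_vector(relation):
    """R_i: unit vector with a 1 only in the component for `relation`."""
    return [1 if r == relation else 0 for r in RELATIONS]
```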
      <p>We collect two different kinds of information: the annotations and the
explanations of the annotations (i.e., the selected words that signal the chosen
relation, or the rationale for selecting NONE or OTHER).</p>
      <p>We try to identify behaviour that can be associated with low quality workers
from the perspective of these two domains: disagreement metrics rely on the
content of the annotations to identify workers that systematically disagree with
the rest; explanation filters aim at identifying individual behaviours that can be
attributed to spammers or careless workers.</p>
    </sec>
    <sec id="sec-4">
      <title>Disagreement metrics</title>
      <p>
        As with the semiotic triangle [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], there are three parts to understanding a
linguistic expression: the sign, the thought or interpreter, and the referent. We
instrument the crowdsourcing process in three analogous places: the micro-task,
which in the relation extraction case is a sentence; the workers, who interpret
each sentence; and the task semantics, which in the case of relation extraction is the
intended meaning of the relations.
      </p>
      <sec id="sec-4-1">
        <title>Sentence Metrics</title>
        <p>Sentence metrics are intended to measure the quality of sentences for the relation
extraction task. These measures are our primary concern, as we want to provide
the highest quality training data to machine learning systems.
The sentence-relation score is the core Crowd Truth metric for relation extraction;
it can be viewed as the probability that the sentence expresses the relation. It is
measured for each relation on each sentence as the cosine of the unit vector for
the relation with the sentence vector: srs(s, r) = cos(V_s, R_r).</p>
        <p>The relation score is used for training and evaluation of the relation extraction
system. This is a fundamental shift from the traditional approach, in which
sentences are simply labelled as expressing, or not expressing, the relation, and it presents
new challenges for the evaluation metric and especially for training.
Sentence clarity is defined for each sentence as the maximum sentence-relation score
for that sentence: scs(s) = max_r srs(s, r).</p>
        <p>If all the workers selected the same relation for a sentence, the max relation
score will be 1, indicating a clear sentence.</p>
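        <p>A minimal sketch of the two sentence metrics, computing srs(s, r) = cos(V_s, R_r) and scs(s) = max_r srs(s, r) over plain list vectors; the example vectors in the note below are hypothetical.</p>

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-negative count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_relation_score(sentence_vec, r):
    """srs(s, r): cosine of the sentence vector with the unit vector for r."""
    unit = [1 if i == r else 0 for i in range(len(sentence_vec))]
    return cosine(sentence_vec, unit)

def sentence_clarity(sentence_vec):
    """scs(s): the maximum sentence-relation score over all relations."""
    return max(sentence_relation_score(sentence_vec, r)
               for r in range(len(sentence_vec)))
```

        <p>For a unanimous sentence vector such as [15, 0, 0] the clarity is 1, while for a split vector such as [3, 4, 0] it is 0.8, reflecting the disagreement.</p>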
        <p>Sentence clarity is used to weight sentences in the training and evaluation of the
relation extraction system: since annotators have a hard time classifying unclear sentences,
the machine should not be penalized as much for getting them wrong in evaluation,
nor should it treat such training examples as exemplars.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Worker Metrics</title>
        <p>Worker metrics serve primarily to establish worker quality; low quality workers and
spammers should be eliminated, as they contribute only noise to the disagreement
scores, and high quality workers may get paid more as an incentive to return.
We investigated several dimensions of worker quality for the relation extraction
task:
The number of annotations per sentence is a worker metric indicating the
average number of different relations per sentence used by a worker when annotating
a set of sentences. Unambiguous sentences should ideally be annotated with one
relation, and generally speaking each worker interprets a sentence in their own
way, but a worker who consistently annotates individual sentences with multiple
relations usually does not understand the task.</p>
        <p>Worker-worker agreement is the asymmetric pairwise agreement between
two workers across all sentences they annotate in common:
wwa(w_i, w_j) = Σ_{s ∈ S_{i,j}} RelationsInCommon(w_i, w_j, s) / Σ_{s ∈ S_{i,j}} NumAnnotations(w_i, s)
where S_{i,j} is the subset of all sentences S annotated by both workers w_i and
w_j, RelationsInCommon(w_i, w_j, s) is the number of identical annotations
(relations selected) on a sentence between the two workers, and NumAnnotations(w_i, s)
is the number of annotations by a worker on a sentence.</p>
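        <p>The asymmetry of wwa (the denominator counts only w_i's annotations) can be seen in the following sketch; the dictionary layout for the annotations is an assumption made for illustration.</p>

```python
def wwa(annotations, wi, wj):
    """wwa(w_i, w_j): relations in common divided by w_i's annotation
    count, summed over the sentences S_{i,j} annotated by both workers.
    `annotations` is assumed to map worker -> sentence -> set of relations."""
    common = set(annotations[wi]) & set(annotations[wj])
    shared = sum(len(annotations[wi][s] & annotations[wj][s]) for s in common)
    total = sum(len(annotations[wi][s]) for s in common)
    return shared / total if total else 0.0
```

        <p>If w_i selected two relations on a sentence where w_j selected only one of them, wwa(w_i, w_j) counts the shared relation against two annotations while wwa(w_j, w_i) counts it against one, so the two directions differ.</p>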
        <p>Average worker-worker agreement is a worker metric based on the average
worker-worker agreement between a worker and the rest of the workers, weighted by
the number of sentences in common. While we intend to allow disagreement, it
should vary by sentence. Workers who consistently disagree with other workers
usually do not understand the task:</p>
        <p>avg_wwa(w_i) = Σ_{j ≠ i} |S_{i,j}| · wwa(w_i, w_j) / Σ_{j ≠ i} |S_{i,j}|
Worker-sentence similarity is the vector cosine similarity between the
annotations of a worker and the aggregated annotations of the other workers on a
sentence, reflecting how close the relation(s) chosen by the worker are to the opinion
of the majority for that sentence. This is simply wss(w_i, s) = cos(V_s − V_{s,i}, V_{s,i}).
Worker-sentence disagreement is a measure of the quality of the annotations
of a worker for a sentence. It is defined, for each sentence and worker, as the
difference between the sentence clarity (q.v. above) for the sentence and the
worker-sentence similarity for that sentence: wsd(w_i, s) = scs(s) − wss(w_i, s).
Workers who differ drastically from the most popular choices will have large
disagreement scores, while workers who agree with the most popular choice will score
0.</p>
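        <p>Worker-sentence similarity and disagreement can be sketched as follows, reusing a cosine over plain list vectors; the vectors in the comments are hypothetical.</p>

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def wss(sentence_vec, worker_vec):
    """wss(w_i, s) = cos(V_s - V_{s,i}, V_{s,i}): similarity of the worker's
    annotations to the aggregated annotations of the other workers."""
    rest = [a - b for a, b in zip(sentence_vec, worker_vec)]
    return cosine(rest, worker_vec)

def wsd(clarity, sentence_vec, worker_vec):
    """wsd(w_i, s) = scs(s) - wss(w_i, s): 0 for a worker who agrees with
    the most popular choice, large for a worker who differs drastically."""
    return clarity - wss(sentence_vec, worker_vec)
```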
        <p>
          The intuition for using the difference from the clarity score rather than the raw cosine
similarity, as originally proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], is to capture worker quality on a sentence
relative to the quality of the sentence itself. In uni-modal cases, e.g. where a
sentence has one clear majority interpretation, the cosine similarity works well,
but where a sentence has a bimodal distribution, e.g. multiple popular
interpretations, the worker's cosine similarity will not be very high even for workers
who agree with one of the two most popular interpretations, which seems less
desirable.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Average worker-sentence disagreement</title>
        <p>Average worker-sentence disagreement is a worker metric based on the
average worker-sentence disagreement score across all sentences, avg_wsd(w_i) =
Σ_{s ∈ S_i} wsd(w_i, s) / |S_i|, where S_i is the subset of all sentences annotated by worker w_i.</p>
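        <p>The two averaged worker metrics can be sketched as follows, assuming the per-pair and per-sentence scores have already been computed.</p>

```python
def avg_wwa(pair_scores):
    """avg_wwa(w_i): mean of wwa(w_i, w_j) over all other workers w_j,
    weighted by |S_{i,j}|, the number of sentences in common.
    pair_scores: list of (wwa_value, sentences_in_common) pairs."""
    total = sum(n for _, n in pair_scores)
    return sum(v * n for v, n in pair_scores) / total if total else 0.0

def avg_wsd(wsd_scores):
    """avg_wsd(w_i): mean of wsd(w_i, s) over the sentences S_i
    annotated by worker w_i."""
    return sum(wsd_scores) / len(wsd_scores) if wsd_scores else 0.0
```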
        <p>The worker-worker and worker-sentence scores are clearly similar, as they both
measure deviation from the crowd, but they differ in emphasis. The wsd metric
simply measures the average divergence of a worker from the crowd on a per-sentence
basis; someone who tends to disagree with the majority will have a high disagreement score.
For wwa, workers who may not always agree with the crowd on a per-sentence basis
might still be found to agree with a group of people that disagree with the crowd in
a similar way, and would thus keep a reasonable agreement score. This could reflect different cultural
or educational perspectives, as opposed to simply a low quality worker.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Relation Metrics</title>
        <p>Relation clarity is defined for each relation as the maximum sentence-relation score
for the relation over all sentences:
rcs(r) = max_s srs(s, r)</p>
        <p>If a relation has a high clarity score, it means that it is at least possible to
express the relation clearly. We find in our experiments that many relations
that exist in structured sources are very difficult to express clearly in language,
and are not frequently present in textual sources. Unclear relations may indicate
unattainable learning tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Explanation filters</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we showed results using the worker metrics to detect low quality workers.
In order to evaluate our results, we had workers justify their answers. The
explanations of the annotation tasks are not strictly necessary for the crowd truth
data, and represent additional time and therefore cost to gather. In this section
we analyze the value of this information.
      </p>
      <p>We examined whether this additional effort dissuaded workers from
completing the task. Two different implications are to be distinguished here: a
positive one, driving away low quality workers or spammers, whose
main objective is to maximize their economic reward with the minimum possible
effort; and a negative one, as it may induce some good contributors to choose
easier, less demanding tasks. In order to prevent the latter, it might be necessary to
increase the economic reward to make up for the extra effort, so, in the end, the
addition of explanations implies an increase in the task price. Finally, we
want to test whether the explanations, apart from preventing low quality
workers from completing the task, may contain information that is useful for detecting
low quality workers.</p>
      <p>Apart from the presence of explanations, another variable to take into
account for spam detection is the channel of the workers. CrowdFlower has over 50
labor channel partners, or external pools of workers, such as Amazon Mechanical
Turk and TrialPay, which can be used (individually or combined) to run
crowdsourcing processes. Our intuition was that different channels have different spam
control mechanisms, which may result in different spammer ratios, depending
on the channel.</p>
      <p>To explore these variables, we set up an experiment to annotate the same 35
sentences under different configurations:
1. Without explanations, using workers from multiple CrowdFlower channels
2. Without explanations, using workers from Amazon Mechanical Turk (AMT)
3. With explanations, using workers from multiple CrowdFlower channels
4. With explanations, using workers from AMT</p>
      <p>Note that AMT was among the multiple channels used in configurations 1 and 3, but
the AMT workers were a minority there.</p>
      <p>By comparing the pairs formed by configurations 1 and 2, and 3 and 4, we can test whether
the channel has any influence on the low quality worker ratio. Likewise, the
pairs formed by 1 and 3, and 2 and 4, can be used to test the influence of the
explanations, independently of the channel used.</p>
      <p>We collected 18 judgments per sentence (for a total of 2522 judgements),
and workers were allowed to annotate a maximum of 10 different sentences.
The number of unique workers per batch ranged between 67 and 77.</p>
      <p>In the results we observed that the time to run the task using multiple
channels was significantly lower than doing so only on AMT, independently of
whether the explanations were required or not. The time invested in annotating
a sentence of the batch was substantially lower, on average, when explanations
were not required.</p>
      <p>The number of workers labelled as possible low quality workers by the
disagreement filters was low, and remained more or less within the same range for
the four batches (between 6 and 9 filtered workers per batch), so we cannot infer
whether including explanations discourages low quality workers from taking part in
the task.</p>
      <p>However, manual exploration of the annotations revealed four patterns that
may be indicative of possible spam behaviour:
1. No valid words (No Valid in Table 1) were used, either in the explanation
or in the selected words, with random text or characters used instead.
2. Using the same text for both the explanation and the selected words
(Rep Resp in Table 1). According to the task definition, both fields are
exclusive: either the explanation or the selected words that indicate the
rationale of the decision are to be provided, so filling in both may be due to a bad
understanding of the task definitions. Also, both serve semantically different
purposes, so it is unlikely that the same text is applicable to both.
3. Workers who repeated the same text (Rep Text in Table 1) for all their
annotations, either justifying their choice using the exact same words or
selecting the same words from the sentence.</p>
      <sec id="sec-5-1">
        <title>4. [NONE] and [OTHER] used with other relations (None/Other in Table 1)</title>
        <p>None and Other are intended to be exclusive: according to the task
definition, by selecting them the annotator is stating that none of the other
relations is applicable to the sentence. Hence, it is semantically incorrect
to choose [NONE] or [OTHER] in combination with other relations, and
doing so may reflect a bad understanding of the task definition.</p>
        <p>The degree to which these patterns may indicate spam behaviour differs:
in most cases, "No valid words" is a strong indicator of a low quality worker, while
a bad use of [NONE] or [OTHER] may be the reflection of a misunderstanding
of the task (i.e. when one text box should be filled in and when the other), rather
than of a bad worker.</p>
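        <p>The four patterns can be approximated with simple checks such as the following; the heuristics (in particular the crude "valid word" test) are illustrative assumptions, not the exact rules we applied.</p>

```python
import re

def no_valid_words(text):
    """No Valid: crude proxy flagging text without any alphabetic run of
    three or more characters (e.g. random punctuation or digits)."""
    return re.search(r"[A-Za-z]{3,}", text) is None

def repeated_response(explanation, selected_words):
    """Rep Resp: the same non-empty text in both mutually exclusive fields."""
    return explanation.strip() != "" and explanation.strip() == selected_words.strip()

def repeated_text(responses):
    """Rep Text: the worker gave the identical text for all annotations."""
    return len(responses) > 1 and len({r.strip() for r in responses}) == 1

def none_other_misuse(relations):
    """None/Other: NONE or OTHER selected together with other relations."""
    return bool({"NONE", "OTHER"} & set(relations)) and len(relations) > 1
```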
        <sec id="sec-5-1-1">
          <title>Table 1. Results from 35 sentences with explanation-based filters</title>
          <p>Rows are channels (multiple CrowdFlower channels vs. AMT). Columns give the number of workers flagged by the disagreement filters; for each explanation filter, the number of workers flagged and, in parentheses, the percentage of overlap with the disagreement filters; and the number of workers matching at least one explanation pattern but not flagged by the disagreement filters:
Multiple: Disagreement 9; None/Other 7 (29%); Rep Resp 14 (29%); Rep Text 3 (33%); No Valid 11 (36%); explanation-only 18.
AMT: Disagreement 6; None/Other 9 (22%); Rep Resp 2 (0%); Rep Text 2 (50%); No Valid 1 (0%); explanation-only 11.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Discussion of Table 1</title>
          <p>Table 1 contains an overview of the number of occurrences of each of the previous patterns
in the batches with explanations. For each pattern, the percentage of workers also
identified as low quality workers by the disagreement filters is indicated. This percentage, together with
the last column, which indicates the number of workers for which at least one
of the low quality patterns has been observed but who are not labelled as low
quality by the disagreement filters, shows that there is little overlap between these
patterns and what the disagreement filters consider low quality "behaviour".
Therefore, it seems reasonable to further explore the use of these patterns as
"explanation filters" for low quality workers. Also, the number of potential low
quality workers according to the spam patterns seems bigger when the task is
run on multiple channels rather than only on AMT. This observation cannot be
considered conclusive, but it seems reasonable to explore it further.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>We designed a series of experiments to gather evidence in support of our
hypothesis that the disagreement filters may not be sufficient and that the explanations
can be used to implement additional filters to improve spam detection.</p>
      <sec id="sec-6-1">
        <title>Data</title>
        <p>The data for the main experiments consist of two different sets of 90 sentences.
The first set (Experiment 2 or EXP2) is annotated only by workers from Amazon
Mechanical Turk (AMT), and the second (Experiment 3 or EXP3) is annotated
by workers from multiple channels among those offered by CrowdFlower
(including AMT, though the AMT workers were a minority).</p>
        <p>To optimize the time and worker dynamics we split the 90-sentence sets into
batches of 30 sentences. The batches of the first set were run on three different
days, and the batches of the second were all run on the same day. Workers were
not allowed to annotate more than 10 sentences in the first set, and no more than
15 in the second. We collected 450 judgments (15 per sentence) in each batch
(for a total of 1350 per set), from 143 unique workers in the first set and 144 in
the second.</p>
        <p>From our previous experience, judgements from workers who annotated two
or fewer sentences were uninformative, so we have removed these, leaving 110
and 93 workers and a total of 1292 and 1302 judgements in each set.</p>
        <p>We manually went through the data and identified low quality workers
from their answers. 12 workers (out of 110) were identified as low quality workers
for EXP2 and 20 (out of 93) for EXP3. While all the spammers in EXP2 were
identified as such by the disagreement filters, only half of the low quality workers
in EXP3 were detected.</p>
        <p>It is also important to notice that the number of annotations by workers
identified as spammers is much higher for EXP3 (386 out of 1291, 30%) than for
EXP2 (139 out of 1302, 11%).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Filtering low quality workers</title>
        <p>In this section, we address our hypotheses by, first, describing the disagreement
performance for EXP3, showing how it is not sufficient by itself; and, second,
showing how the explanation filters are informative and disjoint from the
disagreement filters (they indicate something, and that "something" is different
from what disagreement points to).</p>
        <p>A sense of the performance of the different disagreement metrics in detecting low quality workers
is given in Figures 2 and 3. Each metric is plotted against overall accuracy at
different confidence thresholds, for each experiment. Clearly, the overall accuracy
of the disagreement metrics is lower for EXP3. While it is possible to achieve
100% accuracy for EXP2 by linearly combining the disagreement metrics,
only 89% is achieved for EXP3 by this means.</p>
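        <p>A linear combination of metrics against a threshold can be sketched as below; the metric names, weights and threshold are illustrative placeholders, not the values fitted in our experiments.</p>

```python
def spam_score(metrics, weights):
    """Weighted sum of worker metrics, oriented so that higher means
    more spam-like (e.g. avg_wsd, or 1 - avg_wwa)."""
    return sum(weights[name] * metrics[name] for name in weights)

def is_low_quality(metrics, weights, threshold):
    """Flag a worker when the combined score reaches the threshold."""
    return spam_score(metrics, weights) >= threshold
```

        <p>Accuracy can then be evaluated against the manually labelled workers at different thresholds, as in Figures 2 and 3.</p>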
        <p>In order to make up for this, we analyzed the explanation filters, exploring
whether they provide some information about possible spammer behaviour that
is not already contained in the disagreement metrics. The explanation filters
are not very effective by themselves: their recall is quite low (in all cases,
below 0.6), and it is not substantially improved by combining them.</p>
        <p>Tables 2 and 3 present an overview of the workers identified as possible
spammers by each filter, reflecting the intersections and differences between the
disagreement filters and the explanation filters.</p>
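        <p>The intersections and differences summarized in those tables amount to simple set operations over the workers flagged by each filter family; a minimal sketch, with made-up worker ids:</p>
        <preformat>
```python
# Minimal sketch of the overlap analysis behind Tables 2 and 3: given the
# sets of workers flagged by each filter family, compute their intersection
# and exclusive differences. Worker ids are made up for the example.

disagreement_flagged = {"w1", "w2", "w3"}
explanation_flagged = {"w3", "w4", "w5"}

# Workers flagged by both filter families.
overlap = disagreement_flagged.intersection(explanation_flagged)
# Workers caught exclusively by one family.
only_explanation = explanation_flagged.difference(disagreement_flagged)
only_disagreement = disagreement_flagged.difference(explanation_flagged)
```
        </preformat>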
        <p>Note that we analyze the experiments both on a "job" basis and on an
aggregate "experiment" basis. This shows how homogeneous the jobs are (for
instance, that no single job is clearly biased in one particular batch, thereby
biasing the aggregated experiment). For filtering purposes, however, we treat
the experiments as atomic units.</p>
        <p>It can be observed that the overlap (i.e., the number of workers identified
as possible spammers by two different filters) between the disagreement filters
and each of the explanation filters is not significant.</p>
        <p>On the other hand, the number of workers identified as possible spammers
exclusively by the explanation filters is quite large for EXP3: it is higher not
only than for EXP2, but also than the number of workers filtered by the
disagreement filters. This is consistent with the manual identification of spammers,
which revealed 26 spammers.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results and future work</title>
      <p>By linearly combining the filters, we obtained a classifier with 95% accuracy
and an F-measure of 0.88, improving on disagreement-only filtering (88% accuracy
and F-measure 0.66) for EXP3. More data is needed to improve and rigorously
validate this approach, but these initial results are already promising.</p>
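      <p>The accuracy and F-measure figures reported here can be computed as follows. The sketch evaluates a predicted spammer set against the manually identified (gold) spammers; the worker ids and labels are toy values, not the experimental data.</p>
      <preformat>
```python
# Sketch of the evaluation metrics used above: accuracy and F-measure of a
# predicted spammer set against manually identified (gold) spammers.
# Worker ids and labels are toy values, not the experimental data.

def accuracy_and_f1(predicted, gold, all_workers):
    tp = len(predicted.intersection(gold))    # correctly flagged spammers
    fp = len(predicted.difference(gold))      # wrongly flagged workers
    fn = len(gold.difference(predicted))      # missed spammers
    tn = len(all_workers.difference(predicted).difference(gold))
    accuracy = (tp + tn) / len(all_workers)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

all_workers = {"w%d" % i for i in range(10)}
gold = {"w1", "w2", "w3"}
predicted = {"w1", "w2", "w4"}
acc, f1 = accuracy_and_f1(predicted, gold, all_workers)
```
      </preformat>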
      <p>This linear combination of filters serves the purpose of complementing the
disagreement filters with the explanation filters. In future work, we will explore
different ways of combining these filters to improve quality, such as bagging.</p>
      <p>In the current implementation, we have ignored the differences in the
predictive power of the individual explanation filters, although it can reasonably
be assumed that they are not equally good indicators of spam behavior. A
boosting approach is also worth considering as an improvement.</p>
      <p>Disagreement filters may also be complemented by other kinds of
information. For instance, in EXP3 the workers completing the task come from
different channels. In future work, we will explore whether worker provenance
is a significant signal for low quality detection.</p>
      <p>While the sentence and worker metrics have proven informative, the
available data is not sufficient to reach similar conclusions for the relation metrics,
as the different relations are unevenly represented. We will collect more data in
order to further explore these metrics.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>We presented formalizations of sentence and worker metrics for Crowd Truth,
and showed how the worker metrics can be used to detect low quality workers.
We then introduced a set of explanation-based filters based on workers'
justifications of their answers, and ran experiments on various crowdsourcing
"channels".</p>
      <p>The experiments indicate that, in the presence of a small number of low
quality annotations, disagreement filters are sufficient to preserve data quality.
In the presence of a larger number of low quality annotations, however, the
effectiveness of the disagreement filters diminishes, and they are not enough to
detect all the low quality contributions.</p>
      <p>We have shown how the explanations workers provide for their answers
can be used to identify patterns that can reasonably be associated with spamming
or low quality annotation behavior. We used these patterns combined with the
worker metrics to detect low quality workers with 95% accuracy in a small
cross-validation experiment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Lora</given-names>
            <surname>Aroyo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Welty</surname>
          </string-name>
          .
          <article-title>Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard</article-title>
          .
          <source>In Proc. Websci</source>
          <year>2013</year>
          . ACM Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <article-title>The unified medical language system (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>32</volume>
          (
          <issue>suppl 1</issue>
          ):D267–
          <fpage>D270</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Cohen</surname>
          </string-name>
          et al.
          <article-title>A coefficient of agreement for nominal scales</article-title>
          .
          <source>Educational and psychological measurement</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <volume>37</volume>
          –
          <fpage>46</fpage>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Alexander Philip</given-names>
            <surname>Dawid</surname>
          </string-name>
          and
          <string-name>
            <given-names>Allan M.</given-names>
            <surname>Skene</surname>
          </string-name>
          .
          <article-title>Maximum likelihood estimation of observer error-rates using the EM algorithm</article-title>
          .
          <source>Applied Statistics</source>
          , pages
          <volume>20</volume>
          –
          <fpage>28</fpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dumitrache</surname>
          </string-name>
          , Lora Aroyo, Chris Welty, and
          <string-name>
            <surname>Robert-Jan Sips</surname>
          </string-name>
          .
          <article-title>Dr. Detective: combining gamification techniques and crowdsourcing to create a gold standard for the medical domain</article-title>
          .
          <source>Technical report</source>
          , VU University Amsterdam,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arjen P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Increasing cheat robustness of crowdsourcing tasks</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <volume>121</volume>
          –
          <fpage>137</fpage>
          ,
          April
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          ,
          <string-name>
            <surname>Eric Brown</surname>
          </string-name>
          , Jennifer Chu-Carroll,
          <string-name>
            <given-names>James</given-names>
            <surname>Fan</surname>
          </string-name>
          , David Gondek,
          <string-name>
            <given-names>Aditya A.</given-names>
            <surname>Kalyanpur</surname>
          </string-name>
          , Adam Lally,
          <string-name>
            <given-names>J. William</given-names>
            <surname>Murdock</surname>
          </string-name>
          , Eric Nyberg, John Prager, Nico Schlaefer, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Welty</surname>
          </string-name>
          .
          <article-title>Building Watson: An overview of the DeepQA project</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          :
          <fpage>59</fpage>
          –
          <fpage>79</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Oana</given-names>
            <surname>Inel</surname>
          </string-name>
          , Lora Aroyo, Chris Welty, and
          <string-name>
            <surname>Robert-Jan Sips</surname>
          </string-name>
          .
          <article-title>Exploiting Crowdsourcing Disagreement with Various Domain-Independent Quality Measures</article-title>
          .
          <source>Technical report</source>
          , VU University Amsterdam,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Manas</given-names>
            <surname>Joglekar</surname>
          </string-name>
          , Hector Garcia-Molina, and Aditya Parameswaran.
          <article-title>Evaluating the crowd with confidence</article-title>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Kalyanpur</surname>
          </string-name>
          , BK Boguraev, Siddharth Patwardhan,
          <string-name>
            <given-names>J William</given-names>
            <surname>Murdock</surname>
          </string-name>
          , Adam Lally, Chris Welty, John M Prager,
          <string-name>
            <given-names>Bonaventura</given-names>
            <surname>Coppola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Achille</given-names>
            <surname>Fokoue-Nkoutche</surname>
          </string-name>
          , Lei Zhang, et al.
          <article-title>Structured data and inference in DeepQA</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>56</volume>
          (
          <issue>3</issue>
          .4):
          <volume>10</volume>
          –
          <fpage>1</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Hui</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Crowd Watson: Crowdsourced Text Annotations</article-title>
          .
          <source>Technical report</source>
          , VU University Amsterdam,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>C.</given-names>
            <surname>Marshall</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Shipman</surname>
          </string-name>
          .
          <article-title>Experiences surveying the crowd: Reflections on methods, participation, and reliability</article-title>
          .
          <source>In Proc. Websci</source>
          <year>2013</year>
          . ACM Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Ogden</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Richards</surname>
          </string-name>
          .
          <article-title>The meaning of meaning: A study of the influence of language upon thought and of the science of symbolism</article-title>
          . 8th ed.
          <year>1923</year>
          . Reprint New York: Harcourt Brace Jovanovich,
          <year>1923</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Vikas C.</given-names>
            <surname>Raykar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shipeng</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Eliminating spammers and ranking annotators for crowdsourced labeling tasks</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>13</volume>
          :
          <fpage>491</fpage>
          –
          <fpage>518</fpage>
          ,
          March
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Sarasua</surname>
          </string-name>
          , Elena Simperl, and Natalya Fridman Noy.
          <article-title>CrowdMap: Crowdsourcing ontology alignment with microtasks</article-title>
          .
          <source>In International Semantic Web Conference (1)</source>
          , pages
          <fpage>525</fpage>
          –
          <fpage>541</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Petros</given-names>
            <surname>Venetis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Quality control for comparison microtasks</article-title>
          .
          <source>In Proceedings of the First International Workshop on Crowdsourcing and Data Mining</source>
          , pages
          <volume>15</volume>
          –
          <fpage>21</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>