<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anca Dumitrache*</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oana Inel*</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>l.m.aroyog@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Timmermans</string-name>
          <email>b.timmermans@nl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Welty</string-name>
          <email>cawelty@gmail.com</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAS IBM</institution>
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vrije Universiteit Amsterdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Typically, crowdsourcing-based approaches to gathering annotated data use inter-annotator agreement as a measure of quality. However, in many domains there is ambiguity in the data, as well as a multitude of perspectives on the information in the examples. In this paper, we present ongoing work on the CrowdTruth metrics, which capture and interpret inter-annotator disagreement in crowdsourcing. The CrowdTruth metrics model the inter-dependency between the three main components of a crowdsourcing system: worker, input data, and annotation. The goal of the metrics is to capture the degree of ambiguity in each of these three components. The metrics are available online at https://github.com/CrowdTruth/CrowdTruth-core.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods.
Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues
related to the volume of data and the lack of annotators. Typically, these practices use
inter-annotator agreement as a measure of quality. However, this assumption
often creates issues in practice. Previous experiments we performed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] found
that inter-annotator disagreement is usually not captured, either because the
number of annotators is too small to capture the full diversity of opinion, or
because the crowd data is aggregated with metrics that enforce consensus, such
as majority vote. These practices create artificial data that is neither general nor
reflects the ambiguity inherent in the data.
      </p>
      <p>
        To address these issues, we proposed the CrowdTruth [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] method for
crowdsourcing ground truth by harnessing inter-annotator disagreement. We present
an alternative approach for crowdsourcing ground truth data that, instead of
enforcing agreement between annotators, captures the ambiguity inherent in
semantic annotation through the use of disagreement-aware metrics for
aggregating crowdsourcing responses. In this paper, we introduce the second version
of the CrowdTruth metrics: a set of metrics that capture and interpret
inter-annotator disagreement in crowdsourcing annotation tasks. As opposed to the
first version of the metrics, published in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the current version models the
inter-dependency between the three main components of a crowdsourcing system:
worker, input data, and annotation. This update is based on the intuition
that disagreement caused by low-quality workers should not be interpreted as
the data being ambiguous, and likewise that ambiguous input data should not be
interpreted as due to the low quality of the workers.
(* Equal contribution, authors listed alphabetically.)
      </p>
      <p>
        This paper presents the definitions of the CrowdTruth metrics 2.0, together
with the theoretical motivation for the updates relative to the previous version 1.0.
The implementation of the metrics is available in the CrowdTruth
GitHub repository (https://github.com/CrowdTruth/CrowdTruth-core). The 2.0 version of the metrics has already been applied successfully to
a number of use cases, e.g. semantic frame disambiguation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], relation
extraction from sentences [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and topic relevance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the future, we plan to continue
the validation of the metrics through evaluation over different annotation tasks,
comparing the CrowdTruth approach with other disagreement-aware crowd
aggregation methods.
      </p>
    </sec>
    <sec id="sec-2">
      <title>CrowdTruth Methodology</title>
      <p>
        The CrowdTruth methodology consists of a set of quality metrics and best
practices to aggregate inter-annotator agreement such that ambiguity in the task
is preserved. The methodology uses the triangle of disagreement model (based on
the triangle of reference [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) to represent the crowdsourcing system and its three
main components: input media units, workers, and annotations (Figure 1).
The triangle model expresses how ambiguity in any of the corners disseminates
and influences the other components of the triangle. For example, an unclear
sentence or an ambiguous annotation scheme would cause more disagreement
between workers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and thus both need to be accounted for when measuring
the quality of the workers.
      </p>
      <p>The CrowdTruth methodology calculates quality metrics for workers, media
units and annotations. The novel contribution of version 2.0 is that the way
ambiguity propagates between the three components of the crowdsourcing system
is made explicit in the quality formulas of the components. For example,
the quality of a worker is weighted by the quality of the media units the worker
has annotated and the quality of the annotations in the task.</p>
      <p>This section describes the two steps of the CrowdTruth methodology:
1. formalizing the output from crowd tasks into annotation vectors;
2. calculating quality scores over the annotation vectors using disagreement
metrics.</p>
      <sec id="sec-2-1">
        <title>Building the Annotation Vectors</title>
        <p>In order to measure the quality of the crowdsourced data, we need to formalize
crowd annotations into a vector space representation. For closed tasks, the
annotation vector contains the answer options given in the task template, which
the crowd can choose from. For example, the template of a closed task can be
composed of a multiple-choice question, which appears as a list of checkboxes or
radio buttons, thus having a finite list of options to choose from. Figure 2 shows
an example of a closed and an open task, indicating also what the media units
and annotations are in both cases.</p>
        <p>While for closed tasks the number of elements in the annotation vector is
known in advance, for open-ended tasks the number of elements in the annotation
vector can only be determined when all the judgments for a media unit have been
gathered. An example of such a task is highlighting words or word phrases
in a sentence, or an input text field where the workers can introduce keywords.
In this case, the answer space is composed of all the unique keywords from all
the workers that solved that media unit. As a consequence, all the media units
in a closed task have the same answer space, while for open-ended tasks the
answer space differs across media units. Although the answer space
for open-ended tasks is not known from the beginning, it can still be
processed into a finite answer space.</p>
        <p>In the annotation vector, each answer option is a boolean value, showing
whether the worker annotated that answer or not. This allows the annotations
of each worker on a given media unit to be aggregated, resulting in a media
unit vector that represents, for each option, how often it was annotated. Figure 2
shows how the worker and media unit vectors are formed for both a closed and
an open task.</p>
      </sec>
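      <p>To make the vector construction concrete, a minimal Python sketch for a closed task follows (the answer space, judgment data, and helper names are illustrative, not the CrowdTruth-core API):</p>

```python
# Step 1 of the methodology: turn raw judgments on one media unit of a
# closed task into binary worker vectors, then sum them into the media
# unit vector. ANSWER_SPACE and judgments are made-up example data.

ANSWER_SPACE = ["causes", "treats", "prevents", "none"]  # fixed for a closed task

def work_vec(chosen, answer_space=ANSWER_SPACE):
    """Binary annotation vector: 1 where the worker selected that answer."""
    return [1 if a in chosen else 0 for a in answer_space]

def media_unit_vec(worker_vecs):
    """Element-wise sum of all worker vectors on the same media unit."""
    return [sum(col) for col in zip(*worker_vecs)]

judgments = {  # worker id -> set of answers picked on one media unit
    "w1": {"causes"},
    "w2": {"causes", "treats"},
    "w3": {"none"},
}
worker_vecs = {w: work_vec(c) for w, c in judgments.items()}
unit_vec = media_unit_vec(list(worker_vecs.values()))  # [2, 1, 0, 1]
```

      <p>For an open-ended task, the same code applies once the answer space has been assembled from the unique answers of all workers on that media unit.</p>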
      <sec id="sec-2-2">
        <title>Disagreement Metrics</title>
        <p>Using the vector representations, we calculate three core metrics that capture
the media unit quality, worker quality and annotation quality. These
metrics are mutually dependent (e.g. the media unit quality is weighted by the
annotation quality and worker quality), based on the idea from the triangle of
disagreement that ambiguity in any of the corners disseminates and influences
the other components of the triangle. The mutual dependence requires an
iterative dynamic programming approach, calculating the metrics in a loop until
convergence is reached. All the metrics have scores in the [0, 1] interval, with 0
meaning low quality and 1 meaning high quality. Before starting the iterative
dynamic programming approach, the quality metrics are initialized with 1.</p>
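        <p>The iterative computation can be sketched as a simple fixed-point loop. The update rules below are stand-ins to keep the example self-contained; the real updates are the quality formulas of this section:</p>

```python
# Fixed-point iteration over mutually dependent quality scores: start all
# scores at 1 (maximum quality) and re-compute until the summed change
# between iterations drops below the threshold t.

def iterate_until_stable(scores, update, t=1e-6, max_iter=1000):
    for _ in range(max_iter):
        new_scores = update(scores)
        delta = sum(abs(new_scores[k] - scores[k]) for k in scores)
        scores = new_scores
        if delta < t:
            break
    return scores

def toy_update(s):
    # Illustrative mutually dependent updates, NOT the CrowdTruth formulas:
    # each score is pulled toward a value that depends on the other score.
    return {"worker": (0.6 + s["unit"]) / 2,
            "unit": (0.8 + s["worker"]) / 2}

final = iterate_until_stable({"worker": 1.0, "unit": 1.0}, toy_update)
# converges to the fixed point worker = 2/3, unit = 11/15
```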
        <p>To define the CrowdTruth metrics, we introduce the following notation:
- workers(u): all workers that annotated media unit u;
- units(i): all input media units annotated by worker i;
- WorkVec(i, u): the annotations of worker i on media unit u, as a binary vector;
- MediaUnitVec(u) = &#931;_{i &#8712; workers(u)} WorkVec(i, u), for an input media unit u.</p>
        <p>To calculate the agreement between two workers on the same media unit, we
compute the cosine similarity over the two worker vectors. In order to reflect the
dependency of the agreement on the degree of clarity of the annotations, we compute
WCos, the weighted version of the cosine similarity. The Annotation Quality
Score (AQS), which will be described in more detail at the end of the section, is
used as the weight. For open-ended tasks, where annotation quality cannot be
calculated across multiple media units, we consider annotation quality equal to
1 (the maximum value) in all cases. Given two worker vectors vec1 and vec2 on
the same media unit, the formula for the weighted cosine score is:</p>
        <p>WCos(vec1, vec2) = [ &#931;_a vec1(a) &#183; vec2(a) &#183; AQS(a) ] / sqrt( [ &#931;_a vec1(a)&#178; &#183; AQS(a) ] &#183; [ &#931;_a vec2(a)&#178; &#183; AQS(a) ] ), for all annotations a.</p>
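        <p>The weighted cosine can be transcribed directly into Python; here worker vectors are lists of 0/1 values and aqs is a list holding the quality of each annotation (function and variable names are ours, not CrowdTruth-core's):</p>

```python
from math import sqrt

def weighted_cos(vec1, vec2, aqs):
    """Cosine similarity between two worker vectors, with each annotation
    weighted by its Annotation Quality Score (AQS)."""
    num = sum(v1 * v2 * aqs[a] for a, (v1, v2) in enumerate(zip(vec1, vec2)))
    d1 = sum(v * v * aqs[a] for a, v in enumerate(vec1))
    d2 = sum(v * v * aqs[a] for a, v in enumerate(vec2))
    if d1 == 0 or d2 == 0:
        return 0.0  # a worker who selected nothing agrees with no one
    return num / sqrt(d1 * d2)
```

        <p>With all annotation qualities equal to 1, as in open-ended tasks, this reduces to the ordinary cosine similarity; a disagreement that falls entirely on a low-quality annotation is discounted accordingly.</p>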
        <p>The Media Unit Quality Score (UQS) expresses the overall worker
agreement over one media unit. Given an input media unit u, UQS(u) is computed as
the average cosine similarity between all worker vectors, weighted by the worker
quality (WQS) and annotation quality (AQS). Through the weighted average,
workers and annotations with lower quality will have less of an impact on the
final score. The formula used in its calculation is:</p>
        <p>UQS(u) = [ &#931;_{i,j} WorkVecWCos(i, j, u) &#183; WQS(i) &#183; WQS(j) ] / [ &#931;_{i,j} WQS(i) &#183; WQS(j) ], for all i, j &#8712; workers(u), i &#8800; j,</p>
        <p>where WorkVecWCos(i, j, u) = WCos(WorkVec(i, u), WorkVec(j, u)).</p>
        <p>The Worker-Worker Agreement (WWA) for a given worker i measures
the average pairwise agreement between i and all other workers, across all
media units they annotated in common, indicating how closely a worker performs
compared to workers solving the same task. The metric gives an indication as
to whether there are consistently like-minded workers, which is useful for
identifying communities of thought. WWA(i) is the average weighted cosine similarity between
the annotations of worker i and all other workers that have worked on the
same media units as worker i, weighted by the worker and annotation qualities.
Through the weighted average, workers and annotations with lower quality will
have less of an impact on the final score of the given worker:</p>
        <p>WWA(i) = [ &#931;_{j,u} WorkVecWCos(i, j, u) &#183; WQS(j) &#183; UQS(u) ] / [ &#931;_{j,u} WQS(j) &#183; UQS(u) ], for all j &#8712; workers(u), u &#8712; units(i), i &#8800; j.</p>
        <p>The Worker-Media Unit Agreement (WUA) measures the similarity
between the annotations of a worker and the aggregated annotations of the
rest of the workers. In contrast to the WWA, which calculates agreement with
individual workers, WUA calculates the agreement with the consensus over all
workers. WUA(i) is the average weighted cosine similarity between the annotations of
worker i and the aggregated annotations of the media units they have worked on, weighted
by the media unit quality (UQS) and annotation quality (AQS). Through the weighted
average, media units and annotations with lower quality will have less of an
impact on the final score:</p>
        <p>WUA(i) = [ &#931;_{u &#8712; units(i)} WorkUnitWCos(u, i) &#183; UQS(u) ] / [ &#931;_{u &#8712; units(i)} UQS(u) ],</p>
        <p>where WorkUnitWCos(u, i) = WCos(WorkVec(i, u), MediaUnitVec(u) &#8722; WorkVec(i, u)). The Worker Quality Score (WQS) of a worker combines these two agreement measures: WQS(i) = WWA(i) &#183; WUA(i).</p>
        <p>The Annotation Quality Score (AQS) measures the agreement over an
annotation in all media units in which it appears. Therefore, it is only applicable to
closed tasks, where the same annotation set is used for all input media units. It
is based on Pa(i|j), the probability that if a worker j annotates a in a media
unit, worker i will also annotate it:</p>
        <p>Pa(i|j) = [ &#931;_u UQS(u) &#183; WorkVec(i, u)[a] &#183; WorkVec(j, u)[a] ] / [ &#931;_u UQS(u) &#183; WorkVec(j, u)[a] ], for all u &#8712; units(i) &#8745; units(j).</p>
        <p>Given an annotation a, AQS(a) is the weighted average of Pa(i|j) for all
possible pairs of workers i and j. Through the weighted average, input media
units and workers with lower quality will have less of an impact on the final
score of the annotation:</p>
        <p>AQS(a) = [ &#931;_{i,j} WQS(i) &#183; WQS(j) &#183; Pa(i|j) ] / [ &#931;_{i,j} WQS(i) &#183; WQS(j) ].</p>
        <p>The formulas for media unit, worker and annotation quality are all
mutually dependent. To calculate them, we apply an iterative dynamic programming
approach. First, we initialize each quality metric with the score for maximum
quality (i.e. equal to 1). Then we repeatedly re-calculate the quality metrics
until the values stabilize, i.e. until the sum of the variations of all quality
values between consecutive iterations drops below a set threshold t.</p>
        <p>The final metric we calculate is the Media Unit - Annotation Score
(UAS): the degree of clarity with which an annotation is expressed in a unit.
Given an annotation a and a media unit u, UAS(u, a) is the ratio of the number
of workers that picked annotation a over all workers that annotated the unit,
weighted by the worker quality:</p>
        <p>UAS(u, a) = [ &#931;_{i &#8712; workers(u)} WorkVec(i, u)[a] &#183; WQS(i) ] / [ &#931;_{i &#8712; workers(u)} WQS(i) ].</p>
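        <p>As a worked example of the last formula, the following sketch computes UAS for one media unit (the vectors and quality scores are invented for illustration; the names are ours):</p>

```python
# Media Unit - Annotation Score: worker-quality-weighted fraction of the
# workers on unit u who selected annotation a.

def uas(worker_vectors, wqs, a):
    """worker_vectors: worker id -> binary vector; wqs: worker id -> WQS."""
    num = sum(vec[a] * wqs[w] for w, vec in worker_vectors.items())
    den = sum(wqs[w] for w in worker_vectors)
    return num / den

worker_vectors = {"w1": [1, 0], "w2": [1, 1], "w3": [0, 1]}
wq = {"w1": 0.9, "w2": 0.8, "w3": 0.1}
# Annotation 0 is backed by the two high-quality workers, annotation 1
# mostly by the low-quality worker w3, so UAS ranks 0 well above 1.
```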
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we presented ongoing work on the CrowdTruth metrics, which
capture and interpret inter-annotator disagreement in crowdsourcing. Typically,
crowdsourcing-based approaches to gathering annotated data use inter-annotator
agreement as a measure of quality. However, in many domains there is ambiguity
in the data, as well as a multitude of perspectives on the information in the examples.
The CrowdTruth metrics model the inter-dependency between the three main
components of a crowdsourcing system: worker, input data, and annotation.</p>
      <p>
        We have presented the definitions and formulas of several CrowdTruth
metrics, including the three core metrics measuring the quality of workers,
annotations, and input media units. The metrics are based on the idea of the triangle
of disagreement, expressing how ambiguity in any of the corners disseminates
and influences the other components of the triangle. Because of this,
disagreement caused by low-quality workers should not be interpreted as the data being
ambiguous, and likewise ambiguous input data should not be interpreted as
due to the low quality of the workers. The metrics have already been applied
successfully to use cases in topic relevance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], semantic frame disambiguation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and relation extraction from sentences [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The Three Sides of CrowdTruth</article-title>
          .
          <source>Journal of Human Computation</source>
          <volume>1</volume>
          ,
          <issue>31</issue>
          -
          <fpage>34</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Crowd Truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard</article-title>
          .
          <source>Web Science</source>
          <year>2013</year>
          . ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Truth Is a Lie: CrowdTruth and the Seven Myths of Human Annotation</article-title>
          .
          <source>AI Magazine</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <volume>15</volume>
          -
          <fpage>24</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dumitrache</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>False positive and cross-relation signals in distant supervision data</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dumitrache</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Capturing ambiguity in crowdsourcing frame disambiguation</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Inel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khamkham</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristea</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumitrache</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rutjes</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van der Ploeg</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Romaszko</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sips</surname>
          </string-name>
          , R.J.:
          <article-title>CrowdTruth: Machine-human computation framework for harnessing disagreement in gathering annotated data</article-title>
          .
          <source>In: The Semantic Web - ISWC</source>
          <year>2014</year>
          , pp.
          <volume>486</volume>
          -
          <fpage>504</fpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Inel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haralabopoulos</surname>
          </string-name>
          , G.,
          <string-name>
            <surname>Van Gysel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szl&#225;vik</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Studying topical relevance with evidence-based crowdsourcing</article-title>
          . In: To Appear
          <source>in the Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Knowlton</surname>
            ,
            <given-names>J.Q.</given-names>
          </string-name>
          :
          <article-title>On the definition of "picture"</article-title>
          .
          <source>AV Communication Review</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <volume>157</volume>
          -
          <fpage>183</fpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>