<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LimitBias! Measuring Worker Biases in the Crowdsourced Collection of Subjective Judgments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christoph Hube</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besnik Fetahu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ujwal Gadiraju</string-name>
          <email>gadirajug@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bias in Crowdsourcing Data Acquisition</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center, Leibniz Universität Hannover, Appelstrasse 4</institution>
          ,
          <addr-line>Hannover 30167</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Crowdsourcing results acquired for tasks that comprise a subjective component (e.g. opinion detection, sentiment analysis) are affected by the inherent bias of the crowd workers. This leads to weaker and noisier ground-truth data. In this work, we propose an approach for measuring crowd worker bias. We explore worker bias through the example task of bias detection, where we compare workers' opinions with their annotations for specific topics. This is an important first step towards mitigating crowd worker bias in subjective tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Crowdsourcing is one of the most common means of obtaining ground-truth data for
training automated models for a large variety of tasks [
        <xref ref-type="bibr" rid="ref6 ref7 ref9">9, 7, 6</xref>
        ]. In many cases, the
annotations are affected by the subjective nature of the tasks (e.g. opinion detection, sentiment
analysis) or by the biases of the workers themselves. For instance, in tasks such as
determining the political leaning or the biased language of a piece of text, whether
we perceive something as liberal, conservative, or biased depends on several factors,
such as framing and epistemological biases in language and the social and cultural
background of the workers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this work, we aim at understanding and mitigating worker biases in
crowdsourced annotation tasks that are subjective in nature (e.g., political leaning of a
statement, biased language). In particular, we ask how workers’ biases influence their
annotation quality when they are given a set of strict annotation rules, and, with this
setting in mind, how such worker biases can be mitigated in subjective tasks. We propose
an approach for measuring crowd worker bias based
on the example task of labeling statements as either biased or neutral. In addition to
the main task, we ask workers for their personal opinion on each statement’s topic. This
additional information allows us to measure correlations between a worker’s opinion
and their choice of labels. In future work, we will introduce methods for mitigating
the measured bias.
      </p>
      <p>
        [...] environments (i.e., the hardware and software affordances at the disposal of workers) have
also been shown to influence and bias task-related outcomes such as completion time and
work quality [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In other closely related work, Eickhoff has studied the prevalence
of cognitive biases as a source of noise in crowdsourced data curation, annotation and
evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Eickhoff studied the effect size of common cognitive biases such as the
ambiguity effect, anchoring, bandwagon and decoy effect in a typical relevance
judgment task framework. Crowdsourcing tasks are often susceptible to participation biases.
This can be further exacerbated by incentive schemes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Other demographic attributes
can also become a source of biased judgments. American and Indian workers, for
instance, were found to differ in their perceptions of the non-monetary benefits of
participation: Indian workers valued self-improvement benefits, whereas American
workers valued emotional benefits [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>In this work, we aim to disentangle the potential sources of worker bias using the
example task of bias detection. This will be a first holistic approach towards bias
management in crowdsourcing.</p>
      <p>The Case of Subjective Annotations</p>
      <p>
        For many tasks, such as detecting subjective statements in text (i.e., text pieces
reflecting opinions) or the bias and framing issues that are often encountered in political
discourse [
        <xref ref-type="bibr" rid="ref3 ref9">9, 3</xref>
        ], the quality of the ground-truth is crucial.
      </p>
      <p>
        Yano et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] show the impact of crowd worker biases in annotating statements
(without their context), where the labels corresponded to political biases, e.g. very
liberal, very conservative, no bias, etc. Their study shows that crowd workers who
identify themselves as moderates perceive less bias, whereas conservatives perceive more
bias at both ends of the spectrum (very liberal and very conservative). Interestingly, the
distribution of workers is heavily skewed towards moderates. This raises several issues.
First, how can we ensure a balanced representation of workers, which is crucial for
subjective tasks? Second, which judgments are more reliable, given that more
conservative workers tend to perceive statements as more
biased at both ends of the political spectrum?
      </p>
      <p>
        In a similar study, Iyyer et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] showed the impact of the workers on annotating
statements with their corresponding political ideology. In nearly 30% of the cases,
workers annotated statements as containing a bias without
the political leaning (e.g. liberal or conservative) necessarily being clear. While it is
difficult to determine the exact factors that influence workers in such cases, possible
reasons include a lack of domain knowledge, in particular of the stances that
different political ideologies take on a given topic, or the political
leanings of the workers themselves. Such aspects remain largely unexplored, and given
their prevalence, they represent significant quality concerns in ground-truth generation
through crowdsourcing.
      </p>
      <p>In this work, we aim at addressing these unresolved quality concerns of
crowdsourcing for subjective tasks by disentangling all the possible bias factors.</p>
    </sec>
    <sec id="sec-2">
      <title>Measuring Crowd Worker Bias</title>
      <p>In Section 1, we introduced the problem of measuring crowd worker bias for
crowdsourcing tasks that include a subjective component. In this section, we propose an approach
for measuring crowd worker bias for the example task of labeling statements as either
biased or neutral. The same approach can be used for the other subjective tasks mentioned in Section 1.</p>
      <p>
        For the example task we use statements from datasets of subjective and opinionated
statements that have been extracted from Wikipedia [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or ideological books [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We
first create a set of 10 statement groups, with each group containing statements on one
controversial topic from a list of widely discussed controversial topics, e.g. abortion,
capital punishment, feminism. Each statement group contains one main statement that
reflects the central pro/against aspect of the controversy, e.g. “Abortion should be legal”.
In our task design, we use the main statement to determine the worker’s opinion on the
given topic. Furthermore, each group contains four additional opinionated statements from
the dataset on the group’s topic: two statements that support the main statement
and two that oppose it.
      </p>
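      <p>As a minimal sketch, and not the authors’ implementation, a statement group as described above could be represented as follows. The class and field names (Statement, StatementGroup, supports_main, is_balanced) are hypothetical, and the statement texts are placeholders rather than items from the datasets.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Statement:
    """One opinionated statement and its stance towards the main statement."""
    text: str
    supports_main: bool  # True: supports the main statement, False: opposes it


@dataclass
class StatementGroup:
    """A controversial topic, its main statement, and the additional statements."""
    topic: str
    main_statement: str
    statements: List[Statement] = field(default_factory=list)

    def is_balanced(self) -> bool:
        """Check the two supporting / two opposing split described in the text."""
        supporting = sum(1 for s in self.statements if s.supports_main)
        opposing = len(self.statements) - supporting
        return supporting == 2 and opposing == 2


# Placeholder texts only (assumption); real statements would be drawn
# from the datasets cited in the text.
example_group = StatementGroup(
    topic="abortion",
    main_statement="Abortion should be legal",
    statements=[
        Statement("placeholder supporting statement A", supports_main=True),
        Statement("placeholder supporting statement B", supports_main=True),
        Statement("placeholder opposing statement A", supports_main=False),
        Statement("placeholder opposing statement B", supports_main=False),
    ],
)
```
      </preformat>
      <p>Storing the stance of each statement next to its text keeps the later comparison between a worker’s opinion and their labels a simple lookup.</p>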
      <p>
        To accurately measure worker bias, we divide the task into two subtasks. In the first
subtask, we show the worker the opinionated statements. The worker has to label each
statement as either “biased” or “neutral”. We explain the concepts of biased and neutral
wording to the workers and give them guidelines on when to label a statement as biased
or neutral, along with multiple examples for both classes. We additionally provide a third
“I don’t know” option. The task design for the first subtask is depicted in Figure 1.
A similar task design has been used to create a ground truth for the problem of bias
detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In the second subtask, we ask for the worker’s opinion on each topic from the statement
groups. We show the worker the main statement from each group and five options on a
Likert scale ranging from “I strongly agree” to “I strongly disagree”. The task design
for the second subtask is depicted in Figure 2.</p>
      <p>Our hypothesis is that workers who agree with a statement are more likely to label
it as neutral. For example, a worker who agrees that abortion should be illegal is more likely to
label the statement “An abortion is the murder of a human baby embryo or fetus from
the uterus” as neutral. As stated in Section 1, this behavior can negatively influence the
crowdsourcing results of this task, since crowd workers should label according to the
given guidelines and not according to their personal opinion.</p>
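      <p>One way to quantify this hypothesis, given here only as a sketch under stated assumptions, is to compare how often a worker labels statements as neutral when they agree with them and when they disagree. The text does not define a concrete measure, so the function neutral_rate_gap, the numeric Likert coding, the three middle Likert labels, and the toy judgments below are all hypothetical.</p>
      <preformat>
```python
from typing import Dict, List, Optional, Tuple

# Assumed numeric coding of the five Likert options; only the two end points
# are named in the text, the three middle labels are hypothetical.
LIKERT = {
    "I strongly agree": 2,
    "I agree": 1,
    "Neither agree nor disagree": 0,
    "I disagree": -1,
    "I strongly disagree": -2,
}

# One judgment: (topic, statement supports the main statement?, label)
Judgment = Tuple[str, bool, str]


def neutral_rate_gap(
    opinions: Dict[str, str], judgments: List[Judgment]
) -> Optional[float]:
    """Return P(neutral | worker agrees) - P(neutral | worker disagrees).

    A positive value is consistent with the hypothesis. Returns None if
    either of the two rates cannot be computed.
    """
    counts = {True: [0, 0], False: [0, 0]}  # agrees -> [neutral, all labels]
    for topic, supports_main, label in judgments:
        score = LIKERT[opinions[topic]]
        if score == 0 or label not in ("biased", "neutral"):
            continue  # skip undecided opinions and "I don't know" labels
        # The worker agrees with a statement if their opinion on the main
        # statement and the stance of the statement point the same way.
        agrees = (score > 0) == supports_main
        counts[agrees][1] += 1
        if label == "neutral":
            counts[agrees][0] += 1
    if counts[True][1] == 0 or counts[False][1] == 0:
        return None
    return counts[True][0] / counts[True][1] - counts[False][0] / counts[False][1]


# Toy data for one worker who rejects the main statement on abortion.
opinions = {"abortion": "I strongly disagree"}
judgments = [
    ("abortion", False, "neutral"),  # opposing statement, worker agrees with it
    ("abortion", False, "neutral"),
    ("abortion", True, "biased"),    # supporting statement, worker disagrees
    ("abortion", True, "neutral"),
]
print(neutral_rate_gap(opinions, judgments))
```
      </preformat>
      <p>For the toy data, the call prints 0.5: the worker labels both statements they agree with as neutral, but only one of the two they disagree with. Across many workers, such per-worker gaps could be aggregated or replaced by a standard correlation coefficient; that choice is left open here.</p>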
    </sec>
    <sec id="sec-3">
      <title>Future Work</title>
      <p>We introduced an approach for measuring crowd worker bias for crowdsourcing tasks
that include a subjective component. In future work, we plan to develop methods
for mitigating the measured bias. Possible approaches include balancing
judgments between workers of different opinions, making workers aware of their biases
(meta-cognition), and discounting strongly biased crowd workers. Furthermore, we want
to analyze the influence of task design on worker bias.</p>
      <p>Acknowledgments. This work is funded by the ERC Advanced Grant ALEXANDRIA
(grant no. 339233), DESIR (grant no. 31081), and H2020 AFEL project (grant no.
687916).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          .
          <article-title>Cognitive biases in crowdsourcing</article-title>
          .
          <source>In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>162</fpage>
          -
          <lpage>170</lpage>
          . ACM,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arjen P</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Increasing cheat robustness of crowdsourcing tasks</article-title>
          .
          <source>Information retrieval</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Roger</given-names>
            <surname>Fowler</surname>
          </string-name>
          .
          <article-title>Language in the News: Discourse and Ideology in the Press</article-title>
          . Routledge,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ujwal</given-names>
            <surname>Gadiraju</surname>
          </string-name>
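          ,
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Checco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Neha</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and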
          <string-name>
            <given-names>Gianluca</given-names>
            <surname>Demartini</surname>
          </string-name>
          .
          <article-title>Modus operandi of crowd workers: The invisible role of microtask work environments</article-title>
          .
          <source>Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>49</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Ujwal</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jie</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Bozzon</surname>
          </string-name>
          .
          <article-title>Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing</article-title>
          .
          <source>In Proceedings of the 28th ACM Conference on Hypertext and Social Media</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Hube</surname>
          </string-name>
          and
          <string-name>
            <given-names>Besnik</given-names>
            <surname>Fetahu</surname>
          </string-name>
          .
          <article-title>Detecting biased statements in Wikipedia</article-title>
          .
          <source>In Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon, France, April 23-27, 2018</source>
          , pages
          <fpage>1779</fpage>
          -
          <lpage>1786</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Mohit</given-names>
            <surname>Iyyer</surname>
          </string-name>
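          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Enns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jordan</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Philip</given-names>
            <surname>Resnik</surname>
          </string-name>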
          .
          <article-title>Political ideology detection using recursive neural networks</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>1113</fpage>
          -
          <lpage>1122</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Ling</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Wagner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bonnie</given-names>
            <surname>Nardi</surname>
          </string-name>
          .
          <article-title>Not just in it for the money: A qualitative investigation of workers' perceived benefits of micro-task crowdsourcing</article-title>
          .
          <source>In System Sciences (HICSS), 2015 48th Hawaii International Conference on</source>
          , pages
          <fpage>773</fpage>
          -
          <lpage>782</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Dietram A</given-names>
            <surname>Scheufele</surname>
          </string-name>
          .
          <article-title>Framing as a theory of media effects</article-title>
          .
          <source>Journal of communication</source>
          ,
          <volume>49</volume>
          (
          <issue>1</issue>
          ):
          <fpage>103</fpage>
          -
          <lpage>122</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Tae</given-names>
            <surname>Yano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Resnik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Noah A</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Shedding (a thousand points of) light on biased language</article-title>
          .
          <source>In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</source>
          , pages
          <fpage>152</fpage>
          -
          <lpage>158</lpage>
          . Association for Computational Linguistics,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>