<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the reliability of crowdsourced labels via Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Noor Jamaludeen</string-name>
          <email>noor.jamaludeen@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vishnu Unnikrishnan</string-name>
          <email>vishnu.unnikrishnan@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maya S. Sekeran</string-name>
          <email>maya.santhira@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Majed Ali</string-name>
          <email>majed.ali@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Anh Trang</string-name>
          <email>anh1.le@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Myra Spiliopoulou</string-name>
          <email>myra@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Magdeburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Crowdsourcing has recently become a popular solution to overcome the high cost of acquiring labeled datasets. However, the reliability of crowdsourced labels remains a challenge, and many approaches rely on domain experts, who are scarce and expensive. In this work, we propose to use Twitter to acquire labels and to juxtapose them with crowdsourced ones. This allows us to measure annotator reliability. Since annotator expertise may vary depending on content, we propose a new topic-based reliability measurement approach. We compare our model with Kappa Weighted Voting and Majority Voting as baseline methods, and show that our approach performs well and is robust when up to 30% of the annotators are not reliable.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>kappa weighted voting</kwd>
        <kwd>annotator reliability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Building a robust classification model requires a labeled dataset. Crowdsourcing for annotations has been gaining popularity over recent years: platforms such as Amazon Mechanical Turk and CrowdFlower pay people for providing annotations. However, the quality of the annotations still needs to be checked against labels acquired from domain experts. Hence, there is a need to measure annotator reliability.</p>
      <p>We introduce a new approach that collects labels for tweets from Twitter, organizes them by topic, and assesses the reliability of the annotators with respect to the labels they assign to the tweets, taking the topics into account. Our approach takes advantage of the fact that people spend about two hours a day on Social Media platforms, and that the amount of time spent is steadily increasing. On Twitter alone, according to Statista, there are 335 million monthly active users.</p>
      <p>Our contributions are as follows. We propose a new annotation tool for tweet sentiment labeling that capitalizes on the topic-specific expertise of Twitter users. We derive topics from the tweets and use them to derive topic-based reliability scores for the annotators. We use these scores in a weighting scheme for the annotated tweets. This allows us to exploit the fact that an annotator may be more reliable for tweets belonging to a certain topic than for tweets of other topics.</p>
      <p>This work is organized as follows. We next discuss related work on crowdsourcing
and annotator reliability. In Section 3 we present the components of our approach.
Section 4 contains our evaluation framework, which encompasses also a simulator
for annotators. In Section 5 we report on our experiments for various percentages of
unreliable annotators, as generated by our simulator. The last section concludes our
study with a summary and future issues.</p>
      <p>A note on terminology: Throughout this work, we use the terms “instance” and “tweet” interchangeably. We call a user who assigns a “label” to an instance an “annotator”, and call this activity “annotation”.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>There are various approaches to tackling annotator reliability when crowdsourcing labels [1], [2], [3]. However, these studies require domain experts to validate the labels collected from annotators. In [1], Hao et al. model annotator reliability based on cumulative performance, but they do not consider that the same annotator may not keep providing labels over time. In [2], Bhowmick et al. propose a coefficient for measuring annotator reliability when multiple labels can be assigned to an instance; here, we use a single label for every tweet.</p>
      <p>Close to our work is the method of Swanson et al. [4], where annotators who have high agreement with other annotators are given higher reliability scores. In our work, annotators who deliver annotations identical to the inferred labels are assigned high reliability scores over the topics comprised in the annotated tweets. In [5], Pion-Tonachini et al. use Latent Dirichlet Allocation to model the annotators’ expertise over the classes, which play a role analogous to the topics in standard applications of LDA. They define a vote-class relationship to model each annotator’s individual interpretation of the classes given the votes. In our work, we do not limit the annotators’ expertise to the classes; instead, we learn the annotators’ reliability on latent topics modeled over the dataset, which better reflects the real-world setting.</p>
      <p>Furthermore, Pion-Tonachini et al. [5] present CL-LDA-BPE, an extension of their model that incorporates prior knowledge of the annotators’ expertise through a structured Bayesian framework. In contrast, we assume no prior knowledge, and therefore induce the annotators’ expertise from the annotations alone.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Our Approach</title>
      <p>Our goal is to acquire reliable sentiment labels for tweets, using Twitter users as annotators. Our approach towards this goal encompasses the following tasks, depicted in Figure 1 and described in the next subsections: (1) collecting instances and mapping them into topics, (2) ranking instances on consensus among annotators, (3) a topic-based reliability model for the annotators, and (4) a Weighted Voting with Topic-based Reliability Scoring mechanism (WVTRS).</p>
      <sec id="sec-3-1">
        <title>3.1 Collecting instances and mapping them into topics</title>
        <p>For the database of tweets Y (with L denoting the cardinality of Y) we acquire class labels (in our experiments: labels on sentiment) from Twitter: we developed a tool where each y ∈ Y is posted to Twitter as a poll for a period of 7 days, during which users of Twitter can vote for one of the possible labels. The nature of the environment automatically limits users to voting only once. Once the poll has expired, every response to the tweet by a user x is stored as (y, x, vote(y, x)), where vote(y, x) ∈ C and C is the set of classes. The annotators constitute a set X, whose cardinality we denote as M.</p>
        <p>We learn the topics over Y by computing the TF-IDF values for all terms, building an instance-term matrix, and decomposing it into an instance-topic matrix and a topic-term matrix using Non-negative Matrix Factorization (NMF). According to the topic-term matrix, each term is assigned to the topic in which it has its maximum value. When that term occurs in a tweet, we consider this maximum value as the contribution of the corresponding topic to the tweet, and refer to it as TP(y, j). If several terms belonging to the same topic occur in the same tweet, then TP(y, j) is the sum of these terms’ topic maxima. We represent each tweet y as an N-dimensional vector y = ⟨TP(y, 1), TP(y, 2), …, TP(y, N)⟩, where N is the number of topics.</p>
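        <p>As an illustration, the topic-mapping step above can be sketched with scikit-learn. This is a minimal sketch under our own naming, not the authors’ tool: the function name and variables are illustrative.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# Sketch of the topic-mapping step, assuming a list of raw tweet texts.
def tweet_topic_vectors(tweets, n_topics=2):
    vec = TfidfVectorizer()
    X = vec.fit_transform(tweets)              # instance-term matrix (TF-IDF)
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0)
    nmf.fit(X)
    H = nmf.components_                        # topic-term matrix
    term_topic = H.argmax(axis=0)              # topic where each term peaks
    term_peak = H.max(axis=0)                  # that maximum value
    TP = np.zeros((X.shape[0], n_topics))
    rows, cols = X.nonzero()
    for y, t in zip(rows, cols):               # accumulate topic contributions
        TP[y, term_topic[t]] += term_peak[t]
    return TP                                  # TP[y, j] = contribution of topic j in tweet y
```

        <p>Each row of TP is the vector ⟨TP(y, 1), …, TP(y, N)⟩ described above.</p>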
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Ranking instances on consensus</title>
        <p>In most real-life crowdsourcing scenarios without monetary remuneration, it is reasonable to expect that very few users will contribute consistently to the system, so the intensity with which users interact with the system is skewed. It is also possible that some instances receive more votes than others for a variety of reasons (ease of annotation, skewed availability of expertise, etc.). To accommodate this fact, we first sort the tweets on ‘maximum consensus’, and then step through the collected responses one tweet at a time, incrementally updating the annotator reliability (computed as described in the next subsection).</p>
        <p>For tweet y and class label c, let votes(y, c) be the number of annotators who assigned c to y. We assign each tweet the class chosen by majority voting, i.e. mvlabel(y) = argmax_{c ∈ C} votes(y, c). We use this number also to assign a rank to y: we rank the instances in list W on how often they received the class label mvlabel(y). The instance with the largest number of votes takes rank position 1. This is achieved by computing for each y the value max_{c ∈ C} votes(y, c) and sorting the instances accordingly. The rank reflects the agreement of the annotators on the class label selected by majority voting. We consider consensus an indicator of how much the class label of the instance can be trusted, and process high-ranked instances before low-ranked ones when computing annotator reliability (see next subsection).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Topic-based reliability model for annotators</title>
        <p>To distinguish reliable annotators from unreliable ones, we introduce the concept of a reliability score: for each tweet y ∈ W annotated by x, we set agreement(x, y) = 1 if vote(x, y) = inferredLabel(y), and agreement(x, y) = 0 otherwise.</p>
        <p>We then define the reliability score of annotator x over topic j as RS(x, j) = Σ_{y ∈ W ∧ TP(y, j) ≠ 0} agreement(x, y). Each annotator is represented as an N-dimensional vector whose j-th position contains the reliability score for topic j, for j = 1 … N.</p>
        <p>RS(x, j) ∈ [1, n_j + 1], where n_j is the number of tweets comprising topic j. We consider annotator a more reliable than annotator b in topic j if RS(a, j) &gt; RS(b, j), i.e. annotator a provided more annotations identical to the inferred labels than annotator b did for tweets comprising topic j. A high topic-based reliability score indicates high reliability of the annotator for that topic. In the next subsection, we refine the computation of the reliability scores to take the incremental processing of instances into account.</p>
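        <p>A minimal sketch of this preliminary (non-incremental) scoring scheme, assuming dict-based containers of our own choosing:</p>

```python
# Sketch of the preliminary reliability score of subsection 3.3, assuming
# `votes` maps (x, y) to a label, `inferred` maps y to the inferred label,
# and TP maps (y, j) to a topic contribution. Names are illustrative.
def reliability_scores(annotators, topics, votes, inferred, TP):
    RS = {(x, j): 1 for x in annotators for j in topics}  # scores start at 1
    for (x, y), label in votes.items():
        if inferred[y] != label:
            continue                     # agreement(x, y) = 0 adds nothing
        for j in topics:
            if TP.get((y, j), 0) != 0:
                RS[(x, j)] += 1          # agreement(x, y) = 1
    return RS
```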
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Weighted Voting with Topic-based Reliability Scores</title>
        <p>Here we introduce our unsupervised incremental learning approach that applies topic-based weights to the given annotations. The votes are weighted with the annotators’ topic-based reliability scores without considering the different proportions of the topics comprised in a tweet; we only consider the incidence of topics.</p>
        <p>Let W be the list of ranked tweets. The tweets y ∈ W are processed incrementally and the reliability scores are updated along the way. Here we refine the computing scheme of the reliability scores introduced in subsection 3.3: the computation is applied in an incremental mode. We start with the top-1 instance in list W, infer its label using the initial reliability scores, update the reliability scores for the topics comprised in that instance according to its inferred label, then infer the label of the top-2 instance with the updated reliability scores, update the scores again, and so on, until the last instance in W has been processed. The approach consists of the following steps:</p>
        <p>1. Initialize the reliability scores for all (x, j) pairs to 1: RS(x, j, 1) ← 1.</p>
        <p>2. Infer labels for tweets incrementally, starting at the instance at rank 1. Each vote is weighted with the sum of the annotator’s tweet-related topic reliability scores: voteWeight(x, y) ← Σ_{j : TP(y, j) ≠ 0} RS(x, j, t−1).</p>
        <p>3. Aggregate the weights of annotators who provided identical votes by summing them up: classWeight(c, y) ← Σ_{x : vote(x, y) = c} voteWeight(x, y).</p>
        <p>4. Select the class label that collected the highest weight as the label for the tweet: InferredLabel(y) ← argmax_{c ∈ C} classWeight(c, y).</p>
        <p>5. For each annotator who voted identically to the inferred label, increment the tweet-related topic reliability scores by 1: RS(x, j, t) ← RS(x, j, t−1) + 1, whereas the reliability scores of all other annotators remain unchanged: RS(x, j, t) ← RS(x, j, t−1).</p>
        <p>Repeat steps 2 to 5 for the next tweets in the ranked list W, until all tweets in the list are processed.</p>
        <p>The steps for inferring the labels and deriving reliability scores for the annotators are detailed in Algorithm 1.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Algorithm 1: Weighted Voting with Topic-based Reliability Scores (WVTRS)</title>
        <p>INPUT: X: set of annotators; W: list of ranked tweets; J: set of topics; C: set of classes; R: set of topic-based reliability scores; TP(y, j): contribution of topic j in tweet y.</p>
        <p>// Initialize all topic-based reliability scores
for x ∈ X do
  for j ∈ J do
    RS(x, j) ← 1
// Process the ranked tweets
for y ∈ W do
  for c ∈ C do
    classWeight(c, y) ← 0
    for x ∈ X do
      if label(x, y) ≠ 0 ∧ vote(x, y) = c then
        for j ∈ J do
          if TP(y, j) ≠ 0 then
            classWeight(c, y) ← classWeight(c, y) + RS(x, j)
  // choose the class that collected the highest weight as the label of tweet y
  InferredLabel(y) ← argmax_{c ∈ C} classWeight(c, y)
  // update the topic-based reliability scores
  for x ∈ X do
    if vote(x, y) = InferredLabel(y) then
      for j ∈ J do
        if TP(y, j) ≠ 0 then
          RS(x, j) ← RS(x, j) + 1</p>
        <p>OUTPUT: inferred labels and the annotators’ topic-based reliability scores.</p>
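        <p>The WVTRS loop can be sketched in a few lines of Python; this is a simplified illustration under our own data-structure assumptions (dicts for votes and topic contributions), not the authors’ implementation.</p>

```python
# Sketch of the incremental WVTRS inference, assuming `votes` maps
# (x, y) to a class label and TP maps (y, j) to a topic contribution.
def wvtrs(ranked_tweets, annotators, topics, classes, votes, TP):
    RS = {(x, j): 1.0 for x in annotators for j in topics}  # init scores to 1
    inferred = {}
    for y in ranked_tweets:                  # process by decreasing consensus
        class_weight = {c: 0.0 for c in classes}
        for (x, yy), c in votes.items():
            if yy != y:
                continue
            # vote weight: sum of reliability over topics present in y
            w = sum(RS[(x, j)] for j in topics if TP.get((y, j), 0) != 0)
            class_weight[c] += w
        label = max(classes, key=lambda c: class_weight[c])
        inferred[y] = label
        # reward annotators who agreed with the inferred label
        for (x, yy), c in votes.items():
            if yy == y and c == label:
                for j in topics:
                    if TP.get((y, j), 0) != 0:
                        RS[(x, j)] += 1
    return inferred, RS
```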
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation framework</title>
      <p>To evaluate our approach, we use the metrics presented in subsection 4.1. Since we have no ground truth on topic reliability, we built a simulator, described in subsection 4.2.</p>
      <sec id="sec-4-1">
        <title>4.1 Experiment Evaluation Metrics</title>
        <p>As the basis of our evaluation, we consider accuracy, computed as the ratio of correctly labeled tweets to all tweets. We further introduce an error metric that measures the deviation between estimated and true reliability scores:</p>
        <p>ErrorTopicReliabilityScores = √( Σ_{x=1}^{M} Σ_{j=1}^{N} (simRS(x, j) − RS(x, j))² ) / (M · N)</p>
        <p>where the simRS(x, j) are the reliability score values created by the simulator introduced in the next subsection; they serve as ground truth.</p>
        <p>The Kappa weighted voting method [4] and the majority voting baseline do not employ topic reliability scores in the inference process. Therefore, for them we use the preliminary computing scheme of the reliability scores introduced in subsection 3.3, in which the computation is not conducted incrementally.</p>
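        <p>A minimal sketch of this error metric, assuming the estimated and simulated scores are given as M×N nested lists (annotators × topics); the function name is ours.</p>

```python
import math

# Root of the summed squared score deviations, normalized by M * N.
def topic_reliability_error(simRS, RS):
    M, N = len(simRS), len(simRS[0])
    sq_sum = sum((simRS[x][j] - RS[x][j]) ** 2
                 for x in range(M) for j in range(N))
    return math.sqrt(sq_sum) / (M * N)
```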
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Simulation</title>
        <p>Since collecting labeled tweets in a closed setting was not feasible for this study, our experimental setting simulates the annotations that we would otherwise collect via Social Media. We simulate three different types of annotators: reliable, partially reliable and unreliable annotators. Reliable and partially reliable annotators represent humans with good intentions.</p>
        <p>We refer to the reliability accuracy of annotator x over topic j as RA(x, j), where RA(x, j) ∈ [0, 1]. Reliable annotators are more likely to deliver correct labels than partially reliable ones; hence, we assign a high topic-based reliability accuracy RA(x, j) = 0.8 to reliable annotators and a relatively low one, RA(x, j) = 0.05, to partially reliable annotators. For each topic, we generate 75% of the annotators as reliable, while the remaining 25% are assumed to be partially reliable. Unreliable annotators are assumed to always provide wrong labels, with RA(x, j) = 0.0.</p>
        <p>To simulate the likelihood of responding to a tweet, we assume that the number of annotations each annotator provides is a random variable that follows a uniform distribution. In the simulator, we incorporate the different proportions of the topics comprised in a tweet: the probability that annotator x labels tweet y correctly is the average of the topic-based reliability accuracies, weighted with the tweet-topic coefficients, as per the formula below:</p>
        <p>ProbabilityOfCorrectLabel(y, x) = (TP(y, 1) · RA(x, 1) + TP(y, 2) · RA(x, 2) + … + TP(y, N) · RA(x, N)) / (TP(y, 1) + TP(y, 2) + … + TP(y, N))</p>
        <p>For every tweet y and annotator x, annotations are generated according to the likelihood of responding to tweet y and to the probability of correctly labeling it.</p>
        <p>After the simulation of the annotations, we assign to every annotator x a reliability score simRS(x, j) over topic j; these scores serve as ground truth. They are computed in a similar manner to the preliminary computing scheme of the reliability scores introduced in subsection 3.3. However, instead of relying on the inferred labels, the simRS(x, j) are computed from the generated annotations and the true label label(y), known from the ground truth.</p>
        <p>For each tweet y annotated by x, we set simAgreement(x, y) = 1 if vote(x, y) = label(y), and simAgreement(x, y) = 0 otherwise. Then, we compute the reliability score of annotator x over topic j as simRS(x, j) = Σ_{y ∈ Y ∧ TP(y, j) ≠ 0} simAgreement(x, y). Each annotator is represented as an N-dimensional vector whose j-th position contains the reliability score for topic j, for j = 1 … N. simRS(x, j) ∈ [1, n_j + 1], where n_j is the number of tweets comprising topic j.</p>
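        <p>The simulator’s probability of a correct label can be sketched as below, assuming the topic contributions and reliability accuracies of a tweet-annotator pair are given as parallel lists over the N topics; the names are illustrative.</p>

```python
# Weighted average of the topic-based reliability accuracies RA(x, j),
# with the tweet-topic contributions TP(y, j) as weights.
def prob_correct_label(TP_y, RA_x):
    num = sum(tp * ra for tp, ra in zip(TP_y, RA_x))
    den = sum(TP_y)
    return num / den if den else 0.0
```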
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments</title>
      <sec id="sec-5-1">
        <title>5.1 Outline</title>
        <p>We ran several experiments to investigate how the number of labels an annotator provides and the reliability of this annotator affect the quality of a model that classifies the instances on sentiment.</p>
        <p>We run our experiments on the U.S. Airline Sentiment dataset (https://www.kaggle.com/crowdflower/twitter-airline-sentiment), which we denote as A(irline) hereafter. From it, we created three random samples of size 1000, three of size 2500 and three of size 5000 tweets to be annotated. Whenever we report quality in the experiments, we refer to accuracy, averaged over the three samples of the same size.</p>
        <p>We first ran experiments to find the best number of topics for our approach (subsection 5.2). Then, we tested the effect of consensus ranking on the performance of our model (subsection 5.3), assuming 500, 1000 and 2000 annotators.</p>
        <p>To evaluate the robustness of our model we simulated three types of crowds A, B, C, incorporating different percentages of unreliable annotators: 1) Crowd A: 30% of the annotators are unreliable. 2) Crowd B: 10% of the annotators are unreliable. 3) Crowd C: only reliable annotators. We used these crowds to study the effect of retaining the annotators’ reliability scores in the system across many annotation tasks, assuming that a subset of annotators is active and assigns labels for several annotation tasks on the annotation platform (subsection 5.4). To test the effect of learning the annotators’ reliability scores on the performance, we conducted a comparison over two aspects: (1) different numbers of annotators, and (2) different numbers of annotations per annotator (subsection 5.5).</p>
        <p>Finally, in subsection 5.6 we report the overall performance of our model against the baselines, Kappa weighted voting [4] and Majority Voting, for different numbers of tweets and varying numbers of annotators.</p>
        <p>Across all experiments discussed earlier, the accuracy reported is the average
accuracy computed over three disjoint sets of tweets.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Experiment on organizing the tweets into topics</title>
        <p>In this experiment, we study how the number of topics affects the performance. We assume 1000 tweets and 1000 annotators, 10% of whom are assumed to be unreliable. We find that 15 topics modeled over the entire dataset gives the best performance, as shown in Figure 2.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3 Experiment on instances ranking</title>
        <p>In this experiment, we study how the ranking of instances improves the model performance. In the complete absence of prior knowledge about the annotators, their reliability scores are estimated only from the provided annotations. Based on our assumption that the majority is reliable, and since tweets are processed sequentially, we test the impact of processing the tweets that received the highest consensus first. Ranking the instances gives a better estimation of the reliability scores and hence improves the model performance. Detailed results comparing ranked and unranked tweets for 1000 tweets annotated by 500 annotators across the different types of crowds are shown in Table 1.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4 Model performance for constantly active annotators</title>
        <p>To test the performance of the model in this scenario, we simulate the time factor by assuming that annotating five datasets of 1000 tweets each is equivalent to annotating one set of 5000 tweets, and that annotating two datasets of 1000 tweets each is equivalent to annotating one set of 2000 tweets. For every dataset, the annotators labeled four random tweets. As shown in Table 2, the best results are observed when annotators participated in more annotation tasks (i.e., five datasets).</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5 Comparison of performance achieved by different number of annotators annotating randomly varying number of tweets</title>
        <p>Assuming a fixed number of annotations per annotator, the larger the number of participating annotators, the higher the achieved accuracy: the larger group of 1000 annotators outperformed the group of 500 annotators when each annotated four random tweets. However, the group of 500 annotators achieved higher accuracy when annotating more tweets (eight tweets each), and the best performance overall was delivered by the smallest group of annotators (500) labeling the largest dataset of 5000 tweets, according to the results shown in Table 3.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6 Overall Performance</title>
        <p>We compare our approach WVTRS against the baselines Kappa Weighted Voting (KWV) and Majority Voting (MV). The overall results for different numbers of annotators on dataset A are shown in Table 4.</p>
        <p>Across all experiments, our approach outperformed the baselines. The model achieved its best performance when the smallest number of annotators (500) annotated a dataset of 5000 tweets. Thus, the more annotations an annotator delivers, the better the model can estimate that annotator’s reliability scores, and the better the label inference becomes. The model was also robust across different percentages of unreliable annotators and performed better than the Kappa Weighted Voting approach. We evaluated our approach on a dataset whose topics are highly homogeneous; further tests are required to determine whether our model also performs well on more heterogeneous datasets. These results suggest that the proposed WVTRS approach is promising, and they should be complemented with tests on datasets with different levels of topic heterogeneity and more informative topics.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this paper we propose an approach to distinguish between reliable and unreliable annotators over topical areas, and to infer labels through weighted voting with the annotators’ topic-based reliability scores. We believe our approach can be made more accurate by improving the topic modeling step. The limitations of our approach are: 1) The different proportions of the topics comprised in a tweet are treated equally: the votes are weighted with the annotators’ topic-based reliability scores without considering these proportions. Due to the homogeneity of topics in the chosen dataset, the experiments do not manifest the impact of this limitation. 2) Processing the tweets online is not feasible, due to the tweet-ranking step.</p>
      <p>As future work we intend to incorporate prior knowledge about the annotators by crawling their Twitter profiles. We can consider each annotator as a document, then apply topic modeling over both tweets and annotators. Hence, we can measure the similarity between annotators and tweets and weigh the votes given by annotators with these similarities: the higher the similarity between an annotator and a tweet, the more reliable that annotator’s annotation for that tweet is.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work is supported by the German Research Foundation (DFG) under project
OSCAR (Opinion Stream Classification with Ensembles and Active learners).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , “
          <article-title>Active crowdsourcing for annotation</article-title>
          ,” in
          <source>2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)</source>
          ,
          <source>vol. 2</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Basu</surname>
          </string-name>
          , “
          <article-title>An agreement measure for determining interannotator reliability of human judgements on affective text,”</article-title>
          <source>in Proceedings of the Workshop on Human Judgements in Computational Linguistics</source>
          ,
          <source>HumanJudge '08</source>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. P.-Y. Hsueh,
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Sindhwani</surname>
          </string-name>
          , “
          <article-title>Data quality from crowdsourcing: A study of annotation selection criteria</article-title>
          ,
          <source>” Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Swanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lukin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Eisenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Corcoran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          , “
          <article-title>Getting reliable annotations for sarcasm in online dialogues</article-title>
          ,”
          <source>in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)</source>
          , pp.
          <fpage>4250</fpage>
          -
          <lpage>4257</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Pion-Tonachini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Makeig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kreutz-Delgado</surname>
          </string-name>
          , “
          <article-title>Crowd labeling latent Dirichlet allocation</article-title>
          ,”
          <source>Knowledge and Information Systems</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Raykar</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , “
          <article-title>Ranking annotators for crowdsourced labeling tasks</article-title>
          ,”
          <source>NIPS</source>
          , vol.
          <volume>24</volume>
          , pp.
          <fpage>1809</fpage>
          -
          <lpage>1817</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rüger</surname>
          </string-name>
          , “
          <article-title>How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation</article-title>
          ,”
          <source>in Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10</source>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>566</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , “
          <article-title>SemEval-2017 task 4: Sentiment analysis in Twitter</article-title>
          ,”
          <source>in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          , pp.
          <fpage>502</fpage>
          -
          <lpage>518</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuboyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , “
          <article-title>Topic extraction from millions of tweets using singular value decomposition and feature selection</article-title>
          ,”
          <source>in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)</source>
          , pp.
          <fpage>1145</fpage>
          -
          <lpage>1150</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>