<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the reliability of crowdsourced labels via Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Noor Jamaludeen</string-name>
          <email>noor.jamaludeen@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vishnu Unnikrishnan</string-name>
          <email>vishnu.unnikrishnan@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maya S. Sekeran</string-name>
          <email>maya.santhira@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Majed Ali</string-name>
          <email>majed.ali@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Anh Trang</string-name>
          <email>anh1.le@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Myra Spiliopoulou</string-name>
          <email>myra@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Magdeburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Crowdsourcing has recently become a popular solution to overcome the high cost of acquiring labeled datasets. However, the reliability of crowdsourced labels remains a challenge, and many approaches rely on domain experts, who are scarce and expensive. In this work, we propose to use Twitter to acquire labels and to juxtapose them with crowdsourced ones. This allows us to measure annotator reliability. Since annotator expertise may vary depending on content, we propose a new topic-based reliability measurement approach. We compare our model with Kappa Weighted Voting and Majority Voting as baseline methods, and show that our approach performs well and is robust when up to 30% of the annotators are not reliable.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>kappa weighted voting</kwd>
        <kwd>annotator reliability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Building a robust classification model requires a labeled dataset. Crowdsourcing for annotations has been gaining popularity over recent years: platforms such as Amazon Mechanical Turk and CrowdFlower pay people for providing annotations. However, the quality of the annotations still needs to be checked against labels acquired from domain experts. Hence, there is a need to measure annotator reliability.</p>
      <p>We introduce a new approach that collects labels for tweets from Twitter, organizes them by topic, and assesses the reliability of the annotators with respect to the labels they assign to the tweets, taking the topics into account. Our approach takes advantage of the fact that people spend about two hours a day on Social Media platforms, and that the amount of time spent is steadily increasing. On Twitter alone, according to Statista, there are 335 million monthly active users.</p>
      <p>Our contributions are as follows. We propose a new annotation tool for tweet sentiment labeling that capitalizes on the topic-specific expertise of Twitter users. We derive topics from the tweets and use them to derive topic-based reliability scores for the annotators. We use these scores in a weighting scheme for the annotated tweets. This allows us to exploit the fact that an annotator may be more reliable for tweets belonging to a certain topic than for tweets of other topics.</p>
      <p>This work is organized as follows. We next discuss related work on crowdsourcing
and annotator reliability. In Section 3 we present the components of our approach.
Section 4 contains our evaluation framework, which encompasses also a simulator
for annotators. In Section 5 we report on our experiments for various percentages of
unreliable annotators, as generated by our simulator. The last section concludes our
study with a summary and future issues.</p>
      <p>A note on terminology: Throughout this work, we use the terms “instance” and “tweet” interchangeably. We call a user who assigns a “label” to an instance an “annotator”, and call this activity “annotation”.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>There are various approaches to tackling annotator reliability when crowdsourcing labels [1], [2], [3]. However, these studies require domain experts to validate the labels collected from annotators. In [1], Hao et al. model annotator reliability based on cumulative performance, but they do not consider that the same annotator may not keep providing labels over time. In [2], Bhowmick et al. propose a coefficient for measuring annotator reliability when multiple labels can be assigned to an instance; here, we use a single label for every tweet.</p>
      <p>Close to our work is the method of Swanson et al. [4], where annotators who have high agreement with other annotators are given higher reliability scores. In our work, annotators who deliver annotations identical to the inferred labels are assigned high reliability scores over the topics comprised in the annotated tweets. In [5], Pion-Tonachini et al. use Latent Dirichlet Allocation to model the annotators’ expertise over the classes, which play a role analogous to the topics in standard applications of LDA. They define a vote-class relationship to model each annotator’s individual interpretation of the classes given the votes. In our work, we do not limit the annotators’ expertise to the classes; instead, we learn the annotators’ reliability on latent topics modeled over the dataset, which better reflects the real-world setting.</p>
      <p>Furthermore, Pion-Tonachini et al. [5] present CL-LDA-BPE, an extension of their model that incorporates prior knowledge of the annotators’ expertise through a structured Bayesian framework. In contrast, we assume no prior knowledge, and therefore induce the annotators’ expertise from the annotations alone.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Our Approach</title>
      <p>Our goal is to acquire reliable sentiment labels for tweets, using Twitter users as annotators. Our approach towards this goal encompasses the following tasks, depicted in Figure 1 and described in the next subsections: (1) collecting instances and mapping them into topics, (2) ranking instances on consensus among annotators, (3) a topic-based reliability model for the annotators, and (4) a Weighted Voting with Topic-based Reliability Scoring mechanism (WVTRS).</p>
      <sec id="sec-3-1">
        <title>3.1 Collecting instances and mapping them into topics</title>
        <p>For the database of tweets Y (with L denoting the cardinality of Y) we acquire class labels (in our experiments: labels on sentiment) from Twitter: we developed a tool where each y ∈ Y is posted to Twitter as a poll for a period of 7 days, during which users of Twitter can vote for one of the possible labels. The nature of the environment automatically limits users to voting only once. Once the poll has expired, every response to the tweet by a user x is stored as (y, x, vote(y, x)), where vote(y, x) ∈ C and C is the set of classes. The annotators constitute a set X, whose cardinality we denote as M.</p>
        <p>We learn the topics over Y by computing the TF-IDF values for all terms, building an instance-term matrix, and decomposing it into an instance-topic matrix and a topic-term matrix using Non-negative Matrix Factorization (NMF). According to the topic-term matrix, each term is assigned to the topic in which it has its maximum value. When that term occurs in a tweet, we consider this maximum value as the contribution of the corresponding topic to the tweet, and refer to it as TP(y, j). If several terms belonging to the same topic occur in the same tweet, then TP(y, j) is the sum of these terms’ topic maxima. We represent each tweet y as an N-dimensional vector y = ⟨TP(y, 1), TP(y, 2), …, TP(y, N)⟩, where N is the number of topics.</p>
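        <p>As an illustration, the topic-mapping step above can be sketched with scikit-learn. This is a minimal sketch under our own naming, not the authors’ tool: the function name and variables are illustrative.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# Sketch of the topic-mapping step, assuming a list of raw tweet texts.
def tweet_topic_vectors(tweets, n_topics=2):
    vec = TfidfVectorizer()
    X = vec.fit_transform(tweets)              # instance-term matrix (TF-IDF)
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0)
    nmf.fit(X)
    H = nmf.components_                        # topic-term matrix
    term_topic = H.argmax(axis=0)              # topic where each term peaks
    term_peak = H.max(axis=0)                  # that maximum value
    TP = np.zeros((X.shape[0], n_topics))
    rows, cols = X.nonzero()
    for y, t in zip(rows, cols):               # accumulate topic contributions
        TP[y, term_topic[t]] += term_peak[t]
    return TP                                  # TP[y, j] = contribution of topic j in tweet y
```

        <p>Each row of TP is the vector ⟨TP(y, 1), …, TP(y, N)⟩ described above.</p>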
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Ranking instances on consensus</title>
        <p>In most real-life crowdsourcing scenarios without monetary remuneration, it is reasonable to expect that very few users will contribute consistently to the system, so the intensity with which users interact with the system is skewed. It is also possible that some instances receive more votes than others for a variety of reasons (ease of annotation, skewed availability of expertise, etc.). To accommodate this fact, we first sort the tweets on ‘maximum consensus’, and then step through the collected responses one tweet at a time, incrementally updating the annotator reliability (computed as described in the next subsection).</p>
        <p>For tweet y and class label c, let votes(y, c) be the number of annotators who assigned c to y. We assign each tweet the class chosen by majority voting, i.e. mvlabel(y) = argmax_{c ∈ C} votes(y, c). We use this number also to assign a rank to y: we rank the instances in list W on how often they received the class label mvlabel(y). The instance with the largest number of votes takes rank position 1. This is achieved by computing for each y the value max_{c ∈ C} votes(y, c) and sorting the instances accordingly. The rank reflects the agreement of the annotators on the class label selected by majority voting. We consider consensus an indicator of how much the class label of the instance can be trusted, and process high-ranked instances before low-ranked ones when computing annotator reliability (see next subsection).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Topic-based reliability model for annotators</title>
        <p>To distinguish reliable annotators from unreliable ones, we introduce the concept of a reliability score: for each tweet y ∈ W annotated by x, we set agreement(x, y) = 1 if vote(x, y) = inferredLabel(y), and agreement(x, y) = 0 otherwise.</p>
        <p>We then define the reliability score of annotator x over topic j as RS(x, j) = Σ_{y ∈ W ∧ TP(y, j) ≠ 0} agreement(x, y). Each annotator is represented as an N-dimensional vector whose j-th position contains the reliability score for topic j, for j = 1 … N.</p>
        <p>RS(x, j) ∈ [1, n_j + 1], where n_j is the number of tweets comprising topic j. We consider annotator a more reliable than annotator b in topic j if RS(a, j) &gt; RS(b, j), i.e. annotator a provided more annotations identical to the inferred labels than annotator b did for tweets comprising topic j. A high topic-based reliability score indicates high reliability of the annotator for that topic. In the next subsection, we refine the computation of the reliability scores to take the incremental processing of instances into account.</p>
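        <p>A minimal sketch of this preliminary (non-incremental) scoring scheme, assuming dict-based containers of our own choosing:</p>

```python
# Sketch of the preliminary reliability score of subsection 3.3, assuming
# `votes` maps (x, y) to a label, `inferred` maps y to the inferred label,
# and TP maps (y, j) to a topic contribution. Names are illustrative.
def reliability_scores(annotators, topics, votes, inferred, TP):
    RS = {(x, j): 1 for x in annotators for j in topics}  # scores start at 1
    for (x, y), label in votes.items():
        if inferred[y] != label:
            continue                     # agreement(x, y) = 0 adds nothing
        for j in topics:
            if TP.get((y, j), 0) != 0:
                RS[(x, j)] += 1          # agreement(x, y) = 1
    return RS
```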
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Weighted Voting with Topic-based Reliability Scores</title>
        <p>Here we introduce our unsupervised incremental learning approach that applies topic-based weights to the given annotations. The votes are weighted with the annotators’ topic-based reliability scores without considering the different proportions of the topics comprised in a tweet; we only consider the incidence of topics.</p>
        <p>Let W be the list of ranked tweets. The tweets y ∈ W are processed incrementally and the reliability scores are updated along the way. Here we refine the computing scheme of the reliability scores introduced in subsection 3.3: the computation is applied in an incremental mode. We start with the top-1 instance in list W, infer its label using the initial reliability scores, update the reliability scores for the topics comprised in that instance according to its inferred label, then infer the label of the top-2 instance with the updated reliability scores, update the scores again, and so on, until the last instance in W has been processed. The approach consists of the following steps:</p>
        <p>1. Initialize the reliability scores for all (x, j) pairs to 1: RS(x, j, 1) ← 1.</p>
        <p>2. Infer labels for tweets incrementally, starting at the instance at rank 1. Each vote is weighted with the sum of the annotator’s tweet-related topic reliability scores: voteWeight(x, y) ← Σ_{j : TP(y, j) ≠ 0} RS(x, j, t−1).</p>
        <p>3. Aggregate the weights of annotators who provided identical votes by summing them up: classWeight(c, y) ← Σ_{x : vote(x, y) = c} voteWeight(x, y).</p>
        <p>4. Select the class label that collected the highest weight as the label for the tweet: InferredLabel(y) ← argmax_{c ∈ C} classWeight(c, y).</p>
        <p>5. For each annotator who voted identically to the inferred label, increment the tweet-related topic reliability scores by 1: RS(x, j, t) ← RS(x, j, t−1) + 1, whereas the reliability scores of all other annotators remain unchanged: RS(x, j, t) ← RS(x, j, t−1).</p>
        <p>Repeat steps 2 to 5 for the next tweets in the ranked list W, until all tweets in the list are processed.</p>
        <p>The steps for inferring the labels and deriving reliability scores for the annotators are detailed in Algorithm 1.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Algorithm 1: Weighted Voting with Topic-based Reliability Scores (WVTRS)</title>
        <p>INPUT: X: set of annotators; W: list of ranked tweets; J: set of topics; C: set of classes; R: set of topic-based reliability scores; TP(y, j): contribution of topic j in tweet y.</p>
        <p>// Initialize all topic-based reliability scores
for x ∈ X do
  for j ∈ J do
    RS(x, j) ← 1
// Process the ranked tweets
for y ∈ W do
  for c ∈ C do
    classWeight(c, y) ← 0
    for x ∈ X do
      if label(x, y) ≠ 0 ∧ vote(x, y) = c then
        for j ∈ J do
          if TP(y, j) ≠ 0 then
            classWeight(c, y) ← classWeight(c, y) + RS(x, j)
  // choose the class that collected the highest weight as the label of tweet y
  InferredLabel(y) ← argmax_{c ∈ C} classWeight(c, y)
  // update the topic-based reliability scores
  for x ∈ X do
    if vote(x, y) = InferredLabel(y) then
      for j ∈ J do
        if TP(y, j) ≠ 0 then
          RS(x, j) ← RS(x, j) + 1</p>
        <p>OUTPUT: inferred labels and the annotators’ topic-based reliability scores.</p>
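        <p>The WVTRS loop can be sketched in a few lines of Python; this is a simplified illustration under our own data-structure assumptions (dicts for votes and topic contributions), not the authors’ implementation.</p>

```python
# Sketch of the incremental WVTRS inference, assuming `votes` maps
# (x, y) to a class label and TP maps (y, j) to a topic contribution.
def wvtrs(ranked_tweets, annotators, topics, classes, votes, TP):
    RS = {(x, j): 1.0 for x in annotators for j in topics}  # init scores to 1
    inferred = {}
    for y in ranked_tweets:                  # process by decreasing consensus
        class_weight = {c: 0.0 for c in classes}
        for (x, yy), c in votes.items():
            if yy != y:
                continue
            # vote weight: sum of reliability over topics present in y
            w = sum(RS[(x, j)] for j in topics if TP.get((y, j), 0) != 0)
            class_weight[c] += w
        label = max(classes, key=lambda c: class_weight[c])
        inferred[y] = label
        # reward annotators who agreed with the inferred label
        for (x, yy), c in votes.items():
            if yy == y and c == label:
                for j in topics:
                    if TP.get((y, j), 0) != 0:
                        RS[(x, j)] += 1
    return inferred, RS
```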
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation framework</title>
      <p>To evaluate our approach, we use the metrics presented in subsection 4.1. Since we have no ground truth on topic reliability, we built a simulator, described in subsection 4.2.</p>
      <sec id="sec-4-1">
        <title>4.1 Experiment Evaluation Metrics</title>
        <p>As the basis of our evaluation, we consider accuracy, computed as the ratio of correctly labeled tweets to all tweets. We further introduce an error metric that measures the deviation between estimated and true reliability scores:</p>
        <p>ErrorTopicReliabilityScores = √( Σ_{x=1}^{M} Σ_{j=1}^{N} (simRS(x, j) − RS(x, j))² ) / (M · N)</p>
        <p>where the simRS(x, j) are the reliability score values created by the simulator introduced in the next subsection; they serve as ground truth.</p>
        <p>The Kappa weighted voting method [4] and the majority voting baseline do not employ topic reliability scores in the inference process. Therefore, for them we use the preliminary computing scheme of the reliability scores introduced in subsection 3.3, in which the computation is not conducted incrementally.</p>
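        <p>A minimal sketch of this error metric, assuming the estimated and simulated scores are given as M×N nested lists (annotators × topics); the function name is ours.</p>

```python
import math

# Root of the summed squared score deviations, normalized by M * N.
def topic_reliability_error(simRS, RS):
    M, N = len(simRS), len(simRS[0])
    sq_sum = sum((simRS[x][j] - RS[x][j]) ** 2
                 for x in range(M) for j in range(N))
    return math.sqrt(sq_sum) / (M * N)
```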
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Simulation</title>
        <p>Since collecting labeled tweets in a closed setting was not feasible for this study, our experimental setting simulates the annotations that we would otherwise collect via Social Media. We simulate three different types of annotators: reliable, partially reliable and unreliable annotators. Reliable and partially reliable annotators represent humans with good intentions.</p>
        <p>We refer to the reliability accuracy of annotator x over topic j as RA(x, j), where RA(x, j) ∈ [0, 1]. Reliable annotators are more likely to deliver correct labels than partially reliable ones; hence, we assign a high topic-based reliability accuracy RA(x, j) = 0.8 to reliable annotators and a relatively low one, RA(x, j) = 0.05, to partially reliable annotators. For each topic, we generate 75% of the annotators as reliable, while the remaining 25% are assumed to be partially reliable. Unreliable annotators are assumed to always provide wrong labels, with RA(x, j) = 0.0.</p>
        <p>To simulate the likelihood of responding to a tweet, we assume that the number of annotations each annotator provides is a random variable that follows a uniform distribution. In the simulator, we incorporate the different proportions of the topics comprised in a tweet: the probability that annotator x labels tweet y correctly is the average of the topic-based reliability accuracies, weighted with the tweet-topic coefficients, as per the formula below:</p>
        <p>ProbabilityOfCorrectLabel(y, x) = (TP(y, 1) · RA(x, 1) + TP(y, 2) · RA(x, 2) + … + TP(y, N) · RA(x, N)) / (TP(y, 1) + TP(y, 2) + … + TP(y, N))</p>
        <p>For every tweet y and annotator x, annotations are generated according to the likelihood of responding to tweet y and to the probability of correctly labeling it.</p>
        <p>After the simulation of the annotations, we assign to every annotator x a reliability score simRS(x, j) over topic j; these scores serve as ground truth. They are computed in a similar manner to the preliminary computing scheme of the reliability scores introduced in subsection 3.3. However, instead of relying on the inferred labels, the simRS(x, j) are computed from the generated annotations and the true label label(y), known from the ground truth.</p>
        <p>For each tweet y annotated by x, we set simAgreement(x, y) = 1 if vote(x, y) = label(y), and simAgreement(x, y) = 0 otherwise. Then, we compute the reliability score of annotator x over topic j as simRS(x, j) = Σ_{y ∈ Y ∧ TP(y, j) ≠ 0} simAgreement(x, y). Each annotator is represented as an N-dimensional vector whose j-th position contains the reliability score for topic j, for j = 1 … N. simRS(x, j) ∈ [1, n_j + 1], where n_j is the number of tweets comprising topic j.</p>
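        <p>The simulator’s probability of a correct label can be sketched as below, assuming the topic contributions and reliability accuracies of a tweet-annotator pair are given as parallel lists over the N topics; the names are illustrative.</p>

```python
# Weighted average of the topic-based reliability accuracies RA(x, j),
# with the tweet-topic contributions TP(y, j) as weights.
def prob_correct_label(TP_y, RA_x):
    num = sum(tp * ra for tp, ra in zip(TP_y, RA_x))
    den = sum(TP_y)
    return num / den if den else 0.0
```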
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments</title>
      <sec id="sec-5-1">
        <title>5.1 Outline</title>
        <p>We ran several experiments to investigate how the number of labels an annotator provides and the reliability of this annotator affect the quality of a model that classifies the instances on sentiment.</p>
        <p>We run our experiments on the U.S. Airline Sentiment dataset (https://www.kaggle.com/crowdflower/twitter-airline-sentiment), which we denote as A(irline) hereafter. From it, we created three random samples of size 1000, three of size 2500 and three of size 5000 tweets to be annotated. Whenever we report quality in the experiments, we refer to accuracy, averaged over the three samples of the same size.</p>
        <p>We first ran experiments to find the best number of topics for our approach (subsection 5.2). Then, we tested the effect of consensus ranking on the performance of our model (subsection 5.3), assuming 500, 1000 and 2000 annotators.</p>
        <p>To evaluate the robustness of our model we simulated three types of crowds A, B, C, incorporating different percentages of unreliable annotators: 1) Crowd A: 30% of the annotators are unreliable. 2) Crowd B: 10% of the annotators are unreliable. 3) Crowd C: only reliable annotators. We used these crowds to study the effect of retaining the annotators’ reliability scores in the system across many annotation tasks, assuming that a subset of annotators is active and assigns labels for several annotation tasks on the annotation platform (subsection 5.4). To test the effect of learning the annotators’ reliability scores on the performance, we conducted a comparison over two aspects: (1) different numbers of annotators, and (2) different numbers of annotations per annotator (subsection 5.5).</p>
        <p>Finally, in subsection 5.6 we report the overall performance of our model against the baselines, Kappa weighted voting [4] and Majority Voting, for different numbers of tweets and varying numbers of annotators.</p>
        <p>Across all experiments discussed earlier, the accuracy reported is the average
accuracy computed over three disjoint sets of tweets.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Experiment on organizing the tweets into topics</title>
        <p>In this experiment, we study how the number of topics affects the performance. We assume 1000 tweets and 1000 annotators, 10% of whom are assumed to be unreliable. We find that 15 topics modeled over the entire dataset gives the best performance, as shown in Figure 2.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3 Experiment on instances ranking</title>
        <p>In this experiment, we study how the ranking of instances improves the model performance. In the complete absence of prior knowledge about the annotators, their reliability scores are estimated only from the provided annotations. Based on our assumption that the majority is reliable, and since tweets are processed sequentially, we test the impact of processing the tweets that received the highest consensus first. Ranking the instances gives a better estimation of the reliability scores and hence improves the model performance. Detailed results comparing ranked and unranked tweets for 1000 tweets annotated by 500 annotators across the different types of crowds are shown in Table 1.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4 Model performance for constantly active annotators</title>
        <p>To test the performance of the model in this scenario, we simulate the time factor by assuming that annotating five datasets of 1000 tweets each is equivalent to annotating one set of 5000 tweets, and that annotating two datasets of 1000 tweets each is equivalent to annotating one set of 2000 tweets. For every dataset, the annotators labeled four random tweets. As shown in Table 2, the best results are observed when annotators participated in more annotation tasks (i.e., five datasets).</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5 Comparison of performance achieved by different number of annotators annotating randomly varying number of tweets</title>
        <p>Assuming a fixed number of annotations per annotator, the larger the number of participating annotators, the higher the achieved accuracy: the larger group of 1000 annotators outperformed the group of 500 annotators when each annotated four random tweets. However, the group of 500 annotators achieved higher accuracy when annotating more tweets (eight tweets each), and the best performance overall was delivered by the smallest group of annotators (500) labeling the largest dataset of 5000 tweets, according to the results shown in Table 3.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6 Overall Performance</title>
        <p>We compare our approach WVTRS against the baselines Kappa Weighted Voting (KWV) and Majority Voting (MV). The overall results for different numbers of annotators on dataset A are shown in Table 4.</p>
        <p>Across all experiments, our approach outperformed the baselines. The model achieved its best performance when the smallest number of annotators (500) annotated a dataset of 5000 tweets. Thus, the more annotations an annotator delivers, the better the model can estimate that annotator’s reliability scores, and the better the label inference becomes. The model was also robust across different percentages of unreliable annotators and performed better than the Kappa Weighted Voting approach. We evaluated our approach on a dataset whose topics are highly homogeneous; further tests are required to determine whether our model also performs well on more heterogeneous datasets. These results suggest that the proposed WVTRS approach is promising, and they should be complemented with tests on datasets with different levels of topic heterogeneity and more informative topics.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this paper we propose an approach to distinguish between reliable and unreliable annotators over topical areas, and to infer labels through weighted voting with the annotators’ topic-based reliability scores. We believe our approach can be made more accurate by improving the topic modeling step. The limitations of our approach are: 1) The different proportions of the topics comprised in a tweet are treated equally: the votes are weighted with the annotators’ topic-based reliability scores without considering these proportions. Due to the homogeneity of topics in the chosen dataset, the experiments do not manifest the impact of this limitation. 2) Processing the tweets online is not feasible, due to the tweet-ranking step.</p>
      <p>As future work we intend to incorporate prior knowledge about the annotators by crawling their Twitter profiles. We can consider each annotator as a document, then apply topic modeling over both tweets and annotators. Hence, we can measure the similarity between annotators and tweets and weigh the votes given by annotators with these similarities: the higher the similarity between an annotator and a tweet, the more reliable that annotator’s annotation for that tweet is.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work is supported by the German Research Foundation (DFG) under project
OSCAR (Opinion Stream Classification with Ensembles and Active learners).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , “
          <article-title>Active crowdsourcing for annotation</article-title>
          ,” in
          <source>2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)</source>
          ,
          <source>vol. 2</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Basu</surname>
          </string-name>
          , “
          <article-title>An agreement measure for determining interannotator reliability of human judgements on affective text,”</article-title>
          <source>in Proceedings of the Workshop on Human Judgements in Computational Linguistics</source>
          ,
          <source>HumanJudge '08</source>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. P.-Y. Hsueh,
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Sindhwani</surname>
          </string-name>
          , “
          <article-title>Data quality from crowdsourcing: A study of annotation selection criteria</article-title>
          ,
          <source>” Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Swanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lukin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Eisenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Corcoran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          , “
          <article-title>Getting reliable annotations for sarcasm in online dialogues</article-title>
          ,”
          <source>in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)</source>
          , pp.
          <fpage>4250</fpage>
          -
          <lpage>4257</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Pion-Tonachini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Makeig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kreutz-Delgado</surname>
          </string-name>
          , “
          <article-title>Crowd labeling latent Dirichlet allocation</article-title>
          ,”
          <source>Knowledge and Information Systems</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Raykar</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , “
          <article-title>Ranking annotators for crowdsourced labeling tasks</article-title>
          ,”
          <source>NIPS</source>
          , vol.
          <volume>24</volume>
          , pp.
          <fpage>1809</fpage>
          -
          <lpage>1817</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rüger</surname>
          </string-name>
          , “
          <article-title>How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation</article-title>
          ,”
          <source>in Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10</source>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>566</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , “
          <article-title>SemEval-2017 task 4: Sentiment analysis in Twitter</article-title>
          ,”
          <source>in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          , pp.
          <fpage>502</fpage>
          -
          <lpage>518</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuboyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , “
          <article-title>Topic extraction from millions of tweets using singular value decomposition and feature selection</article-title>
          ,”
          <source>in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)</source>
          , pp.
          <fpage>1145</fpage>
          -
          <lpage>1150</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>