
Stance Classification for Rumour Analysis in Twitter: Exploiting Affective Information and Conversation Structure

Endang Wahyu Pamungkas

Valerio Basile

Viviana Patti

Dipartimento di Informatica, Università degli Studi di Torino

pattig@di.unito.it

¹ http://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/

Analysing how people react to rumours associated with news in social media is an important task to prevent the spreading of misinformation, which is nowadays widely recognized as a dangerous tendency. In social media conversations, users show different stances and attitudes towards rumourous stories. Some users take a definite stance, supporting or denying the rumour at issue, while others just comment on it, or ask for additional evidence on the rumour's veracity. A shared task has been proposed at SemEval-2017 (Task 8, SubTask A), which is focused on rumour stance classification in English tweets. The goal is to predict user stance towards emerging rumours in Twitter, in terms of supporting, denying, querying, or commenting on the original rumour, looking at the conversation threads originated by the rumour. This paper describes a new approach to this task, where the use of conversation-based and affective-based features, covering different facets of affect, is explored. Our classification model outperforms the best-performing systems for stance classification at SemEval-2017, showing the effectiveness of the proposed feature set.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Nowadays, people increasingly tend to use social media like Facebook and Twitter as their primary source of information and news consumption. There are several reasons behind this tendency, such as the ease of gathering and sharing news and the possibility of staying abreast of the latest news faster than with traditional media. An important factor is also that people can engage in conversations on the latest breaking news with their contacts by using these platforms. Pew Research Center's newest report¹ shows that two-thirds of U.S. adults gather their news from social media, where Twitter is the most used platform. However, the absence of a systematic approach to fact and veracity checking may also encourage the spread of rumourous stories and misinformation [PVV13]. Indeed, in social media, unverified information can spread very quickly and become viral easily, enabling the diffusion of false rumours and fake information.

Within this scenario, it is crucial to analyse people's attitudes towards rumours in social media and to resolve their veracity as soon as possible. Several approaches have been proposed to check rumour veracity in social media [SSW+17]. This paper focuses on a stance-based analysis of event-related rumours, following the approach proposed at SemEval-2017 in the new RumourEval shared task (Task 8, sub-task A) [DBL+17]. In this task, English tweets from conversation threads, each associated with a newsworthy event and the rumours around it, are provided as data. The goal is to determine whether a tweet in the thread is supporting, denying, querying, or commenting on the original rumour which started the conversation. It can be considered a stance classification task, where we have to predict the user's stance towards the rumour from a tweet, in the context of a given thread. This task has been defined as an open stance classification task and is conceived as a key step in rumour resolution, by providing an analysis of people's reactions towards an emerging rumour [PVV13, ZLP+16]. The task is also different from detecting stance towards a specific target entity [MKS+16].

Contribution. We describe a novel classification approach, proposing a new feature matrix which includes two new groups: (a) features exploiting the conversational structure of the dataset [DBL+17]; (b) affective features relying on the use of a wide range of affective resources capturing different facets of sentiment and other affect-related phenomena. We were also inspired by the fake news study on Twitter in [VRA18], showing that false stories inspire fear, disgust, and surprise in replies, while true stories inspire anticipation, sadness, joy, and trust. Meanwhile, from a dialogue act perspective, the study of [NS13] found that a relationship exists between the use of an affective lexicon and the communicative intention of an utterance, which includes AGREE-ACCEPT (support), REJECT (deny), INFO-REQUEST (question), and OPINION (comment). They exploited several LIWC categories to analyse the role of affective content.

Our results show that our model outperforms the state of the art on the SemEval-2017 benchmark dataset. Feature analysis highlights the contribution of the different feature groups, and error analysis sheds some light on the main difficulties and challenges which still need to be addressed.

Outline. The paper is organized as follows. Section 2 introduces the SemEval-2017 Task 8. Section 3 describes our approach to open stance classification, which exploits different groups of features. Section 4 describes the evaluation and includes a qualitative error analysis. Finally, Section 5 concludes the paper and points to future directions.

2 SemEval-2017 Task 8: RumourEval

The SemEval-2017 Task 8 Subtask A [DBL+17] has as its main objective to determine the stance of the users in a Twitter thread towards a given rumour, in terms of supporting, denying, querying or commenting (SDQC) on the original rumour. A rumour is defined as a "circulating story of questionable veracity, which is apparently credible but hard to verify, and produces sufficient skepticism and/or anxiety so as to motivate finding out the actual truth" [ZLP+15]. The task was very timely due to the growing importance of rumour resolution in breaking news and to the urgency of preventing the spreading of misinformation.

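[Table 1: distribution of tweets and SDQC labels per rumour in the dataset splits. Rumours in the first block: Charlie Hebdo, Ebola Essien, Ferguson, Ottawa Shooting, Prince Toronto, Putin Missing, Sydney Siege. Rumours in the second block: Ferguson, Ottawa Shooting, Sydney Siege, Charlie Hebdo, Germanwings, Marina Joyce, Hillary's Illness. Numeric cells are not available.]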

Dataset². The data for this task are taken from Twitter conversations about news-related rumours collected by [ZLP+16]. They were annotated using four labels (SDQC): support (S), when the tweet's author supports the rumour's veracity; deny (D), when the tweet's author denies the rumour's veracity; query (Q), when the tweet's author asks for additional information or evidence; comment (C), when the tweet's author just makes a comment and does not give important information to assess the rumour's veracity. The distribution consists of three sets: development, training and test sets, as summarized in Table 1, which also shows the label distribution and the news events related to the rumours discussed. Training data consist of 297 Twitter conversations and 4,238 tweets in total with related direct and nested replies, where conversations are associated with seven different breaking news events. Test data consist of 1,049 tweets, where two new rumourous topics were added.

Participants. Eight teams participated in the task. The best performing system was developed by Turing (78.4 in accuracy). ECNU, MamaEdha, UWaterloo, and DFKI-DKT used ensemble classifiers. Some systems also used deep learning techniques, including Turing, IKM, and MamaEdha. Meanwhile, NileTRMG and IITP used a classical classifier (SVM) to build their systems. Most of the participants exploited word embeddings to construct their feature space, besides the Twitter domain features.

² http://alt.qcri.org/semeval2017/task8/index.php?id=data-and-tools

3 Proposed Method

We developed a new model by exploiting several stylistic and structural features characterizing Twitter language. In addition, we propose to use conversation-based features by exploiting the peculiar tree structure of the dataset. We also explored the use of affective-based features by extracting information from several affective resources, including dialogue-act inspired features.

3.1 Structural Features

These features were designed taking into account several characteristics of Twitter data, and then selecting the most relevant features to improve the classification performance. The set of structural features that we used is listed below.

Retweet Count: number of retweets of each tweet.

Question Mark: presence of a question mark "?"; binary value (0 or 1).

Question Mark Count: number of question marks present in the tweet.

Hashtag Presence: binary value, 0 (if there is no hashtag in the tweet) or 1 (if there is at least one hashtag in the tweet).

Text Length: number of characters after removing Twitter markers such as hashtags, mentions, and URLs.

URL Count: number of URL links in the tweet.
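
The six structural features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the regular expressions, and the choice to ignore question marks that occur inside URLs are all assumptions, and the retweet count is taken as given from the tweet metadata.

```python
import re

# Simple patterns for Twitter markers (assumed, not taken from the paper).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def structural_features(text: str, retweet_count: int) -> dict:
    """Compute the six structural features for a single tweet."""
    # Drop URLs first so that a "?" inside a query string is not counted.
    no_urls = URL_RE.sub("", text)
    question_marks = no_urls.count("?")
    # Remove mentions and hashtags as well before measuring the text length.
    stripped = HASHTAG_RE.sub("", MENTION_RE.sub("", no_urls)).strip()
    return {
        "retweet_count": retweet_count,
        "question_mark": int(question_marks > 0),
        "question_mark_count": question_marks,
        "hashtag_presence": int(HASHTAG_RE.search(text) is not None),
        "text_length": len(stripped),
        "url_count": len(URL_RE.findall(text)),
    }


example = structural_features(
    "@AP Is this true? Really?? #Ferguson http://t.co/abc", retweet_count=5
)
```

Each tweet is mapped to a small dictionary, which can then be stacked into the columns of a feature matrix.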

3.2 Conversation Based Features

These features exploit the peculiar characteristics of the dataset, which has a tree structure reflecting the conversation thread³.

Text Similarity to Source Tweet: Jaccard similarity of each tweet with its source tweet.

Text Similarity to Replied Tweet: the degree of similarity between the tweet and the previous tweet in the thread (the tweet it replies to).

Tweet Depth: the depth of the tweet in the thread hierarchy, obtained by counting the nodes from the source (root) to the tweet.
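
As a rough sketch of the conversation-based features above, the snippet below computes a token-level Jaccard similarity and the depth of a tweet in a reply tree. Several details are assumptions made for illustration: whitespace tokenization, lower-casing, reusing Jaccard for the similarity to the replied tweet, the hypothetical `parent_of` mapping (tweet id to parent id, `None` for the source), and counting depth in edges so that the source tweet has depth 0.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the token sets of two tweets."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both texts are empty
    return len(tokens_a & tokens_b) / len(union)


def tweet_depth(tweet_id: str, parent_of: dict) -> int:
    """Number of reply steps between the source tweet and this tweet."""
    depth = 0
    parent = parent_of.get(tweet_id)
    while parent is not None:
        depth += 1
        parent = parent_of.get(parent)
    return depth


# Toy thread: a source tweet "s", a direct reply "r1", and a nested reply "r2".
thread = {"s": None, "r1": "s", "r2": "r1"}
similarity = jaccard_similarity("kill the police", "the police lie")
```

With the toy thread, `tweet_depth("r2", thread)` walks two steps up to the source, and the two example texts share two of four distinct tokens.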

³ The implementation of these features is inspired by unpublished shared code [Gra17].

3.3 Affective Based Features

The idea of using affective features in the context of our task was inspired by recent work on fake news detection, focusing on emotional responses to true and false rumours [VRA18], and by the work in [NS13] reflecting on the role of affect in dialogue acts. Multifaceted affective features have already been proven to be effective in some related tasks [LFPR16], including the stance detection task proposed at SemEval-2016 (Task 6).

We used the following affective resources relying on different emotion models.

Emolex: it contains 14,182 words associated with eight primary emotions based on the Plutchik model [MT13, Plu01].

EmoSenticNet (EmoSN): it is an enriched version of SenticNet [COR14] including 13,189 words labeled with Ekman's six basic emotions [PGH+13, Ekm92].

Dictionary of Affect in Language (DAL): includes 8,742 English words labeled with three scores representing three dimensions: Pleasantness, Activation and Imagery [Whi09].

Affective Norms for English Words (ANEW): consists of 1,034 English words [BL99] with ratings based on the Valence-Arousal-Dominance (VAD) model [OST57].

Linguistic Inquiry and Word Count (LIWC): this psycholinguistic resource [PFB01] includes 4,500 words distributed into 64 emotional categories, including positive (PosEMO) and negative (NegEMO).
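
To make the use of such resources concrete, the following sketch turns a tokenized tweet into lexicon-based features. It is only an illustration under stated assumptions: the two toy lexicons contain made-up entries (they are not taken from Emolex, DAL, or ANEW), and the aggregation choices (counting matches for categorical lexicons, averaging ratings for dimensional ones) are assumptions, since the text does not specify how scores are aggregated.

```python
# Hypothetical categorical lexicon: word -> set of emotion labels.
TOY_EMOTION_LEXICON = {
    "attack": {"fear", "anger"},
    "dead": {"fear", "sadness"},
    "hope": {"joy", "anticipation"},
}

# Hypothetical dimensional lexicon: word -> rating on a single dimension.
TOY_VALENCE_LEXICON = {"attack": 2.0, "dead": 1.5, "hope": 7.0}


def emotion_counts(tokens, lexicon, emotions):
    """Count how many tokens are associated with each emotion label."""
    counts = {emotion: 0 for emotion in emotions}
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            if emotion in counts:
                counts[emotion] += 1
    return counts


def mean_rating(tokens, lexicon):
    """Average rating of the tokens found in the lexicon (0.0 if none)."""
    ratings = [lexicon[token] for token in tokens if token in lexicon]
    return sum(ratings) / len(ratings) if ratings else 0.0


tokens = "hope nobody is dead".split()
counts = emotion_counts(tokens, TOY_EMOTION_LEXICON, ["fear", "anger", "joy"])
valence = mean_rating(tokens, TOY_VALENCE_LEXICON)
```

The same counting function would also apply to category-based resources such as LIWC, with category names in place of emotion labels.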

3.4 Dialogue-Act Features

We also included 11 additional categories from LIWC, which were already proven to be effective in the dialogue-act task in previous work [NS13]. Basically, these features are part of the affective feature group, but we present them separately because we are interested in exploring the contribution of this feature set on its own. This feature set was obtained by selecting 4 communicative goals related to our classes in the stance task: agree-accept (support), reject (deny), info-request (question), and opinion (comment). The 11 LIWC categories include:

Agree-accept: Assent, Certain, Affect;

Reject: Negate, Inhib;

Info-request: You, Cause;

Opinion: Future, Sad, Insight, Cogmech.

4 Evaluation

We used the RumourEval dataset from SemEval-2017 Task 8 described in Section 2. We defined the rumour stance detection problem as a simple four-way classification task, where every tweet in the dataset (source and direct or nested reply) should be classified into one of four classes: support, deny, query, and comment. We conducted a set of experiments in order to evaluate and analyze the effectiveness of our proposed feature set⁴.

The results are summarized in Table 2, showing that our system outperforms all of the other systems in terms of accuracy. Our best result was obtained by a simple configuration with a support vector classifier with radial basis function (RBF) kernel. Our model performed better than the best-performing system in SemEval-2017 Task 8 Subtask A (Turing team, [KLA17]), which exploited a deep learning approach using a Branch-LSTM model. In addition, we also obtained a higher accuracy than the system described in [ADB17], which exploits a Random Forest classifier and word-embedding-based features.

We experimented with several classifiers, including Naive Bayes, Decision Trees, Support Vector Machine, and Random Forest, noting that SVM outperforms the other classifiers on this task. We explored the parameter space by tuning the SVM hyperparameters, namely the penalty parameter C, kernel type, and class weights (to deal with class imbalance). We tested several values for C (0.001, 0.01, 0.1, 1, 10, 100, and 1000), four different kernels (linear, RBF, polynomial, and sigmoid) and weighted the classes based on their distribution in the training data. The best result was obtained with C=1, RBF kernel, and without class weighting.
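
The hyperparameter search described above could be set up with scikit-learn roughly as follows. This is a sketch under assumptions rather than the authors' script: mapping "weighted the classes based on their distribution" to `class_weight="balanced"`, selecting by accuracy, and using 5-fold cross-validation are all choices made here for illustration; `grid_configurations` and `run_grid_search` are hypothetical helper names.

```python
from itertools import product

# Values for C and the kernels are those listed in the text; the use of
# class_weight="balanced" for class weighting is an assumption.
PARAM_GRID = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "class_weight": [None, "balanced"],
}


def grid_configurations(grid):
    """Enumerate every hyperparameter combination as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def run_grid_search(X, y, cv=5):
    """Fit an SVC for each configuration and return the best one.

    The cross-validation protocol is an assumption; the paper does not
    state how configurations were compared.
    """
    # Imported lazily so the grid can be inspected without scikit-learn.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    search = GridSearchCV(SVC(), PARAM_GRID, scoring="accuracy", cv=cv)
    search.fit(X, y)
    return search.best_params_, search.best_score_


configurations = grid_configurations(PARAM_GRID)
```

The grid contains 7 x 4 x 2 = 56 configurations, among them the one reported as best in the text (C=1, RBF kernel, no class weighting).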

An ablation test was conducted to explore the contribution of each feature set. Table 5 shows the results of our ablation test, obtained by exploiting several feature sets with the same classifier (SVM with RBF kernel)⁵. This evaluation includes macro-averages of precision, recall and F1-score as well as accuracy. We also present the scores for each class in order to get a better understanding of our classifier's performance.

⁴ We built our system using the scikit-learn Python library: http://scikit-learn.org/

⁵ Source code is available on the GitHub platform: https://github.com/dadangewp/SemEval2017-RumourEval
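
The per-class scores, their macro-averages, and the accuracy used in the ablation test can all be derived from a confusion matrix, as in the sketch below. The matrix values are invented for illustration and do not reproduce any table in the paper; computing macro-F1 as the mean of the per-class F1-scores is an assumed (though common) convention.

```python
def per_class_scores(confusion, labels):
    """Precision, recall and F1 per class.

    confusion[i][j] is the number of items with gold label i predicted as j.
    """
    scores = {}
    size = len(labels)
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(size))
        gold = sum(confusion[k])
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / gold if gold else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over the classes."""
    count = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / count for i in range(3))


def accuracy(confusion):
    """Share of items on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


LABELS = ["support", "deny", "query", "comment"]
# Toy confusion matrix (made-up counts): rows are gold, columns are predicted.
TOY_CONFUSION = [
    [2, 0, 0, 2],
    [0, 0, 0, 4],
    [0, 0, 1, 1],
    [1, 0, 0, 9],
]
toy_scores = per_class_scores(TOY_CONFUSION, LABELS)
toy_macro = macro_average(toy_scores)
toy_accuracy = accuracy(TOY_CONFUSION)
```

In the toy matrix every deny tweet is predicted as comment, so the deny scores are all zero and pull the macro-averages down even though the accuracy stays at 0.6; this is why macro-averages are informative on imbalanced data.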

Using only conversational, affective, or dialogue-act features (without structural features) did not give a good classification result. Set B (conversational features only) was not able to detect the query and deny classes, while sets C (affective features only) and D (dialogue-act features only) failed to catch the support, query, and deny classes. Conversational features were able to improve the classifier performance significantly, especially in detecting the support class. Sets E, H, I, and K, which use conversational features, show an improvement on the prediction of the support class (roughly from 0.3 to 0.73 in precision). Meanwhile, the combination of affective and dialogue-act features was able to slightly improve the classification of the query class. The improvement can be seen from set E to set K, where the F1-score of the query class increased from 0.52 to 0.58. Overall, the best result was obtained by set K, which encompasses all sets of features. It is worth noting that in our best system configuration, not all of the affective and dialogue-act features were used in our feature vector. After several optimization steps, we found that some features were not improving the system's performance. Our final list of affective and dialogue-act based features includes:

DAL Activation, ANEW Dominance, Emolex Negative, Emolex Fear, LIWC Assent, LIWC Cause, LIWC Certain and LIWC Sad. Therefore, we have only 17 columns of features in the best performing system (6 structural, 3 conversational, and 8 affective and dialogue-act features).

We conducted a further analysis of the classification result obtained by the best performing system (79.50 in accuracy). Table 3 shows the confusion matrix of our result. On the one hand, the system is able to detect the comment tweets very well. However, this result is biased by the large number of comment tweets in the dataset. On the other hand, the system fails to detect denying tweets, which were mostly misclassified as comments (68 out of 71)⁶. Meanwhile, approximately two thirds of supporting tweets and almost half of querying tweets were assigned to the correct class by the system.

In order to assess the impact of class imbalance on learning, we performed an additional experiment with a balanced dataset using the best performing configuration. We took a subset of the instances equally distributed with respect to their class from the training set (330 instances for each class) and the test set (71 instances for each class). As shown in Table 4, our classifier was able to predict the underrepresented classes much better, although the overall accuracy is lower (59.9%). The result of this analysis clearly indicates that class imbalance has a negative impact on the system performance.
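
A balanced subset like the one used in this experiment can be drawn by sampling the same number of instances from each class. The sketch below is an assumption about how this might be done: the function name, random sampling without replacement, and the fixed seed are illustrative choices, and the text does not say how the 330 and 71 instances per class were selected.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(instances, labels, per_class, seed=0):
    """Sample `per_class` instances from every class, without replacement."""
    by_class = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_class[label].append(instance)

    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < per_class:
            raise ValueError(f"class {label!r} has only {len(items)} instances")
        subset.extend((item, label) for item in rng.sample(items, per_class))
    rng.shuffle(subset)
    return subset


# Toy data: 8 comment tweets, 2 deny tweets and 2 query tweets.
toy_instances = list(range(12))
toy_labels = ["comment"] * 8 + ["deny"] * 2 + ["query"] * 2
toy_subset = balanced_subset(toy_instances, toy_labels, per_class=2)
toy_distribution = Counter(label for _, label in toy_subset)
```

The number of instances per class is bounded by the smallest class, which is why a balanced subset is much smaller than the original data.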

4.1 Error analysis

We conducted a qualitative error analysis on the 215 misclassified tweets in the test set, to shed some light on the issues and difficulties to be addressed in future work and to detect some notable error classes.

Denying by attacking the rumour's author. An interesting finding from the analysis of the Marina Joyce rumour data is that it contains many denying tweets that include insulting comments towards the author of the source tweet, as in the following cases:

Rumour: Marina Joyce

Misclassified tweets:
(da1) stfu you toxic sludge
(da2) @sampepper u need rehab

Misclassification type: deny (gold) → comment (prediction)

Source tweet: (s1) Anyone who knows Marina Joyce personally knows she has a serious drug addiction. she needs help, but in the form of rehab #savemarinajoyce

Tweets like (da1) and (da2) seem to be more inclined to show the respondent's personal hatred towards the author of tweet (s1) than to deny the veracity of the rumour. In other words, they represent a peculiar form of denying the rumour, which is expressed by personal attack and by showing negative attitudes or hatred towards the rumour's author. This is different from denying by attacking the source tweet content, and it was difficult for our system to capture, as it often misclassified this kind of tweet as a comment.

Noisy text, specific jargon, very short text. In (da1) and (da2) (as in many tweets in the test set), we also observe the use of noisy text (abbreviations, misspellings, slang words and slurs, question statements without a question mark, and so on) that our classifier struggles to handle. Moreover, especially in tweets from the Marina Joyce rumour group, we found some very short tweets in the denying class that do not provide enough information, e.g. tweets like "shut up!", "delete", and "stop it. get some help".

Argumentation context. We also observed misclassification cases that seem to call for a deeper capability of dealing with the argumentation context underlying the conversation thread.

Rumour: Ferguson

Misclassified tweet: (arg1) @QuadCityPat @AP I join you in this demand. Unconscionable.

Misclassification type: deny (gold) → comment (prediction)

Source tweet: (s2) @AP I demand you retract the lie that people in #Ferguson were shouting "kill the police", local reporting has refuted your ugly racism

⁶ A similar observation is reported by the best team at SemEval-2017 [KLA17].

Here the misclassified tweet is a reply including an explicit expression of agreement with the author of the source tweet ("I join you"). Tweet (s2) is one of the rare cases of source tweets denying the rumour (source tweets in the RumourEval17 dataset are mostly supporting the rumour at issue). Our hypothesis is that it is difficult for a system to detect this kind of stance without a deeper comprehension of the argumentation context (e.g., if the author's stance is denying the rumour, and I agree with him, then I am denying the rumour as well). In general, we observed that when the source tweet is annotated with the deny label, most of the denying replies in the thread include features typical of the support class (and vice versa), and this was a critical issue for our system.

Mixed cases. Furthermore, we found some borderline mixed cases in the gold standard annotation. See for instance the following case:

Rumour: Ferguson

Misclassified tweet: (mx1) @MichaelSkolnik @MediaLizzy Oh do tell where they keep track of "vigilante" stats. That's interesting.

Misclassification type: query (gold) → comment (prediction)

Source tweet: (s3) Every 28 hours a black male is killed in the United States by police or vigilantes. #Ferguson

Tweet (mx1) is annotated with a query label rather than as a comment (our system prediction), but we can observe the presence of a comment ("That's interesting") after the request for clarification, so it seems to be a kind of mixed case, where both labels make sense.

Citation of the source tweet. We noticed many misclassified cases of replying tweets with the error pattern support (gold) → comment (our prediction), where the text contains a literal citation of the source tweet, as in the following tweet: THIS HAS TO END "@MichaelSkolnik: Every 28 hours a black male is killed in the United States by police or vigilantes. #Ferguson" (the text enclosed in quotes is the source tweet). This kind of mistake could perhaps be addressed by applying some pre-processing to the data, for instance by detecting the literal citation and replacing it with a marker.
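
The pre-processing idea suggested above, detecting a literal citation of the source tweet and replacing it with a marker, might look like the following sketch. The function name, the `<SOURCE>` marker string, and the matching strategy (case-insensitive, tolerant to whitespace differences) are assumptions made for illustration; this step was not part of the evaluated system.

```python
import re


def replace_citation(reply_text: str, source_text: str, marker: str = "<SOURCE>") -> str:
    """Replace a literal citation of the source tweet in a reply with a marker."""
    words = source_text.split()
    if not words:
        return reply_text
    # Match the source words in order, allowing any whitespace between them.
    pattern = r"\s+".join(re.escape(word) for word in words)
    # A function is used as replacement so the marker is inserted verbatim.
    return re.sub(pattern, lambda match: marker, reply_text, flags=re.IGNORECASE)


source = (
    "Every 28 hours a black male is killed in the United States "
    "by police or vigilantes. #Ferguson"
)
reply = 'THIS HAS TO END "@MichaelSkolnik: ' + source + '"'
cleaned = replace_citation(reply, source)
```

After this step the reply keeps only the user's own words plus the marker, so text similarity and lexicon-based features are no longer dominated by the quoted source.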

Figurative language devices. Finally, the use of figurative language (e.g., sarcasm) is also an issue that should be considered in future work. Let us consider for instance the following misclassified tweets:

Rumour: Hillary's Illness

Misclassified tweets:
(fg1) @mitchellvii True, after all she can open a pickle jar.
(fg2) @mitchellvii Also, except for having a 24/7 MD by her side giving her Valium injections, Hillary is in good health! https://t.co/GieNxwTXX7
(fg3) @mitchellvii @JoanieChesnutt At the very peak yes, almost time to go down a cliff and into the earth.

Misclassification type: support (gold) → comment (prediction)

Source tweet: (s4) Except for the coughing, fainting, apparent seizures and "short-circuits," Hillary is in the peak of health.

All misclassified tweets (fg1-fg3) from the Hillary's Illness data are replies to a source tweet (s4), which is characterized by sarcasm. In these replies, the authors support the rumour by echoing the sarcastic tone of the source tweet. Such more sophisticated cases, where the supportive attitude is expressed in an implicit way, were challenging for our classifier, and they were quite systematically misclassified as simple comments.

5 Conclusion

In this paper we proposed a new classification model for rumour stance classification. We designed a set of features including structural, conversation-based, affective and dialogue-act based features. Experiments on the SemEval-2017 Task 8 Subtask A dataset show that our system, based on a limited set of well-engineered features, outperforms the state-of-the-art systems in this task, without relying on sophisticated deep learning approaches. Although we achieved a very good result, several research challenges related to this task are left open. Class imbalance was recognized as one of the main issues in this task. For instance, our system struggled to detect the deny class in the original dataset distribution, but it performed much better in that respect when we balanced the distribution across the classes.

A re-run of the RumourEval shared task has been proposed at SemEval 2019⁷ and it will be very interesting to participate in the new task with an evolution of the system described here.

Acknowledgements

Endang Wahyu Pamungkas, Valerio Basile and Viviana Patti were partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).

⁷ http://alt.qcri.org/semeval2019/

References

[ADB17] Ahmet Aker, Leon Derczynski, and Kalina Bontcheva. Simple open stance classification for rumour analysis. In Proc. of RANLP 2017, pages 31-39. INCOMA Ltd., 2017.

[BL99] Margaret M Bradley and Peter J Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical report, Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, 1999.

[COR14] Erik Cambria, Daniel Olsher, and Dheeraj Rajagopal. SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis. In Proc. of AAAI 2014, 2014.

[DBL+17] Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. In Proc. of SemEval-2017, pages 69-76. ACL, 2017.

[Ekm92] Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169-200, 1992.

[Gra17] David Graf. Semeval-2017-t8, June 2017.

[KLA17] Elena Kochkina, Maria Liakata, and Isabelle Augenstein. Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM. In Proc. of SemEval-2017, pages 475-480. ACL, 2017.

[LFPR16] Mirko Lai, Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. Friends and enemies of Clinton and Trump: using context for detecting stance in political tweets. In Proc. of MICAI 2016, volume 10061 of LNCS, pages 155-168. Springer, 2016.

[MKS+16] Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiao-Dan Zhu, and Colin Cherry. SemEval-2016 Task 6: Detecting stance in tweets. In Proc. of SemEval 2016, pages 31-41. ACL, 2016.

[MT13] Saif M Mohammad and Peter D Turney. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436-465, 2013.

[NS13] Nicole Novielli and Carlo Strapparava. The role of affect analysis in dialogue act identification. IEEE Transactions on Affective Computing, 4(4):439-451, 2013.

[OST57] C.E. Osgood, G.J. Suci, and P.H. Tannenbaum. The Measurement of Meaning. University of Illinois Press, Urbana, 1957.

[PFB01] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count (LIWC): LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 2001.

[PGH+13] Soujanya Poria, Alexander Gelbukh, Amir Hussain, Newton Howard, Dipankar Das, and Sivaji Bandyopadhyay. Enhanced SenticNet with affective labels for concept-based opinion mining. IEEE Intelligent Systems, 28(2):31-38, 2013.

[Plu01] Robert Plutchik. The nature of emotions. American Scientist, 89(4):344-350, 2001.

[PVV13] Rob Procter, Farida Vis, and Alex Voss. Reading the riots on Twitter: methodological innovation for the analysis of big data. International Journal of Social Research Methodology, 16(3):197-214, 2013.

[SSW+17] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22-36, 2017.

[VRA18] Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online. Science, 359(6380):1146-1151, 2018.

[Whi09] Cynthia Whissell. Using the revised dictionary of affect in language to quantify the emotional undertones of samples of natural language. Psychological Reports, 105(2):509-521, 2009.

[ZLP+15] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina Bontcheva, and Peter Tolmie. Towards detecting rumours in social media. In AAAI Workshop: AI for Cities, 2015.

[ZLP+16] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3):e0150989, 2016.