=Paper= {{Paper |id=Vol-2482/paper37 |storemode=property |title=Stance Classification for Rumour Analysis in Twitter: Exploiting Affective Information and Conversation Structure |pdfUrl=https://ceur-ws.org/Vol-2482/paper37.pdf |volume=Vol-2482 |authors=Endang Wahyu Pamungkas,Valerio Basile,Viviana Patti |dblpUrl=https://dblp.org/rec/conf/cikm/PamungkasBP18 }} ==Stance Classification for Rumour Analysis in Twitter: Exploiting Affective Information and Conversation Structure== https://ceur-ws.org/Vol-2482/paper37.pdf
    Stance Classification for Rumour Analysis in Twitter:
     Exploiting Affective Information and Conversation
                           Structure

                           Endang Wahyu Pamungkas, Valerio Basile, Viviana Patti
                                       Dipartimento di Informatica
                                     Università degli Studi di Torino
                                   {pamungka, basile, patti}@di.unito.it



                                                                 1   Introduction
                                                                 Nowadays, people increasingly tend to use social media
                          Abstract                               like Facebook and Twitter as their primary source of
                                                                 information and news consumption. There are several
     Analysing how people react to rumours associ-               reasons behind this tendency, such as the simplicity
     ated with news in social media is an important              to gather and share the news and the possibility of
     task to prevent the spreading of misinforma-                staying abreast of the latest news and updated faster
     tion, which is nowadays widely recognized as                than with traditional media. An important factor is
     a dangerous tendency. In social media con-                  also that people can be engaged in conversations on
     versations, users show different stances and                the latest breaking news with their contacts by us-
     attitudes towards rumourous stories. Some                   ing these platforms. Pew Research Center’s newest
     users take a definite stance, supporting or                 report1 shows that two-thirds of U.S. adults gather
     denying the rumour at issue, while others just              their news from social media, where Twitter is the
     comment it, or ask for additional evidence                  most used platform. However, the absence of a sys-
     on the rumour’s veracity. A shared task has                 tematic approach to do some form of fact and veracity
     been proposed at SemEval-2017 (Task 8, Sub-                 checking may also encourage the spread of rumourous
     Task A), which is focused on rumour stance                  stories and misinformation [PVV13]. Indeed, in social
     classification in English tweets. The goal is               media, unverified information can spread very quickly
     predicting user stance towards emerging ru-                 and becomes viral easily, enabling the diffusion of false
     mours in Twitter, in terms of supporting,                   rumours and fake information.
     denying, querying, or commenting the original                  Within this scenario, it is crucial to analyse peo-
     rumour, looking at the conversation threads                 ple attitudes towards rumours in social media and to
     originated by the rumour. This paper de-                    resolve their veracity as soon as possible. Several ap-
     scribes a new approach to this task, where the              proaches have been proposed to check the rumour ve-
     use of conversation-based and affective-based               racity in social media [SSW+ 17]. This paper focus
     features, covering different facets of affect, is           on a stance-based analysis of event-related rumours,
     explored. Our classification model outper-                  following the approach proposed at SemEval-2017 in
     forms the best-performing systems for stance                the new RumourEval shared task (Task 8, sub-task
     classification at SemEval-2017 showing the ef-              A) [DBL+ 17]. In this task English tweets from conver-
     fectiveness of the feature set proposed.                    sation threads, each associated to a newsworthy event
Copyright © CIKM 2018 for the individual papers by the papers'   and the rumours around it, are provided as data. The
authors. Copyright © CIKM 2018 for the volume as a collection    goal is to determine whether a tweet in the thread
by its editors. This volume and its papers are published under
                                                                 is supporting, denying, querying, or commenting the
                                                                 original rumour which started the conversation. It can
the Creative Commons License Attribution 4.0 International (CC
                                                                 be considered a stance classification task, where we
BY 4.0).
                                                                    1 http://www.journalism.org/2017/09/07/

                                                                 news-use-across-social-media-platforms-2017/
have to predict the user’s stance towards the rumour                          Development Data
from a tweet, in the context of a given thread. This            Rumour                  S    D        Q      C
task has been defined as open stance classification task        Germanwings             69   11       28     173
and is conceived as a key step in rumour resolution,                             Training Data
by providing an analysis of people reactions towards            Rumour                  S    D        Q      C
an emerging rumour [PVV13, ZLP+ 16]. The task is                Charlie Hebdo           239 58        53     721
also different from detecting stance towards a specific         Ebola Essien            6    6        1      21
target entity [MKS+ 16].                                        Ferguson                176 91        99     718
   Contribution We describe a novel classification              Ottawa Shooting         161 76        63     477
approach, by proposing a new feature matrix, which              Prince Toronto          21   7        11     64
includes two new groups: (a) features exploiting the            Putin Missing           18   6        5      33
conversational structure of the dataset [DBL+ 17]; (b)          Sydney Siege            220 89        98     700
affective features relying on the use of a wide range           Total                   841 333       330    2734
of affective resources capturing different facets of sen-                         Testing Data
timent and other affect-related phenomena. We were              Rumour                  S    D        Q      C
also inspired by the fake news study on Twitter in              Ferguson                15   4        17     66
[VRA18], showing that false stories inspire fear, dis-          Ottawa Shooting         10   2        20     91
gust, and surprise in replies, while true stories inspire       Sydney Siege            5    1        12     69
anticipation, sadness, joy, and trust. Meanwhile, from          Charlie Hebdo           9    2        8      74
a dialogue act perspective, the study of [NS13] found           Germanwings             11   5        15     71
that a relationship exists between the use of an affec-         Marina Joyce            5    30       10     110
tive lexicon and the communicative intention of an ut-          Hillary’s Illness       39   27       24     297
terance which includes AGREE-ACCEPT (support),                  Total                   94   71       106    778
REJECT (deny), INFO-REQUEST (question), and
OPINION (comment). They exploited several LIWC
categories to analyse the role of affective content.         Table 1: Semeval-2017 Task 8 (A) dataset distribution.
   Our results show that our model outperforms the
                                                                Dataset2 The data for this task are taken from
state of the art on the Semeval-2017 benchmark
                                                             Twitter conversations about news-related rumours col-
dataset. Feature analysis highlights the contribution
                                                             lected by [ZLP+ 16]. They were annotated using
of the different feature groups, and error analysis is
                                                             four labels (SDQC): support - S (when tweet’s au-
shedding some light on the main difficulties and chal-
                                                             thor support the rumour veracity); deny -D (when
lenges which still need to be addressed.
                                                             tweet’s author denies the rumour veracity); query -
   Outline The paper is organized as follows. Sec-           Q (when tweet’s author ask for additional informa-
tion 2 introduces the SemEval-2017 Task 8. Section 3         tion/evidence); comment -C (when tweet’s author just
describes our approach to deal with open stance classi-      make a comment and does not give important informa-
fication by exploiting different groups of features. Sec-    tion to asses the rumour veracity). The distribution
tion 4 describes the evaluation and includes a quali-        consists of three sets: development, training and test
tative error analysis. Finally, Section 5 concludes the      sets, as summarized in Table 1, where you can see also
paper and points to future directions.                       the label distribution and the news related to the ru-
                                                             mors discussed. Training data consist of 297 Twitter
                                                             conversations and 4,238 tweets in total with related
2    SemEval-2017 Task 8: RumourEval                         direct and nested replies, where conversations are as-
The SemEval-2017 Task 8 Task A [DBL+ 17] has as              sociated to seven different breaking news. Test data
its main objective to determine the stance of the users      consist of 1049 tweets, where two new rumourous top-
in a Twitter thread towards a given rumour, in terms         ics were added.
of support, denying, querying or commenting (SDQC)              Participants Eight teams participated in the task.
on the original rumour. Rumour is defined as a “cir-         The best performing system was developed by Tur-
culating story of questionable veracity, which is appar-     ing (78.4 in accuracy). ECNU, MamaEdha, UWa-
ently credible but hard to verify, and produces sufficient   terloo, and DFKI-DKT utilized ensemble classifier.
skepticism and/or anxiety so as to motivate finding out      Some systems also used deep learning techniques, in-
the actual truth” [ZLP+ 15]. The task was very timing        cluding Turing, IKM, and MamaEdha. Meanwhile,
due to the growing importance of rumour resolution           NileTRMG and IITP used classical classifier (SVM) to
in the breaking news and to the urgency of preventing           2 http://alt.qcri.org/semeval2017/task8/index.php?id=

the spreading of misinformation.                             data-and-tools
build their systems. Most of the participants exploited           3.3    Affective Based Features
word embedding to construct their feature space, be-
                                                                  The idea to use affective features in the context of our
side the Twitter domain features.
                                                                  task was inspired by recent works on fake news detec-
                                                                  tion, focusing on emotional responses to true and false
3     Proposed Method                                             rumors [VRA18], and by the work in [NS13] reflecting
We developed a new model by exploiting several                    on the role of affect in dialogue acts [NS13]. Multi-
stylistic and structural features characterizing Twit-            faceted affective features have been already proven to
ter language. In addition, we propose to utilize                  be effective in some related tasks [LFPR16], including
conversational-based features by exploiting the pecu-             the stance detection task proposed at SemEval-2016
liar tree structure of the dataset. We also explored the          (Task 6).
use of affective based feature by extracting information             We used the following affective resources relying on
from several affective resources including dialogue-act           different emotion models.
inspired features.
                                                                        Emolex: it contains 14,182 words associated
3.1    Structural Features                                              with eight primary emotion based on the Plutchik
                                                                        model [MT13, Plu01].
They were designed taking into account several Twit-
ter data characteristics, and then selecting the most                   EmoSenticNet(EmoSN): it is an enriched ver-
relevant features to improve the classification perfor-                 sion of SenticNet [COR14] including 13,189 words
mance. The set of structural features that we used is                   labeled by six Ekman’s basic emotion [PGH+ 13,
listed below.                                                           Ekm92].
      Retweet Count: The number of retweet of each                      Dictionary of Affect in Language (DAL): in-
      tweet.                                                            cludes 8,742 English words labeled by three scores
                                                                        representing three dimensions: Pleasantness, Ac-
      Question Mark: presence of question mark ”?”;                     tivation and Imagery [Whi09].
      binary value (0 and 1).
                                                                        Affective Norms for English Words
      Question Mark Count: number of question
                                                                        (ANEW): consists of 1,034 English words
      marks present in the tweet.
                                                                        [BL99] rated with ratings based on the Valence-
      Hashtag Presence: this feature has a binary                       Arousal-Dominance (VAD) model [OST57].
      value 0 (if there is no hashtag in the tweet) or 1
                                                                        Linguistic Inquiry and Word Count
      (if there is at least one hashtag in the tweet).
                                                                        (LIWC): this psycholinguistic resource [PFB01]
      Text Length: number of characters after remov-                    includes 4,500 words distributed into 64 emo-
      ing Twitter markers such as hashtags, mentions,                   tional categories including positive (PosEMO)
      and URLs.                                                         and negative (NegEMO).

      URL Count: number of URL links in the tweet.                3.4    Dialogue-Act Features

3.2    Conversation Based Features                                We also included additional 11 categories from bf
                                                                  LIWC, which were already proven to be effective in
These features are devoted to exploit the peculiar char-          dialogue-act task in previous work [NS13]. Basically,
acteristics of the dataset, which have a tree structure           these features are part of the affective feature group,
reflecting the conversation thread3 .                             but we present them separately because we are in-
      Text Similarity to Source Tweet: Jaccard                    terested in exploring the contribution of such feature
      Similarity of each tweet with its source tweet.             set separately. This feature set was obtained by se-
                                                                  lecting 4 communicative goals related to our classes
      Text Similarity to Replied Tweet: the degree                in the stance task: agree-accept (support), reject
      of similarity between the tweet with the previous           (deny), info-request (question), and opinion (com-
      tweet in the thread (the tweet is a reply to that           ment). The 11 LIWC categories include:
      tweet).
                                                                        Agree-accept: Assent, Certain, Affect;
      Tweet Depth: the depth value is obtained by
      counting the node from sources (roots) to each                    Reject: Negate, Inhib;
      tweet in their hierarchy.                                         Info-request: You, Cause;
    3 The implementation of these features is inspired from un-

published shared code [Gra17].                                          Opinion: Future, Sad, Insight, Cogmech.
      No.   Systems                        Accuracy                                S       D       Q       C
      1.    Turing’s System                78.4                    Support         27      0       3       64
      2.    Aker et al. System             79.02                   Deny            2       0       1       68
      3.    Our System                     79.5                    Query           0       0       50      56
            RumourEval Baseline            74.1                    Comment         13      0       8       757


Table 2: Results and comparison with state of the art                       Table 3: Confusion Matrix
4     Experiments, Evaluation and Analy-                                           S       D       Q       C
      sis                                                          Support         39      14      5       13
We used the RumourEval dataset from SemEval-2017                   Deny            8       28      5       30
Task 8 described in Section 2. We defined the rumour               Query           2       3       62      4
stance detection problem as a simple four-way classi-              Comment         14      14      2       41
fication task, where every tweet in the dataset (source
and direct or nested reply) should be classified into one
                                                                Table 4: Confusion Matrix on Balanced Dataset
among four classes: support, deny, query, and com-
ment. We conducted a set of experiments in order to           the scores for each class in order to get a better un-
evaluate and analyze the effectiveness of our proposed        derstanding of our classifier’s performance.
feature set.4 .                                                  Using only conversational, affective, or dialogue-act
    The results are summarized in Table 2, showing            features (without structural features) did not give a
that our system outperforms all of the other systems          good classification result. Set B (conversational fea-
in terms of accuracy. Our best result was obtained            tures only) was not able to detect the query and deny
by a simple configuration with a support vector clas-         classes, while set C (affective features only) and D
sifier with radial basis function (RBF) kernel. Our           (dialogue-act features only) failed to catch the sup-
model performed better than the best-performing sys-          port, query, and deny classes. Conversational features
tems in SemEval 2017 Task 8 Subtask A (Turing team,           were able to improve the classifier performance signif-
[KLA17]), which exploited deep learning approach by           icantly, especially in detecting the support class. Sets
using LTSM-Branch model. In addition, we also got a           E, H, I, and K which utilize conversational features in-
higher accuracy than the system described in [ADB17],         duce an improvement on the prediction of the support
which exploits a Random Forest classifier and word            class (roughly from 0.3 to 0.73 on precision). Mean-
embeddings based features.                                    while, the combination of affective and dialogue-act
    We experimented with several classifiers, including       features was able to slightly improve the classification
Naive Bayes, Decision Trees, Support Vector Machine,          of the query class. The improvement can be seen from
and Random Forest, noting that SVM outperforms the            set E to set K where the F1 -score of query class in-
other classifiers on this task. We explored the pa-           creased from 0.52 to 0.58. Overall, the best result was
rameter space by tuning the SVM hyperparameters,              obtained by the K set which encompasses all sets of
namely the penalty parameter C, kernel type, and class        features. It is worth to be noted that in our best con-
weights (to deal with class imbalance). We tested sev-        figuration system, not all of affective and dialogue-act
eral values for C (0.001, 0.01, 0.1, 1, 10, 100, and 1000),   features were used in our feature vector. After several
four different kernels (linear, RBF, polynomial, and          optimization steps, we found that some features were
sigmoid) and weighted the classes based on their dis-         not improving the system’s performance. Our final list
tribution in the training data. The best result was           of affective and dialogue-act based features includes:
obtained with C=1, RBF kernel, and without class              DAL Activation, ANEW Dominance, Emolex
weighting.                                                    Negative, Emolex Fear, LIWC Assent, LIWC
    An ablation test was conducted to explore the con-        Cause, LIWC Certain and LIWC Sad. There-
tribution of each feature set. Table 5 shows the result       fore, we have only 17 columns of features in the best
of our ablation test, by exploiting several feature sets      performing system covering structural, conversational,
on the same classifier (SVM with RBF kernel) 5 . This         affective and dialogue-act features.
evaluation includes macro-averages of precision, recall          We conducted a further analysis of the classification
and F1 -score as well as accuracy. We also presented          result obtained by the best performing system (79.50
    4 We
                                                              on accuracy). Table 3 shows the confusion matrix of
         built our system by using scikit-learn Python Li-
brary: http://scikit-learn.org/
                                                              our result. On the one hand, the system is able to
   5 Source code is available on the GitHub platform:         detect the comment tweets very well. However, this
https://github.com/dadangewp/SemEval2017-RumourEval           result is biased due to the number of comment data in
   Ablation Test                Overall                  Support                     Query                   Comment
 Set Features            Acc Prec Rec F1 Acc Prec Rec                 F1    Acc    Prec Rec     F1    Acc    Prec Rec F1
  A Structural          0.731 0.41 0.37 0.38 0.18 0.28 0.18          0.22   0.39   0.56 0.39   0.46   0.91   0.78 0.91 0.84
  B Conversational 0.767 0.42 0.31 0.33 0.29 0.93 0.29               0.44     0      0   0       0     1     0.76   1 0.87
  C Affective           0.742 0.19 0.25 0.21 0            0      0     0      0      0   0       0     1     0.74   1 0.85
  D Dialogue-Act        0.742 0.19 0.25 0.21 0            0      0     0      0      0   0       0     1     0.74   1 0.85
  E A+B                 0.783 0.54 0.43 0.45 0.29 0.73 0.29          0.41   0.42   0.62 0.42   0.52   0.96    0.8 0.96 0.87
  F A+C                 0.741 0.42 0.36 0.38 0.14 0.27 0.14          0.18   0.39   0.62 0.39   0.48   0.93   0.77 0.93 0.84
  G A+D                 0.736 0.42 0.37 0.38 0.18 0.3 0.18           0.23   0.37   0.59 0.37   0.45   0.92   0.77 0.92 0.84
  H E+C                 0.788 0.56 0.42 0.46 0.28 0.74 0.28           0.4   0.44    0.7 0.44   0.54   0.97    0.8 0.97 0.87
  I E+D                 0.784 0.53 0.43 0.46 0.3 0.65 0.3            0.41   0.45   0.67 0.45   0.54   0.96    0.8 0.96 0.87
  J F+D                 0.749 0.43 0.36 0.38 0.14 0.33 0.14          0.19   0.38   0.63 0.38   0.47   0.94   0.77 0.94 0.85
  K All Features       0.795 0.57 0.43 0.47 0.29 0.73 0.29           0.41   0.47   0.75 0.47   0.58   0.97    0.8 0.97 0.88
*deny class is not presented, since the score is always zero (0)


                                    Table 5: Ablation test on several feature sets.

the dataset. On the other hand, the system is failing                 tion. she needs help, but in the form of rehab
to detect denying tweets, which were falsely classified               #savemarinajoyce
into comments (68 out of 71)6 . Meanwhile, approxi-
                                                               Tweets like (da1) and (da2) seem to be more inclined
mately two thirds of supporting tweets and almost half
                                                               to show the respondent’s personal hatred towards the
of querying tweets were classified into the correct class
                                                               s1-tweet’s author than to deny the veracity of the ru-
by the system.
                                                               mour. In other words, they represent a peculiar form
   In order to assess the impact of class imbalance on
                                                               of denying the rumour, which is expressed by personal
the learning, we performed an additional experiment
                                                               attack and by showing negative attitudes or hatred
with a balanced dataset using the best performing con-
                                                               towards the rumour’s author. This is different from
figuration. We took a subset of the instances equally
                                                               denying by attacking the source tweet content, and it
distributed with respect to their class from the train-
                                                               was difficult to comprehend for our system, that often
ing set (330 instances for each class) and test set (71
                                                               misclassified such kind of tweets as comments.
instances for each class). As shown in Table 4, our
                                                               Noisy text, specific jargon, very short text. In
classifier was able to correctly predict the underrep-
                                                               (da1) and (da2) (as in many tweets in the test set), we
resented classes much better, although the overall ac-
                                                               also observe the use of noisy text (abbreviations, mis-
curacy is lower (59.9%). The result of this analysis
                                                               spellings, slang words and slurs, question statements
clearly indicates that class imbalance has a negative
                                                               without question mark, and so on) that our classifier
impact on the system performance.
                                                               struggles to handle . Moreover, especially in tweets
4.1    Error analysis                                          from the Marina Joyce rumour’s group, we found some
                                                               very short tweets in the denying class that do not pro-
We conducted a qualitative error analysis on the 215           vide enough information, e.g. tweets like “shut up!”,
misclassified in the test set, to shed some light on the       “delete”, and “stop it. get some help”.
issues and difficulties to be addressed in future work         Argumentation context. We also observed misclas-
and to detect some notable error classes.                      sification cases that seem to be related to a deeper
Denying by attacking the rumour’s author. An                   capability of dealing with the argumentation context
interesting finding from the analysis of the Marina            underlying the conversation thread.
Joyce rumour data is that it contains a lot of deny-
ing tweets including insulting comments towards the                   Rumour: Ferguson
author of the source tweet, like in the following cases:              Misclassified tweet:
                                                                      (arg1)@QuadCityPat @AP I join you in this
      Rumour: Marina Joyce                                            demand. Unconscionable.
      Misclassified tweets:                                           Misclassification type: deny (gold)      com-
      (da1) stfu you toxic sludge                                     ment (prediction)
      (da2) @sampepper u need rehab                                   Source tweet:
      Misclassification type: deny (gold)   com-                      (s2) @AP I demand you retract the lie that
      ment (prediction)                                               people in #Ferguson were shouting “kill the
      Source tweet:                                                   police”, local reporting has refuted your ugly
      (s1) Anyone who knows Marina Joyce per-                         racism
      sonally knows she has a serious drug addic-
  6 A similar observation is reported by the best team at      Here the misclassified tweet is a reply including an ex-
Semeval-2017 [KLA17].                                          plicit expression of agreement with the author of the
source tweet (“I join you”). Tweet (s2) is one of the              a pickle jar.
rare cases of source tweets denying the rumor (source              (fg2) @mitchellvii Also, except for having
tweets in the RumourEval17 dataset are mostly sup-                 a 24/7 MD by her side giving her Val-
porting the rumor at issue). Our hypothesis is that it             ium injections, Hillary is in good health!
is difficult for a system to detect such kind of stance            https://t.co/GieNxwTXX7
without a deeper comprehension of the argumentation                (fg3) @mitchellvii @JoanieChesnutt At the
context (e.g., if the author’s stance is denying the ru-           very peak yes, almost time to go down a cliff
mor, and I agree with him, then I am denying the                   and into the earth.
rumor as well). In general, we observed that when the              Misclassification type: support (gold)
source tweet is annotated by the deny label, most of               comment (prediction)
denying replies of the thread include features typical             Source tweet:
of the support class (and vice versa), and this was a              (s4) Except for the coughing, fainting, appar-
criticism.                                                         ent seizures and ”short-circuits,” Hillary is in
Mixed cases. Furthermore, we found some border-                    the peak of health.
line mixed cases in the gold standard annotation. See
for instance the following case:                             All misclassified tweets (fg1-fg3) from the Hillary’s ill-
                                                             ness data are replies to a source tweet (s4), which is
    Rumour: Ferguson                                         featured by sarcasm. In such replies authors support
    Misclassified tweet:                                     the rumor by echoing the sarcastic tone of the source
    (mx1) @MichaelSkolnik @MediaLizzy Oh                     tweet. Such more sophisticated cases, where the sup-
    do tell where they keep track of ”vigilante”             portive attitude is expressed in an implicit way, were
    stats. That’s interesting.                               challenging for our classifier, and they were quite sys-
    Misclassification type:    query (gold)                  tematically misclassified as simple comments.
    comment (prediction)
    Source tweet:
    (s3) Every 28 hours a black male is killed               5     Conclusion
    in the United States by police or vigilantes.
                                                             In this paper we proposed a new classification model
    #Ferguson
                                                             for rumour stance classification. We designed a set
                                                             of features including structural, conversation-based,
Tweet (mx1) is annotated with a query label rather           affective and dialogue-act based feature. Experi-
than as a comment (our system prediction), but we            ments on the SemEval-2017 Task 8 Subtask A dataset
can observe the presence of a comment (“That’s inter-        show that our system based on a limited set of well-
esting”) after the request for clarification, so it seems    engineered features outperforms the state-of-the-art
to be a kind of mixed case, where both labels make           systems in this task, without relying on the use of
sense.                                                       sophisticated deep learning approaches. Although
Citation of the source’s tweet. We have noticed              achieving a very good result, several research chal-
many misclassified cases of replying tweets with er-         lenges related to this task are left open. Class im-
ror pattern support (gold)        comment (our predic-       balance was recognized as one the main issues in this
tion), where the text contains a literal citation of the     task. For instance, our system was struggling to de-
source tweet, like in the following tweet: THIS HAS          tect the deny class in the original dataset distribution,
TO END “@MichaelSkolnik: Every 28 hours a black              but it performed much better in that respect when we
male is killed in the United States by police or vigi-       balanced the distribution across the classes.
lantes. #Ferguson” (the text enclosed in quotes is the          A re-run of the RumourEval shared task has been
source tweet). Such kind of mistakes could be maybe          proposed at SemEval 20197 and it will be very inter-
addressed by applying some pre-processing to the data,       esting to participate to the new task with an evolution
for instance by detecting the literal citation and replac-   of the system here described.
ing it with a marker.
Figurative language devices. Finally, the use of             Acknowledgements
figurative language (e.g., sarcasm) is also an issue that
should be considered for the future work. Let us con-        Endang Wahyu Pamungkas, Valerio Basile and Vi-
sider for instance the following misclassified tweets:       viana Patti were partially funded by Progetto di Ate-
                                                             neo/CSP 2016 (Immigrants, Hate and Prejudice in So-
    Rumour: Hillary’s Illness                                cial Media, S1618 L2 BOSC 01).
    Misclassified tweets:
    (fg1) @mitchellvii True, after all she can open              7 http://alt.qcri.org/semeval2019/
References                                               [NS13]    Nicole Novielli and Carlo Strapparava. The
                                                                   role of affect analysis in dialogue act iden-
[ADB17]    Ahmet Aker, Leon Derczynski, and Kalina
                                                                   tification. IEEE Transactions on Affective
           Bontcheva. Simple open stance classi-
                                                                   Computing, 4(4):439–451, 2013.
           fication for rumour analysis. In Proc.
           of RANLP 2017, pages 31–39. INCOMA            [OST57]   C.E. Osgood, G.J. Suci, and P.H. Tenen-
           Ltd., 2017.                                             baum. The Measurement of meaning. Uni-
[BL99]     Margaret M Bradley and Peter J Lang.                    versity of Illinois Press, Urbana:, 1957.
           Affective norms for english words (anew):     [PFB01]   James W Pennebaker, Martha E Francis,
           Instruction manual and affective ratings.               and Roger J Booth. Linguistic inquiry and
           Technical report, Technical Report C-1,                 word count (LIWC): LIWC 2001. Mahway:
           The Center for Research in Psychophysi-                 Lawrence Erlbaum Associates, 2001.
           ology, University of Florida., 1999.
                                                         [PGH+ 13] Soujanya Poria, Alexander Gelbukh, Amir
[COR14]    Erik Cambria, Daniel Olsher, and Dheeraj                Hussain, Newton Howard, Dipankar Das,
           Rajagopal.     Senticnet 3: a common                    and Sivaji Bandyopadhyay. Enhanced sen-
           and common-sense knowledge base for                     ticnet with affective labels for concept-
           cognition-driven sentiment analysis. In                 based opinion mining. IEEE Intelligent
           Proc. of AAAI 2014, 2014.                               Systems, 28(2):31–38, 2013.
[DBL+ 17] Leon Derczynski, Kalina Bontcheva, Maria
                                                         [Plu01]   Robert Plutchik. The nature of emotions.
          Liakata, Rob Procter, Geraldine Wong
                                                                   American scientist, 89(4):344–350, 2001.
          Sak Hoi, and Arkaitz Zubiaga. Semeval-
          2017 task 8: Rumoureval: Determining ru-       [PVV13]   Rob Procter, Farida Vis, and Alex Voss.
          mour veracity and support for rumours. In                Reading the riots on twitter: method-
          Proc. of SemEval-2017, pages 69–76. ACL,                 ological innovation for the analysis of big
          2017.                                                    data. International journal of social re-
[Ekm92]    Paul Ekman. An argument for basic emo-                  search methodology, 16(3):197–214, 2013.
           tions. Cognition & emotion, 6(3-4):169–       [SSW+ 17] Kai Shu, Amy Sliva, Suhang Wang, Jiliang
           200, 1992.                                              Tang, and Huan Liu. Fake news detection
[Gra17]    David Graf. Semeval-2017-t8, June 2017.                 on social media: A data mining perspec-
                                                                   tive. ACM SIGKDD Explorations Newslet-
[KLA17]    Elena Kochkina, Maria Liakata, and Is-                  ter, 19(1):22–36, 2017.
           abelle Augenstein. Turing at SemEval-2017
           Task 8: Sequential Approach to Rumour         [VRA18]   Soroush Vosoughi, Deb Roy, and Sinan
           Stance Classification with Branch-LSTM.                 Aral. The spread of true and false news on-
           In Proc. of SemEval-2017, pages 475–480.                line. Science, 359(6380):1146–1151, 2018.
           ACL, 2017.
                                                         [Whi09]   Cynthia Whissell. Using the revised dic-
[LFPR16] Mirko Lai, Delia Irazú Hernández Farı́as,               tionary of affect in language to quantify
         Viviana Patti, and Paolo Rosso. Friends                   the emotional undertones of samples of
         and enemies of Clinton and Trump: us-                     natural language. Psychological reports,
         ing context for detecting stance in political             105(2):509–521, 2009.
         tweets. In Proc. of MICAI 2016, volume
         10061 of LNCS, pages 155–168. Springer,         [ZLP+ 15] Arkaitz Zubiaga, Maria Liakata, Rob Proc-
         2016.                                                     ter, Kalina Bontcheva, and Peter Tolmie.
                                                                   Towards detecting rumours in social me-
[MKS+ 16] Saif Mohammad, Svetlana Kiritchenko,                     dia. In AAAI Workshop: AI for Cities,
          Parinaz Sobhani, Xiao-Dan Zhu, and Colin                 2015.
          Cherry. Semeval-2016 task 6: Detecting
          stance in tweets. In Proc. of SemEval 2016,    [ZLP+ 16] Arkaitz Zubiaga, Maria Liakata, Rob Proc-
          pages 31–41. ACL, 2016.                                  ter, Geraldine Wong Sak Hoi, and Peter
                                                                   Tolmie. Analysing how people orient to
[MT13]     Saif M Mohammad and Peter D Turney.                     and spread rumours in social media by
           Crowdsourcing a word–emotion associa-                   looking at conversational threads. PloS
           tion lexicon. Computational Intelligence,               one, 11(3):e0150989, 2016.
           29(3):436–465, 2013.