               Profiling fake news spreaders:
      stylometry, personality, emotions and embeddings
                         Notebook for PAN at CLEF 2020

                Elisabetta Fersini, Justin Armanini, and Michael D’Intorni

               University of Milano-Bicocca, Viale Sarca 336, Milan - Italy
       elisabetta.fersini@unimib.it, j.armanini@campus.unimib.it,
                         m.dintorni@campus.unimib.it



        Abstract This paper describes our proposed solution for the Profiling Fake News
        Spreaders on Twitter shared task at PAN 2020 [23]. The task consists in determin-
        ing whether the author of a given set of Twitter posts is a fake news spreader or
        not, for both the English and Spanish languages. The proposed approach models
        both types of users according to four main types of characteristics, i.e.
        stylometry, personality, emotions and feed embeddings. Our system achieved an
        accuracy of 60% on the English dataset and 72% on the Spanish one.


1     Introduction

The problem of fake news and rumour detection has gained considerable attention in
recent years. Users who get their news from social media risk being exposed to false
or misleading content such as hoaxes, rumours and click-bait headlines. The main
advantage for fake news providers is to generate traffic to specific web sites
in order to monetize through advertising [2] or to manipulate politically related facts
[24]. Since the beginning, the massive spread of fake news has been identified as a
major global risk [32]. Countering fake news is a challenging problem that can be
addressed at two different levels: when fake contents are created and when fake contents
are spread. Concerning the recognition of fake news, several approaches have been
proposed in the state of the art, ranging from unsupervised [33] to semi-supervised
[25] and supervised [11] ones. On the other hand, to the best of our knowledge, there
is no study on predicting whether a given user is inclined to share fake or real news
in online social networks.
    The proposed approach, presented for the "Profiling Fake News Spreaders on Twit-
ter" challenge organized within the PAN@CLEF initiative, is one of the first attempts
to prevent the intentional or unintentional diffusion of inaccurate information. In
particular, we propose to characterize the profiles of fake news and real news spreaders
by exploiting four types of characteristics, i.e. stylometry, personality, emotions and
embeddings.
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons Li-
    cense Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessa-
    loniki, Greece.
2     Related work

Fake news is a phenomenon that has grown exponentially during the last ten years
[17]. The leading cause is that fake news can be created and shared online very
quickly and cheaply compared to traditional news media such as newspapers. Most
of the literature in this research area concerns the recognition of fake news. In
particular, several approaches are nowadays available in the state of the art, ranging
from methods based on stylistic features and patterns, to those focused on source
credibility, to those centred on propagation dynamics.
     Concerning the approaches grounded on stylistic features, the content is commonly
represented by a set of characteristics [26,19,24,1,6,5,35,7,22] that are subsequently
used by a machine learning approach. Examples of such features are readability, psy-
cholinguistic features, punctuation and syntax. As for the methods focused on source
credibility, most of them evaluate the users that have created potentially false
information by using several cues, such as emotions [8,9], posting and re-posting
behaviour [30,15] and content-specific features [10]. Regarding the approaches focused
on the propagation dynamics of fake news, most of them are based on epidemic models,
which mathematically describe the progression of an infectious disease [31,16,3].
However, defining a proper epidemic model for fake news is still in its infancy, since
the underlying assumptions in many cases do not match real scenarios.
     In all of the above-mentioned approaches, users play a key role in the creation
and propagation of fake news by consuming and spreading contents that could be fake
or real. To the best of our knowledge, few studies [27,28,29] focus on profiling possible
fake news spreaders in online social media. In this paper, building on the main find-
ings of these approaches, we model fake news spreaders by profiling users through
the most promising characteristics available in the state of the art. The proposed
method, based on stylometry, personality, emotions and embeddings, is detailed in
the following section.


3     Proposed Method

The proposed system aims at distinguishing authors that have shared some fake news
in the past from those that have never done it, by characterizing each user according to
the following features:

 – Stylometry: the writing style of a user can reveal whether they are mainly inclined
   to fake or real contents. To this purpose, for each user, we estimated a stylometric
   profile by considering language usage, punctuation adoption and part-of-speech
   frequencies;
 – Personality: the behaviour of users in social media, such as the posts that they
   create or the contents that they share, allows us to infer sensitive information such
   as personality. Taking into account the Twitter feeds, each user can be associated
   with a given personality type according to their communication style;
 – Emotion: since real news presents its contents without attempting to affect the opin-
   ion of the reader while, on the other hand, fake news takes advantage of the readers'
   sensibility, we modelled the emotions conveyed by each user through the spread text;
 – Embeddings: since fake news spreaders should be inclined, intentionally or unin-
   tentionally, to share specific topics of interest, an embedding representation of such
   contents has been derived.

A summary of the proposed vector representation for each user is reported in Figure 1,
where 8 features describe the emotions of the user, 1 feature its personality, 32
features its writing style and 512 features its content embeddings.




                          Figure 1. Proposed user representation


    In the following sub-sections, the features extracted for modeling fake and real news
spreaders are detailed. Once all the above-mentioned characteristics have been ex-
tracted for each user, a Support Vector Machine (SVM) classifier with a linear ker-
nel and default parameters, according to the scikit-learn implementation [18], has been
adopted for distinguishing between fake and real news spreaders. The proposed model
has been run on the TIRA architecture [21].
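The final classification step can be sketched as follows. The feature blocks and their dimensions follow Figure 1, while the random values and all variable names are purely illustrative placeholders, not the actual feature extraction:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-user feature blocks (names are illustrative):
# 8 emotion + 1 personality + 32 stylometric + 512 embedding
# features = 553-dimensional vector per user.
rng = np.random.default_rng(0)
n_users = 600
X = np.hstack([
    rng.random((n_users, 8)),            # emotion frequencies
    rng.integers(0, 16, (n_users, 1)),   # MBTI type id
    rng.random((n_users, 32)),           # stylometric features
    rng.random((n_users, 512)),          # mean tweet embedding
])
y = rng.integers(0, 2, n_users)          # 0 = real, 1 = fake news spreader

clf = SVC(kernel="linear")               # default parameters, as in the paper
clf.fit(X, y)
print(X.shape)  # (600, 553)
```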

3.1   Stylometric characteristics
Stylometric features can be used as baseline characteristics to profile fake news and real
news spreaders and, therefore, to train a model to distinguish them. The stylometric
features investigated in this paper are the following:
 – language usage: average number of sentences, average number of words, average
   number of # and @ symbols, frequency of unique words, frequency of complex
   words (more than 5 characters), average number of characters per word,
   frequency of emoji, frequency of offensive words, frequency of stretched words,
   frequency of upper-case words, frequency of words starting with an upper-case
   letter, frequency of named entities;
 – part-of-speech: frequency of verbs, auxiliary verbs, adjectives, superlative adjec-
   tives, superlative relative adjectives, comparative adjectives, nouns, conjunctions,
   adverbs, articles, indefinite articles, pronouns, numbers, first-person singular pro-
   nouns, first-person plural pronouns, second-person singular pronouns, second-person
   plural pronouns, third-person singular masculine pronouns, third-person singular
   feminine pronouns, third-person plural masculine pronouns, third-person plural
   feminine pronouns;
 – punctuation marks: frequency of punctuation overall, colons, semicolons, exclama-
   tion marks, question marks and quotes.
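A few of the language-usage features above can be sketched as follows. The function name and the exact tokenization are illustrative assumptions; only the thresholds (e.g. complex words have more than 5 characters) come from the list:

```python
import re
from statistics import mean

def stylometric_features(tweets):
    """A few illustrative stylometric features for one user's feed."""
    words = [w for t in tweets for w in re.findall(r"\w+", t)]
    n = len(words) or 1
    return {
        "avg_words_per_tweet": mean(len(re.findall(r"\w+", t)) for t in tweets),
        "hashtag_freq": sum(t.count("#") for t in tweets) / len(tweets),
        "mention_freq": sum(t.count("@") for t in tweets) / len(tweets),
        "unique_word_freq": len({w.lower() for w in words}) / n,
        "complex_word_freq": sum(len(w) > 5 for w in words) / n,   # > 5 chars
        "uppercase_word_freq": sum(w.isupper() for w in words) / n,
        "exclamation_freq": sum(t.count("!") for t in tweets) / len(tweets),
    }

feats = stylometric_features(["BREAKING news!! #shock @user", "Just a calm update."])
```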


3.2   Personality traits

In order to validate the hypothesis that some personality traits are more prone to
spreading fake news than others, we exploited a model based on the Myers-Briggs Type
Indicator (MBTI) [14] to predict the personality type of a given user. In particular, 16
distinct personality types are detected by a four-axis model that combines the following
dichotomies:

 – Introversion (I) vs Extroversion (E)
 – Intuition (N) vs Sensing (S)
 – Thinking (T) vs Feeling (F)
 – Judging (J) vs Perceiving (P)
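The 16 types arise simply from all combinations of the four dichotomies, which can be enumerated as:

```python
from itertools import product

# Each axis contributes one letter; 2^4 = 16 personality types.
dichotomies = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]
types = ["".join(t) for t in product(*dichotomies)]
print(len(types))  # 16
print(types[:3])   # ['INTJ', 'INTP', 'INFJ']
```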

The choice of the MBTI model has been motivated by the hypothesis that real news
spreaders are more likely to belong to the T and J types, being more predisposed
to reasoning and accurate decision making. On the contrary, we argue that fake news
spreaders should have a personality of type E or F, being more inclined to act according
to their feelings. In order to detect the personality type of each user, we adopted an
MBTI personality classification system that takes the social media posts of a given
user as input and produces a prediction of the author's personality type as output. To
accomplish this task, only for the English language, we exploited a supervised model1
based on a Naive Bayes classifier trained on a publicly available Kaggle dataset2 .
The MBTI classifier consists of two main components: (1) pre-processing of the Twitter
feeds posted by the user and (2) a training/inference mechanism based on Naive Bayes
for predicting the personality of a user given their posts.
    The pre-processing component, based on NLTK [12], lemmatizes the text in order
to transform the inflected forms of each word into its lemma. Then, through the Keras
word tokenizer, the 2,500 most common lemmatized words are retained for the
subsequent steps. N-grams and word vectors for hashtags, emoticons and phrases are
created using the TF-IDF representation. Concerning the training and inference
mechanisms, a Naive Bayes text classifier is adopted for predicting the personality
of a user given their Twitter feeds.
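Assuming a scikit-learn-style implementation, the TF-IDF plus Naive Bayes core of this pipeline could look as follows. The toy posts and labels are invented, and the NLTK lemmatization step described above is omitted for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the Kaggle MBTI dataset
# (posts labelled with one of the 16 personality types).
posts = [
    "i love quiet evenings reading alone",
    "huge party tonight everyone is invited",
    "thinking through the logic step by step",
    "i just go with whatever feels right",
]
labels = ["INTJ", "ESFP", "INTP", "ENFP"]

# The 2,500 most common terms feed a Naive Bayes text classifier.
model = make_pipeline(
    TfidfVectorizer(max_features=2500),
    MultinomialNB(),
)
model.fit(posts, labels)
pred = model.predict(["reading alone tonight"])[0]
```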


3.3   Emotion-related features

Extracting the emotion-related features of each user is the first step for characterizing
their profile. In order to extract emotions, the "NRC Emotion Lexicon" [13] has been
used, which consists of a word list annotated with the eight emotions (anger, fear, antici-
pation, trust, surprise, sadness, joy and disgust) modelled by Plutchik's theory [20]. The
 1
   https://github.com/priyansh19/Classification-of-Personality-
   based-on-Users-Twitter-Data
 2
   https://www.kaggle.com/datasnaek/mbti-type
lexicon comprises 14,182 words and was created through crowdsourcing, starting from
an idea by Saif Mohammad and Peter Turney. Initially developed using only English
words, in 2017 it was extended to support multiple languages. For creating the
emotion-related features, the frequencies of words belonging to the eight emotions
have been estimated using all the Twitter feeds of a given user.
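The feature extraction reduces to counting lexicon hits per emotion over a user's feed. The tiny lexicon below is a hand-made stand-in for the real NRC resource, and the function name is illustrative:

```python
from collections import Counter
import re

# Hand-made stand-in for the NRC Emotion Lexicon [13], which maps
# 14,182 words to the eight Plutchik emotions.
NRC = {
    "attack": {"anger", "fear"},
    "hope":   {"anticipation", "trust", "joy"},
    "shock":  {"surprise"},
    "loss":   {"sadness"},
    "rotten": {"disgust"},
}
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def emotion_features(tweets):
    """Frequency of each emotion over all words in a user's feed."""
    words = [w.lower() for t in tweets for w in re.findall(r"\w+", t)]
    counts = Counter(e for w in words for e in NRC.get(w, ()))
    total = len(words) or 1
    return [counts[e] / total for e in EMOTIONS]

vec = emotion_features(["A shock attack", "there is hope"])
```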


3.4   User Embeddings

The last characteristics that we included to represent users are related to embed-
dings. The hypothesis is that similar fake news items could be expressed in a similar
way from a semantic point of view even if they are written in different ways from a
lexical perspective. In order to capture semantic similarities of fake news spreaders,
each user has been represented by a 512-D vector derived by an element-wise mean
aggregation over their tweet embeddings. To this purpose, we adopted the Universal
Sentence Encoder (v4) [4] developed by Google and available in the TensorFlow Hub
package. In particular, to capture common characteristics between the two considered
languages, the Multilingual Universal Sentence Encoder [34] has been adopted. This
model is an extension of the Universal Sentence Encoder Large that includes training
on multiple tasks across languages. The process for extracting the embedding
representation of a given user is reported in Figure 2.
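The aggregation itself is straightforward. In the sketch below, random arrays stand in for the 512-D vectors that the Multilingual Universal Sentence Encoder (loaded via tensorflow_hub) would produce for each tweet:

```python
import numpy as np

def user_embedding(tweet_embeddings):
    """Element-wise mean of the 512-D tweet vectors of one user."""
    return np.mean(np.asarray(tweet_embeddings), axis=0)

rng = np.random.default_rng(0)
tweets = rng.random((100, 512))   # 100 tweets, each a 512-D embedding
u = user_embedding(tweets)
print(u.shape)  # (512,)
```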




                                Figure 2. User embeddings




4     Experiments and Results

4.1   Experimental settings

For evaluating the performance of systems aimed at distinguishing fake news spread-
ers from real news spreaders, the task organizers provided both training and testing
datasets for the English and Spanish languages. Each training dataset is composed of
600 users, perfectly balanced between fake and real news spreaders, each with their
corresponding 100 messages. Analogous datasets have been provided for the testing
phase. The proposed method has been validated by measuring accuracy with a 10-fold
cross-validation strategy on the training set and on the testing set given for the competition.
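The cross-validation protocol can be reproduced with scikit-learn. The feature matrix below is random placeholder data, so the resulting scores themselves are meaningless:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((600, 553))        # placeholder user feature vectors
y = rng.integers(0, 2, 600)       # balanced fake / real labels

# Accuracy on each of the 10 folds, as reported in Table 1.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10,
                         scoring="accuracy")
print(scores.shape)  # (10,)
```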
4.2   Experimental results
We report in the following the results obtained by adopting a 10-fold cross-validation
strategy on the training dataset. Table 1 shows the accuracy obtained on each fold,
together with the average performance and its standard deviation, for the two languages.
We can easily note that the results on the Spanish dataset are slightly higher than
those obtained on the English one. This is likely due to the variability of topics
considered in the two languages.


                                            Fold
                       1   2     3    4    5    6     7    8    9 10 Avg. Std.
             English 0.70 0.60 0.53 0.80 0.60 0.67 0.63 0.57 0.57 0.57 0.62 0.08
             Spanish 0.77 0.73 0.77 0.67 0.73 0.87 0.80 0.73 0.60 0.80 0.75 0.07
                           Table 1. 10-fold cross-validation results.



     By analysing the performance of the proposed approach, we notice that the features
related to stylometry, emotions and embeddings contribute more to the recognition ca-
pabilities than the personality one. This is due to the inability of the adopted model
to really capture the personality traits of users given their Twitter feeds. Another
interesting insight relates to the contribution of the stylometric characteristics
to the final results: these features yielded a 5% accuracy improvement with respect
to using only emotions and embeddings. This reveals that this set of characteristics
can help to better distinguish fake and non-fake writing styles.
    Concerning the results of the shared task, the proposed approach achieved 60% of
accuracy for English and 72% for Spanish.

4.3   Conclusions and future work
This paper has presented our solution for the Profiling Fake News Spreaders on Twitter
shared task at PAN 2020. Our approach, based on modeling both fake and real news
spreaders using stylometry, personality, emotions and embeddings, has shown promising
results and pointed out interesting research directions. Concerning the obtained
results, the analysis of the considered characteristics has highlighted that stylometry
could play an important role for characterizing both profiles, while personality does
not contribute in a significant way. Regarding future work, some additional
characteristics should be considered. For instance, the age, geo-location and
education level of users could contribute to better distinguishing between the two
profiles. Other research directions focus on the analysis of syntactic patterns and
relationships between sentences.


References
 1. Biyani, P., Tsioutsiouliklis, K., Blackmer, J.: "8 amazing secrets for getting more
    clicks": Detecting clickbaits in news streams using article informality. In: Proceedings of
    the Thirtieth AAAI Conference on Artificial Intelligence. pp. 94–100. AAAI'16,
    AAAI Press (2016)
 2. Braun, J.A., Eklund, J.L.: Fake news, real money: Ad tech platforms, profit-driven hoaxes,
    and the business of journalism. Digital Journalism 7(1), 1–21 (2019)
 3. Carchiolo, V., Longheu, A., Malgeri, M., Mangioni, G., Previti, M.: A trust-based news
    spreading model. In: International Workshop on Complex Networks. pp. 303–310. Springer
    (2018)
 4. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., St. John, R., Constant, N.,
    Guajardo-Cespedes, M., Yuan, S., Tar, C., Strope, B., Kurzweil, R.: Universal sentence
    encoder for English. In: Proceedings of the 2018 Conference on Empirical Methods in
    Natural Language Processing: System Demonstrations. pp. 169–174. Association for
    Computational Linguistics, Brussels, Belgium (Nov 2018)
 5. Feng, S., Banerjee, R., Choi, Y.: Syntactic stylometry for deception detection. In:
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics
    (Volume 2: Short Papers). pp. 171–175 (2012)
 6. Ghanem, B., Rosso, P., Rangel, F.: An Emotional Analysis of False Information in Social
    Media and News Articles. ACM Transactions on Internet Technology (TOIT) 20(2), 1–18
    (2020)
 7. Giachanou, A., Ríssola, E.A., Ghanem, B., Crestani, F., Rosso, P.: The role of personality
    and linguistic patterns in discriminating between fake news spreaders and fact checkers. In:
    International Conference on Applications of Natural Language to Information Systems. pp.
    181–192. Springer (2020)
 8. Giachanou, A., Rosso, P., Crestani, F.: Leveraging emotional signals for credibility
    detection. In: Proceedings of the 42nd International ACM SIGIR Conference on Research
    and Development in Information Retrieval. pp. 877–880. SIGIR '19, Association for
    Computing Machinery, New York, NY, USA (2019)
 9. Guo, C., Cao, J., Zhang, X., Shu, K., Liu, H.: Dean: Learning dual emotion for fake news
    detection on social media. arXiv preprint arXiv:1903.01728 (2019)
10. Gupta, A., Kumaraguru, P., Castillo, C., Meier, P.: TweetCred: Real-Time Credibility
    Assessment of Content on Twitter, pp. 228–243. Springer International Publishing, Cham
    (2014)
11. Kaliyar, R.K., Goswami, A., Narang, P., Sinha, S.: Fndnet–a deep convolutional neural
    network for fake news detection. Cognitive Systems Research 61, 32–44 (2020)
12. Loper, E., Bird, S.: Nltk: The natural language toolkit. In: Proceedings of the ACL-02
    Workshop on Effective Tools and Methodologies for Teaching Natural Language
    Processing and Computational Linguistics. pp. 63–70 (2002)
13. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word–emotion association lexicon.
    Computational Intelligence 29(3), 436–465 (2013)
14. Myers, I.B., Myers, P.B.: Gifts Differing: Understanding Personality Type. Hachette UK
    (2010)
15. O'Donovan, J., Kang, B., Meyer, G., Höllerer, T., Adalii, S.: Credibility in context: An
    analysis of feature distributions in twitter. In: 2012 International Conference on Privacy,
    Security, Risk and Trust and 2012 International Conference on Social Computing. pp.
    293–301. IEEE (2012)
16. Oktaviansyah, E., Rahman, A.: Predicting hoax spread in indonesia using sirs model. In:
    Journal of Physics: Conference Series. vol. 1490, p. 012059. IOP Publishing (2020)
17. Parikh, S.B., Patil, V., Atrey, P.K.: On the origin, proliferation and tone of fake news. In:
    2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). pp.
    135–140. IEEE (2019)
18. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
    Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. the
    Journal of machine Learning research 12, 2825–2830 (2011)
19. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic detection of fake
    news. In: Proceedings of the 27th International Conference on Computational Linguistics.
    pp. 3391–3401 (2018)
20. Plutchik, R.: Emotions: A general psychoevolutionary theory. Approaches to emotion 1984,
    197–219 (1984)
21. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture.
    In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World.
    Springer (Sep 2019)
22. Rangel, F., Franco-Salvador, M., Rosso, P.: A Low Dimensionality Representation for
    Language Variety Identification. In: International Conference on Intelligent Text Processing
    and Computational Linguistics. pp. 156–169. Springer (2016)
23. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling
    Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff,
    C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR
    Workshop Proceedings (Sep 2020), CEUR-WS.org
24. Rashkin, H., Choi, E., Jang, J.Y., Volkova, S., Choi, Y.: Truth of varying shades: Analyzing
    language in fake news and political fact-checking. In: Proceedings of the 2017 conference
    on empirical methods in natural language processing. pp. 2931–2937 (2017)
25. Saini, N., Singhal, M., Tanwar, M., Meel, P.: Multimodal, semi-supervised and
    unsupervised web content credibility analysis frameworks. In: 2020 4th International
    Conference on Intelligent Computing and Control Systems (ICICCS). pp. 948–955. IEEE
    (2020)
26. Schuster, T., Schuster, R., Shah, D.J., Barzilay, R.: The limitations of stylometry for
    detecting machine-generated fake news. Computational Linguistics pp. 1–12 (2020)
27. Shu, K., Wang, S., Liu, H.: Understanding user profiles on social media for fake news
    detection. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval
    (MIPR). pp. 430–435. IEEE (2018)
28. Shu, K., Wang, S., Liu, H.: Beyond news contents: The role of social context for fake news
    detection. In: Proceedings of the Twelfth ACM International Conference on Web Search
    and Data Mining. pp. 312–320. WSDM '19, Association for Computing Machinery,
    New York, NY, USA (2019)
29. Shu, K., Zhou, X., Wang, S., Zafarani, R., Liu, H.: The role of user profiles for fake news
    detection. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in
    Social Networks Analysis and Mining. pp. 436–439. ASONAM '19, Association for
    Computing Machinery, New York, NY, USA (2019)
30. Sitaula, N., Mohan, C.K., Grygiel, J., Zhou, X., Zafarani, R.: Credibility-based fake news
    detection. In: Disinformation, Misinformation, and Fake News in Social Media, pp.
    163–182. Springer (2020)
31. Tambuscio, M., Ruffo, G., Flammini, A., Menczer, F.: Fact-checking effect on viral hoaxes:
    A model of misinformation spread in social networks. In: Proceedings of the 24th
    international conference on World Wide Web. pp. 977–982 (2015)
32. Webb, H., Jirotka, M., Stahl, B.C., Housley, W., Edwards, A., Williams, M., Procter, R.,
    Rana, O., Burnap, P.: 'Digital wildfires': a challenge to the governance of social media? In:
    Proceedings of the ACM Web Science Conference. pp. 1–2 (2015)
33. Yang, S., Shu, K., Wang, S., Gu, R., Wu, F., Liu, H.: Unsupervised fake news detection on
    social media: A generative approach. In: Proceedings of the AAAI Conference on Artificial
    Intelligence. vol. 33, pp. 5644–5651 (2019)
34. Yang, Y., Cer, D., Ahmad, A., Guo, M., Law, J., Constant, N., Abrego, G.H., Yuan, S., Tar,
    C., Sung, Y.H., et al.: Multilingual universal sentence encoder for semantic retrieval. arXiv
    preprint arXiv:1907.04307 (2019)
35. Zhou, X., Jain, A., Phoha, V.V., Zafarani, R.: Fake news early detection: A theory-driven
    model. Digital Threats: Research and Practice 1(2), 1–25 (2020)