=Paper= {{Paper |id=Vol-3181/paper41 |storemode=property |title=Don’t Just Drop Them: Function Words as Features in COVID-19 Related Fake News Classification on Twitter |pdfUrl=https://ceur-ws.org/Vol-3181/paper41.pdf |volume=Vol-3181 |authors=Pascal Schröder |dblpUrl=https://dblp.org/rec/conf/mediaeval/Schroder21 }} ==Don’t Just Drop Them: Function Words as Features in COVID-19 Related Fake News Classification on Twitter== https://ceur-ws.org/Vol-3181/paper41.pdf
Don’t Just Drop Them: Function Words as Features in COVID-19
          Related Fake News Classification on Twitter
                                                                         Pascal Schröder
                                                               Radboud University, Netherlands
                                                                   pascal.schroeder@ru.nl

ABSTRACT
This research shows that function words can be useful as features for machine learning models tasked with detecting conspiratorial content in COVID-19 related Twitter posts. A significance test exposes that the distribution of function words between fake and legitimate content varies greatly. Further, a support vector machine classifier is demonstrated to perform above chance when using function word-only features, achieving a Matthews correlation coefficient of 0.139 on unseen test data.

Table 1: Top 10 function words sorted by p-values of the 𝜒² test, with absolute frequencies per class. Words marked with * are pronouns.

  Word      P-Value       # No Consp.   # Consp.
  my*       1 × 10⁻⁸           77            18
  is        8 × 10⁻⁸          415           340
  the       1 × 10⁻⁷          596           443
  they*     6 × 10⁻⁶          166           140
  used      1 × 10⁻⁵            8            27
  am        3 × 10⁻⁵           53            11
  during    7 × 10⁻⁵           33             4
  their*    5.2 × 10⁻⁴         57            62
  he*       7.1 × 10⁻⁴         89            40
  she*      7.4 × 10⁻⁴         16             1
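The ranking in Table 1 can be illustrated with a per-word 𝜒² test. The sketch below assumes each word is tested with a 2×2 contingency table (occurrences of the word versus all other function word tokens, in class 1 versus class 3); the per-class token totals are hypothetical placeholders, not statistics from the paper's corpus.

```python
# Sketch of a per-word chi-squared significance test on function word
# counts. The per-class token totals are ILLUSTRATIVE assumptions, not
# corpus statistics from the paper.
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 d.o.f.) for the
    2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((a, b, c, d), expected))
    # Survival function of the chi-squared distribution with 1 d.o.f.
    return stat, math.erfc(math.sqrt(stat / 2))

# Counts from Table 1: (class 1 "no conspiracy", class 3 "conspiracy").
word_counts = {"my": (77, 18), "is": (415, 340), "the": (596, 443)}

# Hypothetical totals of function word tokens per class.
total1, total3 = 20000, 15000

p_values = {w: chi2_2x2(c1, total1 - c1, c3, total3 - c3)[1]
            for w, (c1, c3) in word_counts.items()}

# Rank words from lowest to highest p-value, as in Table 1.
ranked = sorted(p_values, key=p_values.get)
```

With these invented totals the three example words happen to rank in the same order as in Table 1, although the p-values themselves depend on the real per-class token totals.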
1    INTRODUCTION                                                                                 am          0.00003                   53            11
Previous research into detecting conspiratorial online content indicates that it can be distinguished by the author's writing style [2, 10, 11]. Simultaneously, function words provide a meaningful proxy to an author's writing style in authorship attribution [1, 12]. Therefore, function words could be valuable in aiding machine learning models tasked with detecting conspiratorial content. However, many approaches in fake news classification still rely on a purely content-based approach, in which function words are excluded as part of the preprocessing [2]. While these approaches often offer impressive performance, it is paramount that potentially relevant features are not excluded in the process.

Fortunately, recent years have seen a growing body of research on the importance of stylistic features in misinforming and conspiratorial content. For instance, Posadas-Durán et al. [9] showed that classifiers reach better performance when function words are incorporated in the training data than when they are not. However, like many approaches, they used online news articles as their data [2]. Arguably, the style employed by authors of social media posts will be different, and it is to be expected that such results do not generalise.

For social media and especially Twitter data, the literature is rather sparse. Del Tredici and Fernández [13] classified articles shared on Twitter as fake or real, enhanced their data with each user's post history and profile description, and found more function words in the latter. However, they did not classify posts directly, but rather articles linked in posts. Niven et al. [6] used function words as a proxy for 'thoughtfulness', arguing that the latter correlates with the fakeness of a post's content, but found no significant difference in distribution between fake and legitimate content. Since they only had posts from 300 different users available, however, individual authors could have skewed the distribution. Im et al. [4] found above-chance performance for function word-only features when predicting whether a post stems from a Russian troll account, which, while related, is still not focused on fake news detection specifically.

This research thus aims to expand on the available literature by analysing how function word usage is distributed between authors of conspiratorial versus legitimate content, and by quantifying whether function words alone can act as sufficient features in the classification of fake news content.

2    APPROACH
The data were provided through the MediaEval 2021 Conference, for the task 'FakeNews: Corona Virus and Conspiracies Multimedia Analysis' [7, 8]. The data set consists of 1554 Twitter posts related to COVID-19 and different conspiracy theories.¹ Three different class labels are provided: tweets that do not mention conspiracies (1), tweets that discuss conspiracies without actively supporting them (2), and tweets that promote or support conspiracies (3). Note that the classes are imbalanced, with 767 examples for class 1, 271 examples for class 2, and 516 examples for class 3.

To investigate a possible difference in distribution between conspiratorial and non-conspiratorial content, most of this research focused on classes 1 and 3. First, function words were extracted from all data using spaCy.² To test for significance, a 𝜒² test was performed on the distribution of function words between classes 1 and 3.

Next, the usefulness of the extracted function words was tested. As a representative model, a support vector machine (SVM) with non-linear kernels was chosen, and evaluated using classification

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

¹ All posts are written in English and were collected between January 17, 2020 and June 30, 2021.
² For the full list of function words, see https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py


accuracy and Matthews correlation coefficient (MCC). Two scenarios were investigated, one with data containing all three classes, and one with data containing classes 1 and 3 only. To account for random effects, 100 runs were computed for both scenarios, each with a random 10% validation split. As a final evaluation, a single SVM model was fitted on all available data and evaluated on an unseen test set. For organisational reasons, this evaluation is only available for the 3-class case using MCC. Preprocessing of function words was done using the TfidfVectorizer of scikit-learn.³

Table 2: Average performance of SVM classifiers trained on function word features for the binary case between classes 1 and 3, as well as for all three classes, with random 10% validation splits. The test result is for a single SVM trained on all available data.

                        Train            Val              Test
                        Mean     Std.    Mean     Std.
  Accuracy (binary)     0.911    0.005   0.637    0.039   -
  MCC (binary)          0.818    0.009   0.203    0.088   -
  Accuracy (all)        0.819    0.006   0.533    0.034   -
  MCC (all)             0.710    0.010   0.165    0.056   0.139

3    RESULTS AND ANALYSIS
The results of the 𝜒² test can be seen in Table 1, showing the 10 function words with the lowest p-values, all of which are below 1%. Five of them are pronouns (50%), which is higher than the overall frequency of pronouns among the function words (≈11.48%). Only 2 of the 10 words occur more often in class 3 (conspiracy), while the remaining 8 occur more often in class 1 (no conspiracy).

The following are all 40 function words with p-values below 5%, sorted from lowest to highest p-value:

my, is, the, they, used, am, during, their, he, she, by, this, him, serious, doing, might, his, if, us, but, be, these, all, seem, about, part, her, along, could, your, due, have, are, here, using, at, per, when, would, now

Of these words, 12 are pronouns (30%). Seven belong to class 3, while the remaining 33 belong to class 1. Table 2 shows the results of the SVM classifiers.

4    DISCUSSION AND OUTLOOK
The results of the 𝜒² test show a significant difference in the distribution of function words between conspiratorial and non-conspiratorial content, confirming the idea that such words are indeed important for distinguishing fake from legitimate content. This is further supported by the classification results, where above-chance performance on unseen test data was achieved. Unsurprisingly, the classification performance in the binary case was higher than in the full case, since an overlap in style between non-conspiratorial content and content which does not actively support conspiracies, but merely discusses them, is to be expected.

Interestingly, the category of function words most common in the list of sub-5% p-value words was pronouns, whose relative frequency was greatly increased compared to their frequency among all function words. Further, all of these pronouns except their and these occurred more often in the non-conspiracy category. Most prominent are third-person singular pronouns (he, she, him, his), all featured more in class 1, which indicates that conspiracy authors are less likely to talk about a person at length. This could be because giving extensive detail (e.g. 'She said X') rather than implying makes their claims falsifiable, which they might be interested to avoid [3]. Interestingly, this stands in contrast to Rashkin et al. [10], who found a higher frequency of pronouns in conspiratorial content. Of further interest is the fact that the pronoun my is the most significant word overall. This could again be because conspiratorial authors want to steer the argument away from their own opinion towards a more general claim, thereby avoiding responsibility. This finding is somewhat supported by Newman et al. [5], who found higher usage rates of the pronoun I in people who are lying.

Apart from pronouns, the overall majority of function words with p-values below 5% belong to the non-conspiratorial class. This indicates that authors of fake content use a more simplistic style, as the complexity of a text correlates with the number of different function words used.

An earlier analysis on a smaller subset of the data showed different patterns in the function word distributions, most notably the presence of 'hedging' words like quite, rather and somehow. However, these patterns disappeared when the larger data set was released. It is therefore important to note that the data set at hand, with only 1554 posts in total, is a very limited subset of all COVID-19 related data found on Twitter. Thus, it cannot be ruled out that the patterns found in this research, although powerful in predicting on the chosen data set, may not generalise.

This limitation is compounded by the fact that the data analysed in this report contain no author information. As stylistic information correlates very strongly with the author of a text, the patterns found could, in theory, be caused by a few authors having a disproportionately high representation in the data. This effect unfortunately could not be accounted for due to the missing authorship information.

In conclusion, this research has shown that function words are a strong proxy for detecting conspiratorial content in the context of COVID-19 related fake news on Twitter. To address the limitations of this research, future work should explore to what extent these results generalise to larger corpora of Twitter data, different domains of conspiracy, and other social media platforms.

ACKNOWLEDGMENTS
I thank Martha Larson for her critical input during the development of the research question as well as the analysis process, and Lynn de Rijk for her input on function word usage and interesting lexical patterns in fake news content.

³ https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

REFERENCES
 [1] Shlomo Argamon and Shlomo Levitan. 2005. Measuring the Usefulness of Function Words for Authorship Attribution. Proceedings of the


     Joint Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities.
 [2] Nicollas R. de Oliveira, Pedro S. Pisa, Martin Andreoni Lopez, Dianne
     Scherly V. de Medeiros, and Diogo M.F. Mattos. 2021. Identifying
     fake news on social networks based on natural language processing:
     Trends and challenges. Information (Switzerland) 12 (Jan. 2021), 1–32.
     Issue 1. https://doi.org/10.3390/info12010038
 [3] Lynn de Rijk. 2020. You Said it? How Mis- and Disinformation Tweets
     Surrounding the Corona-5G-conspiracy Communicate Through Im-
     plying. Proceedings of the MediaEval 2020 Workshop, Online, 14-15
     December 2020 (2020).
 [4] Jane Im, Eshwar Chandrasekharan, Jackson Sargent, Paige Lightham-
     mer, Taylor Denby, Ankit Bhargava, Libby Hemphill, David Jurgens,
     and Eric Gilbert. 2020. Still out There: Modeling and Identifying Rus-
     sian Troll Accounts on Twitter. In 12th ACM Conference on Web Science
     (WebSci ‘20). Association for Computing Machinery, New York, NY,
     USA, 1–10. https://doi.org/10.1145/3394231.3397889
 [5] Matthew Newman, James Pennebaker, Diane Berry, and Jane Richards.
     2003. Lying Words: Predicting Deception from Linguistic Styles. Per-
     sonality & Social Psychology Bulletin 29 (June 2003), 665–75. https:
     //doi.org/10.1177/0146167203029005010
 [6] Timothy Niven, Hung-Yu Kao, and Hsin-Yang Wang. 2020. Profil-
     ing Spreaders of Disinformation on Twitter: IKMLab and Softbank
     Submission. In CLEF 2020.
 [7] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and
     Johannes Langguth. 2021. FakeNews: Corona Virus and Conspira-
     cies Multimedia Analysis Task at MediaEval 2021. Proceedings of the
     MediaEval 2021 Workshop, Online, 13-15 December 2021.
 [8] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Ste-
     fan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled
     Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets.
     Proceedings of the 2021 Workshop on Open Challenges in Online Social
     Networks, pp. 21-25 (2021).
 [9] Juan Pablo Posadas-Durán, Helena Gomez-Adorno, Grigori Sidorov,
     and Jesús Jaime Moreno Escobar. 2019. Detection of fake news in a
     new corpus for the Spanish language. Journal of Intelligent and Fuzzy
     Systems 36 (May 2019), 4868–4876. Issue 5. https://doi.org/10.3233/
     JIFS-179034
[10] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and
     Yejin Choi. 2017. Truth of Varying Shades: Analyzing Language in Fake
     News and Political Fact-Checking. In Proceedings of the 2017 Conference
     on Empirical Methods in Natural Language Processing. Association
     for Computational Linguistics, Copenhagen, Denmark, 2931–2937.
     https://doi.org/10.18653/v1/D17-1317
[11] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017.
     Fake News Detection on Social Media: A Data Mining Perspective.
     Special Interest Group on Knowledge Discovery in Data: Explorations
     Newsletter 19 (Aug. 2017). https://doi.org/10.1145/3137597.3137600
[12] Efstathios Stamatatos. 2009. A Survey of Modern Authorship Attribu-
     tion Methods. Journal of the Association for Information Science and
     Technology 60 (March 2009), 538–556. https://doi.org/10.1002/asi.21001
[13] Marco Del Tredici and Raquel Fernández. 2020. Words are the Win-
     dow to the Soul: Language-based User Representations for Fake
     News Detection. In Proceedings of the 28th International Conference
     on Computational Linguistics. International Committee on Compu-
     tational Linguistics, Barcelona, Spain (Online), 5467–5479. https:
     //doi.org/10.18653/v1/2020.coling-main.477
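As an illustration of the evaluation protocol from Section 2, the sketch below combines TF-IDF features restricted to a function word vocabulary, an SVM with a non-linear kernel, 100 random 10% validation splits, and MCC scoring. It is not the author's original code: the tweets and labels are invented placeholders, and the tiny hard-coded function word list stands in for spaCy's full English stop word list.

```python
# Sketch of the evaluation protocol: TF-IDF over function words only,
# a non-linear SVM, 100 random 10% validation splits, MCC scoring.
# Tweets, labels, and the word list are ILLUSTRATIVE placeholders.
import statistics

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Tiny stand-in for spaCy's English stop word list.
FUNCTION_WORDS = ["my", "is", "the", "they", "am", "their", "he", "she"]

# Invented placeholder tweets; labels: 1 = no conspiracy, 3 = conspiracy.
tweets = (["my test is fine and he is with me today"] * 10
          + ["they say the lab made it for their plan"] * 10)
labels = [1] * 10 + [3] * 10

# Restrict the vectoriser's vocabulary to function words only.
X = TfidfVectorizer(vocabulary=FUNCTION_WORDS).fit_transform(tweets)

# 100 runs, each with a random 10% validation split, scored with MCC.
mccs = []
for seed in range(100):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, labels, test_size=0.1, random_state=seed, stratify=labels)
    clf = SVC(kernel="rbf")  # SVM with a non-linear kernel
    clf.fit(X_tr, y_tr)
    mccs.append(matthews_corrcoef(y_val, clf.predict(X_val)))

mean_mcc = statistics.mean(mccs)
```

Averaging MCC over many random splits, as the paper does, reduces the variance that a single small validation split would introduce; the final test score in Table 2 instead comes from one model fitted on all available data.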