=Paper=
{{Paper
|id=Vol-3181/paper41
|storemode=property
|title=Don’t Just Drop Them: Function Words as Features in COVID-19 Related Fake News Classification on Twitter
|pdfUrl=https://ceur-ws.org/Vol-3181/paper41.pdf
|volume=Vol-3181
|authors=Pascal Schröder
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Schroder21
}}
==Don’t Just Drop Them: Function Words as Features in COVID-19 Related Fake News Classification on Twitter==
Pascal Schröder, Radboud University, Netherlands
pascal.schroeder@ru.nl

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, December 13-15 2021, Online.

===Abstract===
This research shows that function words can be useful as features for machine learning models tasked with detecting conspiratorial content in COVID-19 related Twitter posts. A significance test shows that the distribution of function words varies greatly between fake and legitimate content. Further, a support vector machine classifier is demonstrated to perform above chance when using function word-only features, achieving a Matthews correlation coefficient of 0.139 on unseen test data.

===1 Introduction===
Previous research into detecting conspiratorial online content indicates that it can be distinguished by the author’s writing style [2, 10, 11]. At the same time, function words provide a meaningful proxy for an author’s writing style in authorship attribution [1, 12]. Function words could therefore be valuable in aiding machine learning models tasked with detecting conspiratorial content. However, many approaches to fake news classification still rely on purely content-based features, in which function words are excluded as part of the preprocessing [2]. While these approaches often offer impressive performance, it is paramount that potentially relevant features are not discarded in the process.

Fortunately, recent years have seen a growing body of research on the importance of stylistic features in misinforming and conspiratorial content. For instance, Posadas-Durán et al. [9] showed that classifiers reach better performance when function words are incorporated in the training data than when they are not. However, like many approaches, they used online news articles as their data [2]. Arguably, the style adopted by authors of social media posts will be different, and the results cannot be expected to generalise.

For social media, and especially Twitter data, the literature is rather sparse. Del Tredici and Fernández [13] classified articles shared on Twitter as fake or real, enhanced their data with the user’s post history and profile description, and found more function words in the latter. However, they did not classify posts directly, but rather the articles linked in posts. Niven et al. [6] used function words as a proxy for ‘thoughtfulness’, arguing that the latter correlates with the fakeness of a post’s content, but found no significant difference in distribution between fake and legitimate content. Since they only had posts from 300 different users available, however, individual authors could have skewed the distribution. Im et al. [4] found above-chance performance for function word-only features when predicting whether a post stems from a Russian troll account, which, while related, is still not focused on fake news detection specifically.

This research thus aims to expand on the available literature by analysing how function word usage is distributed between authors of conspiratorial versus legitimate content, and by quantifying whether function words alone can act as sufficient features in the classification of fake news content.

===2 Approach===
The data were provided through the MediaEval 2021 conference, for the task ‘FakeNews: Corona Virus and Conspiracies Multimedia Analysis’ [7, 8]. The data set consists of 1554 Twitter posts related to COVID-19 and different conspiracy theories; all posts are written in English and were collected between January 17, 2020 and June 30, 2021. Three class labels are provided: tweets that do not mention conspiracies (1), tweets that discuss conspiracies without actively supporting them (2), and tweets that promote or support conspiracies (3). Note that the classes are imbalanced, with 767 examples for class 1, 271 examples for class 2, and 516 examples for class 3.

To investigate a possible difference in distribution between conspiratorial and non-conspiratorial content, most of this research focused on classes 1 and 3. First, function words were extracted from all data using spaCy (for the full list of function words, see https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py). To test for significance, a χ² test was performed on the distribution of function words between classes 1 and 3.
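The paper includes no code; the following is a minimal sketch of how this step could look. It assumes a hypothetical list <code>tweets</code> of (text, label) pairs, uses spaCy’s English stop-word list as the function-word inventory (matching the list linked above), and assumes a per-word 2×2 contingency design, since the paper reports per-word p-values without spelling out the exact test setup.

<syntaxhighlight lang="python">
# Minimal sketch (assumptions noted above): count function words per class
# and run a chi-squared test per word between class 1 and class 3.
from collections import Counter

import spacy
from scipy.stats import chi2_contingency
from spacy.lang.en.stop_words import STOP_WORDS  # spaCy's function word list

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def function_word_counts(texts):
    """Count occurrences of spaCy stop words (function words) in texts."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(tok.lower_ for tok in doc if tok.lower_ in STOP_WORDS)
    return counts

# `tweets` is a hypothetical list of (text, label) pairs.
c1 = function_word_counts(t for t, y in tweets if y == 1)  # no conspiracy
c3 = function_word_counts(t for t, y in tweets if y == 3)  # promotes conspiracy
n1, n3 = sum(c1.values()), sum(c3.values())

p_values = {}
for word in set(c1) | set(c3):
    # 2x2 table: this word vs. all other function words, in each class.
    table = [[c1[word], c3[word]],
             [n1 - c1[word], n3 - c3[word]]]
    _, p, _, _ = chi2_contingency(table)
    p_values[word] = p

# Reproduce the shape of Table 1: lowest p-values first.
for word, p in sorted(p_values.items(), key=lambda kv: kv[1])[:10]:
    print(f"{word}\t{p:.0e}\t{c1[word]}\t{c3[word]}")
</syntaxhighlight>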
Next, the usefulness of the extracted function words was tested. As a representative model, a support vector machine (SVM) with a non-linear kernel was chosen and evaluated using classification accuracy and the Matthews correlation coefficient (MCC). Two scenarios were investigated: one with data containing all three classes, and one with data containing classes 1 and 3 only. To account for random effects, 100 runs were computed for both scenarios, each with a random 10% validation split. As a final evaluation, a single SVM model was fitted on all available data and evaluated on an unseen test set; for organisational reasons, this evaluation is only available for the 3-class case using MCC. Preprocessing of function words was done using the TfidfVectorizer of scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
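Again as a sketch under stated assumptions: the paper does not report SVM hyperparameters, so scikit-learn’s default RBF-kernel <code>SVC</code> stands in for the ‘SVM with non-linear kernels’, and the vectorizer’s vocabulary is restricted to the spaCy stop-word list so that only function words act as features. <code>tweets</code> is the same hypothetical list as above.

<syntaxhighlight lang="python">
# Sketch of the evaluation protocol: TF-IDF over function words only,
# an RBF-kernel SVM, and 100 runs with random 10% validation splits.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from spacy.lang.en.stop_words import STOP_WORDS

texts = [t for t, y in tweets]            # `tweets` as in the sketch above
y = np.array([lab for t, lab in tweets])

# Only function words enter the feature space; the relaxed token pattern
# keeps one-letter function words such as 'a' and 'i'.
vectorizer = TfidfVectorizer(vocabulary=sorted(STOP_WORDS),
                             token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)

val_acc, val_mcc = [], []
for seed in range(100):                   # 100 runs against random effects
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.1, random_state=seed)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    pred = clf.predict(X_va)
    val_acc.append(accuracy_score(y_va, pred))
    val_mcc.append(matthews_corrcoef(y_va, pred))

print(f"val accuracy: {np.mean(val_acc):.3f} +/- {np.std(val_acc):.3f}")
print(f"val MCC:      {np.mean(val_mcc):.3f} +/- {np.std(val_mcc):.3f}")
</syntaxhighlight>

For the binary scenario, the same loop would run on the subset with labels 1 and 3; the final test score reported in Table 2 would come from a single model refit on all available data.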
===3 Results and Analysis===
The results of the χ² test can be seen in Table 1, showing the 10 function words with the lowest p-values, all of which are below 1%. Five of these are pronouns (50%), which is far higher than the overall proportion of pronouns among the function words (≈11.48%). Only 2 of the 10 words occur more often in class 3 (conspiracy), while the remaining 8 occur more often in class 1 (no conspiracy).

{| class="wikitable"
|+ Table 1: Top 10 function words sorted by p-values of the χ² test, with absolute frequencies per class. Italic words are pronouns; bold denotes the class with the most occurrences.
|-
! Word !! p-value !! # No Consp. !! # Consp.
|-
| ''my'' || 1 × 10⁻⁸ || '''77''' || 18
|-
| is || 8 × 10⁻⁸ || '''415''' || 340
|-
| the || 1 × 10⁻⁷ || '''596''' || 443
|-
| ''they'' || 6 × 10⁻⁶ || '''166''' || 140
|-
| used || 0.00001 || 8 || '''27'''
|-
| am || 0.00003 || '''53''' || 11
|-
| during || 0.00007 || '''33''' || 4
|-
| ''their'' || 0.00052 || 57 || '''62'''
|-
| ''he'' || 0.00071 || '''89''' || 40
|-
| ''she'' || 0.00074 || '''16''' || 1
|}

The following are all function words with p-values below 5%, 40 in total, sorted from lowest to highest p-value:

: my, is, the, they, used, am, during, their, he, she, by, this, him, serious, doing, might, his, if, us, but, be, these, all, seem, about, part, her, along, could, your, due, have, are, here, using, at, per, when, would, now

Of these words, 12 are pronouns (30%); 7 occur more often in class 3, while 33 occur more often in class 1.

Table 2 shows the results of the SVM classifiers.

{| class="wikitable"
|+ Table 2: Average performance of SVM classifiers trained on function word features for the binary case between classes 1 and 3, as well as for all three classes, with random 10% validation splits. The test result is for a single SVM trained on all available data.
|-
! rowspan="2" | Metric !! colspan="2" | Train !! colspan="2" | Val !! rowspan="2" | Test
|-
! Mean !! Std. !! Mean !! Std.
|-
| Accuracy (binary) || 0.911 || 0.005 || 0.637 || 0.039 || -
|-
| MCC (binary) || 0.818 || 0.009 || 0.203 || 0.088 || -
|-
| Accuracy (all) || 0.819 || 0.006 || 0.533 || 0.034 || -
|-
| MCC (all) || 0.710 || 0.010 || 0.165 || 0.056 || 0.139
|}

===4 Discussion and Outlook===
The results of the χ² test show a significant difference in the distribution of function words between conspiratorial and non-conspiratorial content, confirming the idea that such words are indeed important for distinguishing fake from legitimate content. This is further supported by the classification results, where above-chance performance on unseen test data was achieved. Unsurprisingly, the classification performance in the binary case was higher than in the full case, since an overlap in style is to be expected between non-conspiratorial content and content which does not actively support conspiracies but merely discusses them.
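For reference, since ‘above chance’ is measured here by MCC (the paper itself does not state the formula): in the binary case, the Matthews correlation coefficient is computed from the confusion matrix as

<math>
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
</math>

so that 0 corresponds to chance-level prediction and 1 to a perfect classifier; the 3-class results presumably use the standard multiclass generalisation of this coefficient, as implemented for instance in scikit-learn’s matthews_corrcoef. This is why the test MCC of 0.139, while modest, counts as above chance.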
Interestingly, the category of function words most common in the list of sub-5% p-value words was pronouns, whose relative frequency was greatly increased compared to their frequency among all function words. Further, all of these pronouns except their and these occurred more often in the non-conspiracy category. Most prominent are the third-person singular pronouns (he, she, him, his), all featured more in class 1, which indicates that conspiracy authors are less likely to talk about a person at length. This could be because giving extensive detail (e.g. ‘She said X’) rather than implying makes their claims falsifiable, which they may wish to avoid [3]. Interestingly, this stands in contrast to Rashkin et al. [10], who found a higher frequency of pronouns in conspiratorial content. Of further interest is the fact that the pronoun my displays the most significance overall. This could again be because conspiratorial authors want to steer the argument away from their own opinion towards a more general claim, thereby avoiding responsibility. This finding is somewhat supported by Newman et al. [5], who found lower usage rates of the pronoun I in people who are lying.

Apart from pronouns, the overall majority of function words with p-values below 5% belong to the non-conspiratorial class. This indicates that authors of fake content use a more simplistic style, as the complexity of a text correlates with the number of different function words used.

An earlier analysis on a smaller subset of the data showed different patterns in the function word distributions, most notably the presence of ‘hedging’ words like quite, rather and somehow. However, these patterns disappeared when the larger data set was released. It is therefore important to note that the data set at hand, with only 1554 posts in total, is a very limited subset of all COVID-19 related data found on Twitter. Thus, it cannot be ruled out that the patterns found in this research, although powerful in predicting on the chosen data set, may not generalise.

This limitation is compounded by the fact that the data analysed in this report contain no author information. As stylistic information correlates very strongly with the author of a text, the patterns found could, in theory, be caused by a few authors having a disproportionately high representation in the data. This effect unfortunately could not be accounted for due to the missing authorship information.

In conclusion, this research has shown that function words are a strong proxy for detecting conspiratorial content in the context of COVID-19 related fake news on Twitter. To address the limitations of this research, future work should explore to what extent these results generalise to larger corpora of Twitter data, to different domains of conspiracy, and to other social media platforms.

===Acknowledgments===
I thank Martha Larson for her critical input during the development of the research question as well as the analysis process, and Lynn de Rijk for her input on function word usage and interesting lexical patterns in fake news content.

===References===
[1] Shlomo Argamon and Shlomo Levitan. 2005. Measuring the Usefulness of Function Words for Authorship Attribution. In Proceedings of the Joint Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities.
[2] Nicollas R. de Oliveira, Pedro S. Pisa, Martin Andreoni Lopez, Dianne Scherly V. de Medeiros, and Diogo M. F. Mattos. 2021. Identifying Fake News on Social Networks Based on Natural Language Processing: Trends and Challenges. Information 12, 1 (Jan. 2021), 1–32. https://doi.org/10.3390/info12010038
[3] Lynn de Rijk. 2020. You Said It? How Mis- and Disinformation Tweets Surrounding the Corona-5G-Conspiracy Communicate Through Implying. In Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020.
[4] Jane Im, Eshwar Chandrasekharan, Jackson Sargent, Paige Lighthammer, Taylor Denby, Ankit Bhargava, Libby Hemphill, David Jurgens, and Eric Gilbert. 2020. Still out There: Modeling and Identifying Russian Troll Accounts on Twitter. In 12th ACM Conference on Web Science (WebSci ’20). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3394231.3397889
[5] Matthew Newman, James Pennebaker, Diane Berry, and Jane Richards. 2003. Lying Words: Predicting Deception from Linguistic Styles. Personality & Social Psychology Bulletin 29 (June 2003), 665–675. https://doi.org/10.1177/0146167203029005010
[6] Timothy Niven, Hung-Yu Kao, and Hsin-Yang Wang. 2020. Profiling Spreaders of Disinformation on Twitter: IKMLab and Softbank Submission. In CLEF 2020.
[7] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth. 2021. FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[8] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. In Proceedings of the 2021 Workshop on Open Challenges in Online Social Networks, 21–25.
[9] Juan Pablo Posadas-Durán, Helena Gomez-Adorno, Grigori Sidorov, and Jesús Jaime Moreno Escobar. 2019. Detection of Fake News in a New Corpus for the Spanish Language. Journal of Intelligent and Fuzzy Systems 36, 5 (May 2019), 4868–4876. https://doi.org/10.3233/JIFS-179034
[10] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2931–2937. https://doi.org/10.18653/v1/D17-1317
[11] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter 19 (Aug. 2017). https://doi.org/10.1145/3137597.3137600
[12] Efstathios Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60 (March 2009), 538–556. https://doi.org/10.1002/asi.21001
[13] Marco Del Tredici and Raquel Fernández. 2020. Words are the Window to the Soul: Language-based User Representations for Fake News Detection. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 5467–5479. https://doi.org/10.18653/v1/2020.coling-main.477