=Paper=
{{Paper
|id=Vol-2699/paper38
|storemode=property
|title=The Ebb and Flow of the COVID-19 Misinformation Themes
|pdfUrl=https://ceur-ws.org/Vol-2699/paper38.pdf
|volume=Vol-2699
|authors=Thomas Marcoux,Esther Mead,Nitin Agarwal
|dblpUrl=https://dblp.org/rec/conf/cikm/MarcouxMA20
}}
==The Ebb and Flow of the COVID-19 Misinformation Themes==
Thomas Marcoux (txmarcoux@ualr.edu), Esther Mead (elmead@ualr.edu), Nitin Agarwal (nxagarwal@ualr.edu)

University of Arkansas at Little Rock, Little Rock, AR 72204, USA

Title of the Proceedings: “Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland”. Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

===Abstract===
The COVID-19 pandemic has seen the emergence of unique misinformation narratives in various outlets, through social media, blogs, etc. This online misinformation has been proven to spread in a viral manner and has a direct impact on public safety. In an effort to improve public understanding, we curated a corpus of 543 misinformation pieces whittled down to 243 unique misinformation narratives along with third party proofs debunking these stories. Building upon previous applications of topic modeling to COVID-19 related material, we developed a tool leveraging topic modeling to create a chronological visualization of these stories. From our corpus of misinformation stories, this tool has been shown to accurately represent the ground truth reported by our curator team. This highlights some of the misinformation narratives unique to the COVID-19 pandemic and provides a quick method to monitor and assess misinformation diffusion, enabling policymakers to identify themes to focus on for communication campaigns.

===1 Introduction===
Following the discovery and subsequent spread of the COVID-19 pandemic, information has become one pivotal entity in determining how each nation responds to the crisis. We have seen a variety of contradictory statements on the national and international scene influencing opinions, in some cases politically polarizing the issue of how to respond to the pandemic. But we have also seen cases of direct, physical attempts (i.e., direct mail scams) at preying on the uninformed or vulnerable, such as personal protective equipment (PPE) marketing schemes. In both cases, it is obvious that information has a very real impact on the lives and livelihood of many. As such, we propose a study of the themes and chronological dynamics of the spreading of misinformation about COVID-19. Our corpus is a collection of unique misinformation stories manually curated by our team (the stories can be explored at our official website, https://cosmos.ualr.edu/covid-19). To highlight and visualize these misinformation themes, we use topic modeling, and introduce a tool to visualize the evolution of these themes chronologically.

===2 Literature Review===
The information community has been tackling the issue of misinformation surrounding the COVID-19 pandemic since early in the outbreak. We base the claims found in this paper on the findings that misinformation spreads in a viral fashion and that consumers of misinformation tend to fail at recognizing it as such [Pen+20]. In addition to this, we believe this research is essential as rampant misinformation constitutes a danger to public safety [Kou+20]. We also believe this research is helpful in curbing misinformation since researchers have found that simply recognizing the existence of misinformation and improving our understanding of it can enhance the larger public’s ability to recognize misinformation as such [Pen+20]. In order to better understand the misinformation surrounding the pandemic, we look at previous research that has leveraged topic models to understand online discussions surrounding this crisis. Research has shown the benefits of using this technique to understand fluctuating Twitter narratives over time [Sha+20], and also in understanding the significance of media outlets in health communications [Liu+20].

To implement topic modeling, we use the Latent Dirichlet Allocation (LDA) model. Within the realm of natural language processing (NLP), topic modeling is a statistical technique designed to categorize a set of documents within a number of abstract “topics” [BLS09]. A “topic” is defined as a set of words outlining a general underlying theme. For each document, which in this case is an individual item of misinformation in our data set, a probability is assigned that designates its “belongingness” to a certain topic. In this study, we use the popular LDA topic model due to its widespread use and proven performance [BNJ03]. One point of debate within the topic modeling community is the elimination of stop-words, i.e., whether analysts should filter common words from their corpus before training a model. Following recent research claiming that the use of custom stop-words adds little benefit [SMM17], we followed the researchers’ recommendation and removed common words after the model had been trained.

Our model choice has seen use in previous research using LDA for short texts, specifically for short social media texts such as tweets [ZML17]. Some other social media research using homogeneous social media sources such as tweets or blog posts uses associated hashtags to provide further context to topic models [ARL17]. This is a promising lead to expand this research towards big data social media corpora.

In this paper, we propose to leverage topic models to understand the main underlying themes of misinformation and their evolution over time using a manually curated corpus of known fake narratives.

===3 Methodology===
This study uses a two-step methodology to produce relevant topic streams. First, through a manual curating process, we aggregate different misinformation narratives for later processing. We consider as a misinformation narrative any narrative pushed through a variety of outlets (social media, radio, physical mail, etc.) that has been or is later believably disproved by a third party. This corpus constitutes our input data. Secondly, we use this corpus to train an LDA topic model and to generate subsequent topic streams for analysis. We describe these two steps in more detail in the next sections.

====3.1 Collection of Misinformation Stories====
Initially, the misinformation stories in our data set were obtained from a publicly available database created by EUvsDisinfo in March of 2020 [EUv20]. EUvsDisinfo’s database, however, was primarily focused on “pro-Kremlin disinformation efforts on the novel coronavirus”. Most of these items represented false narratives that were communicating political, military, and healthcare conspiracy theories in an attempt to sow confusion, distrust, and public discord. Subsequently, misinformation stories were continually gleaned from publicly available aggregators, such as:

* POLITIFACT (https://www.politifact.com/coronavirus/)
* Truth or Fiction (https://www.truthorfiction.com/)
* FactCheck.org (https://www.factcheck.org/)
* POLYGRAPH.info (https://www.polygraph.info/)
* Snopes (https://www.snopes.com/fact-check/)
* Full Fact (https://fullfact.org/health/coronavirus/#coronavirus)
* AP Fact Check (https://apnews.com/APFactCheck)
* Poynter (https://www.poynter.org/ifcn-covid-19-misinformation)
* Hoax-Slayer (https://www.hoax-slayer.net/category/covid-19/)

The following data points were collected for each misinformation item: title, summary, debunking date, debunking source, misinformation source(s), theme, and dissemination platform(s). The time period of our data set is from January 22, 2020 to July 22, 2020. The data set comprises 548 unique misinformation items. For many of the items, multiple platforms were used to spread the misinformation. For example, oftentimes a misinformation item will be posted on Facebook, Twitter, YouTube, and as an article on a website. For our data set, the most used platforms for spreading misinformation were websites, Facebook, Twitter, YouTube, and Instagram, respectively.

====3.2 Topic Modeling====
In order to derive lexical meaning from this corpus, we built a pipeline executing the following steps. First, we processed each document in our text corpus. All that is needed is a text field identified by a date. Because in most cases of word of mouth or social media it is impossible to pinpoint the exact date the idea first emerged, we use the date of publication of the corresponding third party “debunk piece”. We trained our LDA model using the Python tool Gensim (https://radimrehurek.com/gensim/), following the methodology and pre-processing best practices described by its author [ŘS10] as well as the stop-word practices described earlier [SMM17]. In this study, we found that generating 20 different topics best matched the ground truth as reported by the researchers curating the misinformation stories.

Once the model was trained, we ordered the documents by date and created a numpy matrix where each document is given a score for each topic produced by the model. This score describes the probability that the given document is categorized as being part of a topic, i.e., if a score is high enough (here, a 10% probability), the document is considered part of the topic. This allowed us to leverage the Python Pandas library (https://pandas.pydata.org/) to plot a chronological graph for each individual topic. We averaged topic distribution per day and used a moving average window size of 20. This helped in highlighting the overarching patterns of the different narratives. The tool is publicly available at https://github.com/thomas-marcoux/TopicStreamsTools.

===4 Results===
In this section, we discuss the thoughts of our data collection team and the ground truth as they were observed, and compare these with the results obtained through our topic modeling visualization tool.

====4.1 Prominent Misinformation Themes Over Time====
Although a variety of misinformation themes were identified, particularly dominant themes stood out, changing over time. These themes were considered dominant based on a simple sum of their frequency of occurrence in our data set.

During the month of March, the prominent misinformation theme was the promotion of remedies and techniques to supposedly prevent, treat, or kill the novel coronavirus.

During the month of April, the prominent themes still included the promotion of remedies and techniques, but additional prominent themes began to stand out. For example, several misinformation stories attempted to downplay the deadliness of the novel coronavirus. Others discussed the anti-malaria drug hydroxychloroquine. Others promoted the idea that the virus was a hoax meant to defeat President Donald Trump. Others consisted of various attempts to attribute false claims to high-profile people, such as politicians and representatives of health organizations. Also in April, although first signs of these were seen in March, the idea that 5G caused the novel coronavirus began to become more prevalent.

During the month of May, the prominent themes shifted to predominantly false claims made by high-profile people, followed by attempts to convince citizens that face masks are either more harmful than not wearing one, or are ineffective at preventing COVID-19, and how to avoid rules that required their use. The number and variety of identity theft phishing scams also increased during May. Misinformation items attempting to attribute false claims to high-profile people continued throughout May. Also becoming prominent in May were misinformation items attempting to spread fear about a potential COVID-19 vaccine, and items promoting the use of hydroxychloroquine.

During the month of June, the prominent theme shifted significantly to attempts to convince citizens that face masks are more harmful than not wearing one, and how to avoid rules that required their use. Phishing scams also remained prominent during June.

During the month of July, the dominant themes of the misinformation items shifted back to attempts to downplay the deadliness of the novel coronavirus. Another prominent theme in July was attempts to convince the public that COVID-19 testing is inflating the results.

====4.2 Topic Streams====
After using the tool described in 3.2, we generated the graphs and tables described and discussed in this section. Our data contains 243 unique misinformation narratives spanning from January 2020 to June 2020. The data was curated by our research team through the process described in the methodology. Each entry contains, among other fields, a “date” used as a chronological identifier, a “title” describing the general idea the misinformation is attempting to convey, and a “theme” field putting the story in a concisely described category. For example, a story given the title “US Department of Defense has a secret biological laboratory in Georgia” is categorized in the following theme: “Western countries are likely to be purposeful creators of the new virus.” Each topic was represented by an identification number up to 20 and a set of 10 words. We picked the three most relevant words that best represented the general idea of each topic. Notably, obvious words such as covid or coronavirus were removed from the topic descriptions since they are common to every topic.

In Tables 1 and 2, we describe some of the twenty topics found by each of our LDA models. These topics were chosen because they each described a precise narrative and have a low topic distribution (or proportion within the corpus). A low proportion is desirable because it indicates the detection of a unique narrative within the corpus, as opposed to an overarching topic including general words such as “world”, “outbreak”, or “pandemic”. Do note that topic membership is not exclusive and documents can be part of multiple topics.

This becomes apparent in the tables below: from our topic model, we found a dominant topic encompassing 68% of narratives. It includes words such as “Trump”, “outbreak”, “president”, etc. Some other narratives also included words such as “flu”, “news”, or “fake”. Because the evolution of these narratives is consistent across the corpus and shows little temporal fluctuation, we chose not to report on them further. For these reasons, the narratives we focused on below show a low percentage of distribution.

{| class="wikitable"
|+ Table 1: Most frequent dominant topics from titles.
! ID !! Word 1 !! Word 2 !! Word 3 !! Proportion
|-
| 10 || china || chinese || spread || 2%
|-
| 12 || scam || hydroxy... || health || 2%
|-
| 17 || state || donald || trump || 2%
|-
| 18 || vaccine || gates || bill || 5%
|}

{| class="wikitable"
|+ Table 2: Most frequent dominant topics from themes.
! ID !! Word 1 !! Word 2 !! Word 3 !! Proportion
|-
| 3 || fear || spread || western || 2%
|-
| 9 || predicted || pandemic || vaccine || 2%
|-
| 16 || phishing || hydroxy... || vaccine || 2%
|}

=====4.2.1 Using narrative titles as a corpus=====
The general narratives described by the topics were thus:

* Topic 10 described the narratives related to the Chinese government and its responsibility in the spread of the virus. These stories represented an estimated 2% of the 243 stories collected.
* Topic 12 described the narratives related to personal health and scams or misinformation such as the benefits of hydroxychloroquine. These stories represented an estimated 2% of the 243 stories collected.
* Topic 17 described the narratives related to the response of Donald Trump and his administration. These stories represented an estimated 2% of the 243 stories collected.
* Topic 18 described the narratives related to the involvement of Bill Gates in various conspiracies, mostly linked to vaccines. These stories represented an estimated 4% of the 243 stories collected.

Figure 1 shows the evolution of Topic 10, the topic describing China-related narratives. It shows that these narratives were already in full force from the beginning of our corpus and slowly came to a near halt during the month of April. We notice a short spike again towards the end of the corpus during the month of June. This is consistent with the ground truth of online narratives that focused on the provenance of the virus during the early stages.

Figure 1: Topic distribution of titles for topic 10 (keywords: china, chinese, spread)

Figure 2 shows the evolution of Topic 12, the topic describing narratives related to health, home remedies, and general hoaxes and scams stemming from the panic. We can see it was consistent with the rise of cases in the United States, and panic increased with the spread of the virus. It is interesting to note that this figure roughly coincides with the daily number of confirmed cases for this time period [Rit+20].

Figure 2: Topic distribution of titles for topic 12 (keywords: hydroxychloroquine, health, scam)

Figure 3 shows the evolution of Topic 17. This topic described stories related to Donald Trump and his administration. These stories generally referred to claims that the virus was manufactured as a political strategy, or claims that various public figures were speaking out against the response of the Trump administration.

Figure 3: Topic distribution of titles for topic 17 (keywords: donald, trump, state)

Figure 4 shows the evolution of Topic 18. This topic described stories such as Bill Gates and his perceived involvement with a hypothetical vaccine, and other theories describing the virus’ appearance and spread as an orchestrated effort. As with Figure 1, these narratives were especially strong early on (albeit this narrative remained active for a slightly longer time), before coming to a near halt.

We notice that as theories about the origins of the virus slowed down, hoaxes and scams on personal protection increased, as shown in Figure 2.

Figure 4: Topic distribution of titles for topic 18 (keywords: bill, gates, vaccine)

=====4.2.2 Using narrative themes as a corpus=====
For this section, we inputted narrative themes as the corpus. Note that the topic IDs are independent from the previous set of topics using titles. Similarly to section 4.2.1, we found a dominant topic encompassing 68% of narratives as well, this time including words such as “attempt”, “countries”, and “purposeful”. As for section 4.2.1, we chose not to report on that topic as well as other smaller but general topics showing little fluctuation. Therefore, the narratives we focused on below show a low percentage of distribution.

The general narratives described by the topics are thus:

* Topic 3 described the narratives related to the speculations on the spread of the virus, especially in an international relations context. These stories represented an estimated 2% of the 243 stories collected.
* Topic 9 described the narratives related to stories claiming the creation and propagation of the virus were either designed or predicted, along with voices claiming a vaccine already exists. These stories represented an estimated 3% of the 243 stories collected.
* Topic 16 described the narratives related to personal health and scams or misinformation such as the benefits of hydroxychloroquine. These stories represented an estimated 2% of the 243 stories collected.

Figure 5 shows the evolution of Topic 3. It is linked to early fear of the virus and presented narratives as opposing the western block with the East, notably China. It matched closely with Figure 1 and its China-related narratives. In both cases, we see an early dominance of the topic followed by a near halt as the virus touched the United States.

Figure 5: Topic distribution of themes for topic 3 (keywords: fear, spread, western)

Figure 6 describes the evolution of narratives claiming the virus was predicted or even designed. This figure is consistent with the results shown by Figure 4, which shows claims regarding Bill Gates, early vaccines, etc. They both showed stories of early knowledge of the virus and peaked early, appearing more or less sporadically as time goes on and as cases increased.

Figure 6: Topic distribution of themes for topic 9 (keywords: predicted, pandemic, vaccine)

Figure 7 is parallel to Figure 2. Both showed hoax stories promoting scams and health-related misinformation. We noticed an early rise on Figure 7, most likely due to the inclusion of the keyword “vaccines” in the topic, which caused some overlap with Topic 9 as shown in Figure 6.

Figure 7: Topic distribution of themes for topic 16 (keywords: hydroxychloroquine, vaccine, phishing)

===5 Conclusion===
This study has highlighted some of the narratives that surfaced during the COVID-19 pandemic. We collected 243 unique misinformation narratives over six months and proposed a tool to observe their evolution. We have shown the potential of using topic modeling visualization to get a bird’s eye view of the fluctuating narratives and an ability to quickly gain a better understanding of the evolution of individual stories. We have seen that the tool is efficient at chronologically representing actual narratives pushed to various outlets, as confirmed by the ground truth observed by our misinformation curating team. This work illustrates a relatively quick technique for allowing policy makers to monitor and assess the diffusion of misinformation on online social networks in real-time, which will enable them to take a proactive approach in crafting important theme-based communication campaigns to their respective citizen constituents.

We have also seen in this study that using carefully curated “themes” (which offer a lexical value close to the abstract topics provided by the LDA model) yields similar results to using the misinformation narratives’ “title”. This paves the way for scaling this method with much larger corpora such as a set of news headlines, blog titles, or social media posts.

LDA is generally viewed as more reliable due to the control one can have over the number of topics. Finding an optimal level of granularity through trial and error tends to perform well when tailored to the use-case. Because the LDA topic model may become difficult to scale, however, we consider using the HDP (Hierarchical Dirichlet Process) model for future works involving multiple larger corpora. This model attempts to infer the number of topics computationally, which may become more scalable on large sets of documents with an unknown number of topics.

===Acknowledgements===
This research is funded in part by the U.S. National Science Foundation (OIA-1946391, OIA-1920920, IIS-1636933, ACI-1429160, and IIS-1110868), U.S. Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-1187, N00014-16-1-2016, N00014-16-1-2412, N00014-17-1-2675, N00014-17-1-2605, N68335-19-C-0359, N00014-19-1-2336, N68335-20-C-0540), U.S. Air Force Research Lab, U.S. Army Research Office (W911NF-17-S-0002, W911NF-16-1-0189), U.S. Defense Advanced Research Projects Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, the Jerry L. Maulden/Entergy Endowment at the University of Arkansas at Little Rock, and the Australian Department of Defense Strategic Policy Grants Program (SPGP) (award number: 2020-106-094). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations. The researchers gratefully acknowledge the support.

===References===
* [BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation”. In: Journal of Machine Learning Research 3 (2003), pp. 993–1022.
* [BLS09] David M. Blei, John D. Lafferty, and Ashok N. Srivastava. Text Mining: Classification, Clustering, and Applications. CRC Press, 2009, pp. 71–88.
* [ŘS10] Radim Řehůřek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”. In: May 2010, pp. 45–50. doi: 10.13140/2.1.2393.1847.
* [ARL17] Md. Hijbul Alam, Woo-Jong Ryu, and SangKeun Lee. “Hashtag-based topic evolution in social media”. In: World Wide Web 20.6 (Nov. 2017), pp. 1527–1549. issn: 1573-1413. doi: 10.1007/s11280-017-0451-3. url: https://doi.org/10.1007/s11280-017-0451-3.
* [SMM17] A. Schofield, M. Magnusson, and D. Mimno. “Pulling Out the Stops: Rethinking Stopword Removal for Topic Models”. In: 15th Conference of the European Chapter of the Association for Computational Linguistics. Vol. 2. Association for Computational Linguistics. 2017, pp. 432–436.
* [ZML17] Y. Zhang, W. Mao, and J. Lin. “Modeling Topic Evolution in Social Media Short Texts”. In: 2017 IEEE International Conference on Big Knowledge (ICBK). 2017, pp. 315–319.
* [EUv20] EUvsDisinfo. EUvsDisinfo. March 16, 2020. The Kremlin and Disinformation About Coronavirus. 2020. url: https://euvsdisinfo.eu/the-kremlin-and-disinformation-about-coronavirus/. (accessed: 07.16.2020).
* [Kou+20] Ramez Kouzy et al. “Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter”. eng. In: Cureus 12.3 (Mar. 2020). Publisher: Cureus, e7255–e7255. issn: 2168-8184. doi: 10.7759/cureus.7255. url: https://pubmed.ncbi.nlm.nih.gov/32292669.
* [Liu+20] Qian Liu et al. “Health Communication Through News Media During the Early Stage of the COVID-19 Outbreak in China: Digital Topic Modeling Approach”. In: J Med Internet Res 22.4 (Apr. 2020), e19118. issn: 1438-8871. doi: 10.2196/19118. url: http://www.ncbi.nlm.nih.gov/pubmed/32302966.
* [Pen+20] Gordon Pennycook et al. “Fighting COVID-19 Misinformation on Social Media: Experimental Evidence for a Scalable Accuracy-Nudge Intervention”. In: Psychological Science 31.7 (2020). eprint: https://doi.org/10.1177/0956797620939054, pp. 770–780. doi: 10.1177/0956797620939054. url: https://doi.org/10.1177/0956797620939054.
* [Rit+20] Hannah Ritchie et al. United States: Coronavirus Pandemic - Our World in Data. 2020. url: https://ourworldindata.org/coronavirus/country/united-states?country=~USA. (accessed: 07.29.2020).
* [Sha+20] Hao Sha et al. Dynamic topic modeling of the COVID-19 Twitter narrative among U.S. governors and cabinet executives. 2020. arXiv: 2004.11692 [cs.SI].
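As a companion to Section 3.2, the scoring and smoothing arithmetic described there (topic membership at a 10% probability, per-day averaging of topic scores, then a moving average) can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the authors' tool, which relies on Gensim, NumPy, and Pandas. The sample scores, the function names, the trailing window, and the handling of days without documents are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical input: one (debunk date, per-topic probabilities) pair per story.
# In the paper these scores come from a trained LDA model; the numbers here are made up.
docs = [
    (date(2020, 3, 1), [0.80, 0.15, 0.05]),
    (date(2020, 3, 1), [0.60, 0.05, 0.35]),
    (date(2020, 3, 2), [0.20, 0.70, 0.10]),
    (date(2020, 3, 4), [0.05, 0.90, 0.05]),
]

MEMBERSHIP_THRESHOLD = 0.10  # the "10% probability" cutoff mentioned in Section 3.2


def topic_members(docs, topic, threshold=MEMBERSHIP_THRESHOLD):
    """Return indices of documents whose score for `topic` reaches the threshold."""
    return [i for i, (_, scores) in enumerate(docs) if scores[topic] >= threshold]


def daily_average(docs, topic):
    """Average the score of `topic` over all documents sharing a date, sorted by date.

    Assumption: days with no documents are simply absent from the result.
    """
    per_day = defaultdict(list)
    for day, scores in docs:
        per_day[day].append(scores[topic])
    return [(day, sum(vals) / len(vals)) for day, vals in sorted(per_day.items())]


def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far.

    Assumption: the paper does not say whether its window is trailing or centered.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Build one topic stream: per-day averages for topic 1, then smoothing.
stream = daily_average(docs, topic=1)
smoothed = moving_average([value for _, value in stream], window=2)
```

The paper reports a window size of 20; the window of 2 above is only there so the toy data shows a visible effect. Because membership is decided per topic against a threshold, a document can belong to several topics at once, which matches the note in Section 4.2 that topic membership is not exclusive.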