Challenges in Combating COVID-19 Infodemic - Data, Tools, and Ethics

Kaize Ding, Arizona State University, kaize.ding@asu.edu
Kai Shu, Illinois Institute of Technology, kshu@iit.edu
Yichuan Li, Arizona State University, yichuan1@asu.edu
Amrita Bhattacharjee, Arizona State University, abhatt43@asu.edu
Huan Liu, Arizona State University, huan.liu@asu.edu

Abstract

While the COVID-19 pandemic continues its global devastation, numerous accompanying challenges emerge. One important challenge we face is to efficiently and effectively use recently gathered data and find computational tools to combat the COVID-19 infodemic, a typical information overload problem. The novel coronavirus presents many questions without ready answers; its uncertainty and our eagerness in search of solutions offer a fertile environment for an infodemic. It is thus necessary to combat the infodemic and make a concerted effort to confront COVID-19 and mitigate its negative impact in all walks of life while saving lives and maintaining normal order during trying times. In this position paper on combating the COVID-19 infodemic, we illustrate its need by providing real-world examples of rampant conspiracy theories, misinformation, and various types of scams that take advantage of human kindness, fear, and ignorance. We present three key challenges in this fight against the COVID-19 infodemic where researchers and practitioners instinctively want to contribute and help. We demonstrate that these challenges can and will be effectively addressed by collective wisdom, crowdsourcing, and collaborative research.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops October 19-20, Galway, Ireland"
Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi

1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The World Health Organization (WHO) recently declared the COVID-19 outbreak a Public Health Emergency of International Concern (PHEIC) and a pandemic due to its high morbidity and mortality rates. As of April 15, 2020, more than 2.04 million cases have been reported across 210 countries and territories, resulting in over 133,000 deaths¹. These numbers are continuing to rise, and the health systems in many countries are overwhelmed and struggle to provide treatment. Concomitant with the pandemic are many unknowns that create a conducive environment for misinformation, fake news, political disinformation campaigns, scams, etc. Such malicious content instigates fear or anger, capitalizes on human vulnerability, and exploits human emotion, kindness, and/or wishes for miracles.

As the coronavirus spreads like wildfire around the world, disinformation machines also accelerate their campaigns on various fronts, opening a new infodemic battlefield. Social media platforms such as Facebook/Instagram, Twitter, and Google/YouTube have been abused to disseminate erroneous content. While the whole world is scrambling to fight the COVID-19 pandemic, governments and WHO also have to combat an infodemic, which is defined as "an overabundance of information – some accurate and some not – that makes it hard for people to find trustworthy sources and reliable guidance when they need it" [Don20]. The COVID-19 infodemic causes confusion, sows division, incites hatred, promotes unproven cures, and provokes social panic, which directly impacts emergency response, treatment, recovery, and financial and mental health during the difficult time of self-isolation. Therefore, combating the COVID-19 infodemic is a challenging yet imperative task.

In this paper, we first present some COVID-19 related examples to illustrate the variety and range of infodemic cases in two representative categories: conspiracy theories and misinformation, and scams and security attacks. These examples reinforce the urgency and need for addressing the COVID-19 infodemic via scalable and timely solutions. We then discuss the essential challenges in designing and developing corresponding AI solutions from three perspectives: data, computational tools, and ethics. The last challenge, ethics, is particularly easy to overlook when we rush to confront the immediate threats. Therefore, it is important to understand unintended consequences when developing AI solutions to ensure sustainable and healthy use and deployment. Last, we use some current efforts to demonstrate the feasibility of addressing the three challenges in combating the COVID-19 infodemic; meanwhile, by understanding the challenges and the resources we already have, we also appreciate the importance of collaborative research for effectively and efficiently combating the COVID-19 infodemic.

¹ https://en.wikipedia.org/wiki/Coronavirus_disease_2019

2 Examples of COVID-19 Infodemic

To illustrate what the COVID-19 infodemic looks like, how expansive, active, and devastating it is, and why it is important to thwart or mitigate its present threats, we first present various examples regarding conspiracy theories and misinformation, and scams and other security attacks.

2.1 Conspiracy theories and misinformation

With the spread of the COVID-19 pandemic, the World Health Organization (WHO) recently warned of an "infodemic" of rampant conspiracy theories about the coronavirus. Those conspiracy theories have appeared in both social media and mainstream news outlets and are often intertwined with geopolitics. One example concerns how the new coronavirus originated: according to a Pew Research Center survey, nearly three-in-ten Americans believe COVID-19 was a bio-weapon made in the lab. The top 10 conspiracy theories include claims that the SARS-CoV-2 virus was created as a biological weapon in a lab, that GMOs are the culprit, that COVID-19 actually doesn't exist, and that the coronavirus is a plot by big Pharma [Lyn20].

Coronavirus misinformation is also flooding the internet through social media and text messages, and is propagated by celebrities, politicians, and other prominent public figures. According to the report in [KG20], "among outlets that repeatedly share false content, eight of the top 10 most engaged-with sites are running coronavirus stories." For instance, there are plenty of supposed "cures" on social media that will likely mislead people into risking their lives for quick fixes. Despite the National Institutes of Health (NIH) warning that many hearsay cures have no evidence of being effective, there are endless claims that herbs, teas, or something of the sort can prevent the coronavirus. Recently, some wireless towers were damaged in the UK due to a false claim that radio waves sent by 5G technology cause small changes to people's bodies that make them succumb to the virus.

2.2 Scam, spam, phishing, and malware

As more and more people start working or studying from home, cyber criminals have recently shifted their focus to target remote workers. Different attacks such as scam, spam, phishing, and malware, which prey on people's willingness to help, fear of supply shortage, and moments of weakness, have become increasingly active. Researchers have found that the volume of coronavirus email scams nearly tripled in one week, with almost 3% of all global spam now estimated to be COVID-19 related. During the coronavirus pandemic, as state governments and hospitals have scrambled to obtain masks and other medical supplies, scammers attempted to sell a fake stockpile of 39 million masks to a California labor union. According to The Hill [Mil20], "Hackers are taking advantage of the increased reliance on networks to target critical organizations such as health care groups and members of the public, stealing and profiting off sensitive information and putting lives at risk."

3 Data, Tool, and Ethics Challenges

The scale, volume, and reach of the COVID-19 infodemic necessitate reliance on AI and machine learning (ML) algorithms to react promptly and respond rapidly. The success of AI and ML algorithms requires large amounts of multi-modal data for their efficiency and effectiveness, which introduces a data challenge. Data extraction and curation from multi-source data needs different computational tools to accurately categorize and sort out various types of data, which presents a tool challenge. When we rush to deal with present threats, we should be aware of potential side effects, unexpected consequences, and biases of our solutions, which suggests an ethics challenge. In this section, we discuss these challenges in detail.

3.1 Data challenge

Though numerous COVID-19 data sources are available online, their datasets are scattered across various websites that serve different needs. The major data challenge of isolated data sources is awareness of their existence. A related issue is that they are collected from different sources or under different crawl settings. For example, the Allen Institute for AI (AI2) released a scholarly articles dataset² collected from PMC, medRxiv, and bioRxiv; LitCovid [CAL20] collected scientific information from PubMed. Combining different data sources leads to higher data quality and better coverage.

To address the data challenge, we need to overcome some shortcomings of existing dataset collections: disorganization – most of them merely list all the collected datasets on their websites without information summarizing the relationships among them; specificity – data is collected for a specific topic, for example, Amazon's epidemic dataset on the cloud³ and the COVID-19 GIS Hub⁴ contain only academic findings and geospatial-related datasets, respectively; and inconvenience – most sites merely provide reference links to the source datasets and do not provide data utility tools like covid19datahub [GA20] for easy access.

² https://allenai.org/data/cord-19
³ https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/
⁴ https://coronavirus-disasterresponse.hub.arcgis.com/

3.2 Computational tool challenge

There are existing resources that can help users identify malicious intent in websites. Google's Safe Browsing API, for instance, allows the user to enter a URL and check it against Google's constantly updated lists of unsafe web resources. Similar resources include isitPhishing.org, malwareurl.com, and antivirus software, among many others. Additionally, users can check malicious domain lists through different sources such as phishtank.com or the aforementioned Google Safe Browsing lists. As many malicious sites use URL shorteners to disguise themselves, it is safer to first use URL expanders to reveal the real destinations before clicking on shortened links. Although these computational tools are individually easy to access, they are not conveniently available in a single place where different tools can be called up whenever needed.

Awareness of these existing tools and efficient use of them for quick response are vital for combating COVID-19. An associated issue is the requirement for current and frequently updated blacklists [SLH17]. It is infeasible to manually maintain a dynamically changing list of malicious URLs, with new sites being generated every day. Therefore, it is necessary to develop AI/ML identifiers that can learn from known malicious sites to estimate the threats of new ones.

3.3 Ethics challenge

The COVID-19 pandemic is ushering in a new era of digital surveillance, since governments are employing tools that track and monitor individuals. South Korea and Israel, for instance, have demonstrated the effectiveness of harnessing different digital surveillance tools. However, such a new practice can breach data privacy and may even remain in use after the pandemic. In this section, we discuss the potential privacy concerns, the trade-offs between stringent disease monitoring and patient privacy, and the ethical issues behind the disruption of civil liberties.

Gauging the war-like severity of the coronavirus pandemic, academics, researchers, companies, and non-profits alike have come forward to contribute in any possible way. However, given the rapid nature of such responses and the consequent lack of policy checks, these otherwise novel endeavors may have ethical loopholes. In an attempt to provide a transparent view of the degree of infection and prevent community spread of the virus, many counties and states in the United States have decided to publicly release data corresponding to cases, including the number of cases per zip code [Mal20]. Smartphone applications with geo-locating capabilities have come out for users to log their symptoms, but the use of such applications raises significant privacy concerns⁵ [Wet20]. Contact tracing has been identified as an effective way to control the spread of the virus in communities where the infection is not yet widespread or has slowed down significantly, and companies including Google and Apple are currently developing applications to make this possible. Such an application can serve as a reliable tracking tool only when a majority of the population downloads and uses it and voluntarily reports cases. In this situation, there is an obvious trade-off between user health privacy and data transparency, and it is challenging to identify well-defined ethical boundaries when it comes to public health during a pandemic.

⁵ https://privacyinternational.org/examples/apps-and-covid-19

4 Feasibility Discussion

In this section, we present some current efforts that address the aforementioned challenges and show that the three challenges are solvable with collaborative research.

For the data challenge, we collect the publicly available COVID-19 datasets and cluster them into several groups⁶. Under each group, researchers can reference complete datasets from different sources or settings. For example, for social media data, we gather the available tweet corpora on COVID-19 [BTW+20][CLF20], which use different query keywords and time spans. The hierarchical cluster structure in Figure 1 helps researchers quickly locate a dataset. Lastly, our data repository covers academic, news, social media, and epidemic report data for multi-disciplinary research. For example, a researcher who wants to analyze the influence of news or academic findings on social media such as Twitter can use the data under the academic or news topics together with the social media data.

[Figure 1 image omitted. Labels visible in the figure: COVID-19 Datasets, Academic, Published, Pre-print, Social Media, Twitter, Epidemic Report, Resource Report, Case Report, News, Rumor, Fact Checked, Geo-Spatial, Mobility.]
Figure 1: A taxonomy of collected datasets

To help a researcher easily access the datasets in the repository, we build a data-loader⁷. It is a Python package that returns a pandas DataFrame [pdt20] when calling data = DataLoader().download(url). This widely used data format facilitates downstream data analysis.

To tackle the tool challenge, we develop TellMe, a computational tool that estimates whether a piece of news or text is disinformation. Its input includes URLs and text, and its output is a score based on the different functions of TellMe shown in Figure 2: URL Checker, Fake News Classifier, Website Matcher, Credibility, and Trusty. The Trusty [MYL09] and Credibility [AL13] scores are based on the social engagements of the content, under the observation that malicious users resemble one another more than general users do. The fake news score is returned by a state-of-the-art fake news detector [SZL+20]. The website matcher compares the input URL with websites, found by NewsGuard [BC19], that publish false information about the virus. In addition, we are in the process of developing and integrating more components (e.g., advertisement tracker, source attributor) and algorithms [DLBL19, DLL19, DLD+19] into the TellMe system.

Figure 2: The current components of TellMe.

Now, we use fake news as an example to illustrate our attempts to learn with weak social supervision to detect COVID-19 disinformation more effectively and with explainability. First, for effective fake news detection, we consider the relationships among publishers, news pieces, and consumers. This is motivated by existing sociological studies on journalism concerning the correlation between the partisan bias of publishers, the credibility of consumers, and the veracity of news content. We explore various auxiliary information from these relations to help detect fake news [SWL19]. Second, for explainable fake news detection, we aim to derive explanations of prediction results to help decision makers and practitioners; we explore user comments as a source and mine informative and relevant pieces to help explain why a piece of news is predicted as fake, while simultaneously pinpointing the more fictional passages in the news text [SCW+19].

To tackle the ethics challenge that arises from the increase in government surveillance and the prevalence of smartphone apps that collect user/patient data, we need to take into account legitimate concerns regarding privacy and the degree to which such a regime of monitoring and enforcement will affect democracy after the pandemic ends. This requires us to understand and acknowledge that there is a clear difference between standard biomedical ethics and the privacy concerns and ethics that apply during a public health crisis. Governments and public health officials may need to take certain measures for the common good, aimed at minimizing the damage caused by the virus during this trying time, which under normal circumstances might have been inappropriate. Nevertheless, measures could be taken to avoid potential misuse of data. One possible way to obtain better guarantees on user privacy would be to make contact tracing smartphone applications communicate in an encrypted peer-to-peer way rather than storing all the data on a central server. These technologies should also be deployed as transparently as possible, so that users are fully aware of what and how much personal information they permit the application to use. Furthermore, there is significant ongoing discussion among experts, researchers, and policy-makers regarding a steady recovery into a normally functioning society. For example, the ethics research group at Harvard University is working on solutions that do not compromise user privacy, to keep civil liberty and democracy at the forefront.

⁶ https://github.com/bigheiniu/awesome-coronavirus19-dataset
⁷ https://github.com/bigheiniu/COVID-19-Dataloaders

5 Looking Ahead

The significance of combating the COVID-19 infodemic lies in protecting people from falling victim to the pandemic on this unexpected front and from further disruption of already inconvenient daily routines, so as to improve our resilience in the fight to contain the pandemic. In this position paper, we show a good number of problems posed by the COVID-19 infodemic, the vast amounts of data generated in the world's effort to contain the pandemic, and the need for concerted efforts at various levels to efficiently and effectively deal with current and future challenges on the medical and information fronts.

It is evident that (1) we face both immediate and future challenges in this unprecedented fight, (2) existing data will grow fast, and existing computational tools are insufficient to contain and mitigate the COVID-19 infodemic, and (3) short-term solutions can have potential long-term impact. Therefore, when we face hard choices, we need to resist the temptation of expedient trade-offs so as to minimize long-term negative impact; when we search for solutions, we should consider those employing crowdsourcing and take the long view for fairness and responsibility; when we design methods, we should rely on collective wisdom and diversity to aim for robustness; and when we form teams, we should give priority to multi-disciplinary collaboration and preemptively address hidden biases. Our future will always be uncertain, but with the advancement of science and technology and with our preparedness trained and tested in our concerted efforts to contain the pandemic on all fronts, our future will surely be brighter and healthier.

Acknowledgements

This work is, in part, supported by Global Security Initiative (GSI) at ASU and by NSF grants (2029044 and 1614576). We would like to thank Denis Liu for helping develop earlier versions of TellMe and for carefully proofreading an earlier version of this paper.

References

[AL13] Mohammad-Ali Abbasi and Huan Liu. Measuring user credibility in social media. In SBP-BRiMS, 2013.

[BC19] S Brille and G Crovitz. NewsGuard now available on Microsoft Edge mobile apps for iOS and Android, 2019.

[BTW+20] Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yuning Ding, and Gerardo Chowell. A large-scale COVID-19 Twitter chatter dataset for open scientific research – an international collaboration, 2020.

[CAL20] Q. Chen, A. Allot, and Z. Lu. Keep up with the latest coronavirus research. Nature, 2020.

[CLF20] Emily Chen, Kristina Lerman, and Emilio Ferrara. Covid-19: The first public coronavirus Twitter dataset, 2020.

[DLBL19] Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. Deep anomaly detection on attributed networks. In SDM, 2019.

[DLD+19] Kaize Ding, Jundong Li, Shivam Dhar, Shreyash Devan, and Huan Liu. Interspot: interactive spammer detection in social media. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6509–6511. AAAI Press, 2019.

[DLL19] Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 357–365, 2019.

[Don20] Joan Donovan. Here's how social media can combat the coronavirus 'infodemic', 2020. https://www.technologyreview.com/s/615368/facebook-twitter-social-media-infodemic-misinformation/.

[GA20] Emanuele Guidotti and David Ardia. Covid-19 data hub, 04 2020.

[KG20] Kornbluh and Ellen P. Goodman. Safeguarding digital democracy, 2020. http://www.gmfus.org/publications/safeguarding-democracy-against-disinformation.

[Lyn20] Mark Lynas. Covid: Top 10 current conspiracy theories, 2020. https://allianceforscience.cornell.edu/blog/2020/04/covid-top-10-current-conspiracy-theories/.

[Mal20] Laurel Mallory. SC health officials update list of confirmed and estimated coronavirus cases by zip code, 2020. https://www.wtoc.com/2020/04/10/dhec-releases-number-confirmed-estimated-coronavirus-cases-by-zip-code/.

[Mil20] Maggie Miller. Virtual army rising up to protect health care groups from hackers, 2020. https://thehill.com/policy/cybersecurity/493997-virtual-army-rising-up-to-protect-healthcare-groups-from-hackers/.

[MYL09] Sai T Moturu, Jian Yang, and Huan Liu. Quantifying utility and trustworthiness for advice shared on online social media. In CSE, 2009.

[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020.

[SCW+19] Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. dEFEND: Explainable fake news detection. In KDD, 2019.

[SLH17] Doyen Sahoo, Chenghao Liu, and Steven CH Hoi. Malicious URL detection using machine learning: A survey. arXiv preprint arXiv:1701.07179, 2017.

[SWL19] Kai Shu, Suhang Wang, and Huan Liu. Beyond news contents: The role of social context for fake news detection. In WSDM, 2019.

[SZL+20] Kai Shu, Guoqing Zheng, Yichuan Li, Subhabrata Mukherjee, Ahmed Hassan Awadallah, Scott Ruston, and Huan Liu. Leveraging multi-source weak social supervision for early detection of fake news. arXiv preprint arXiv:2004.01732, 2020.

[Wet20] Nicole Wetsman. Personal privacy matters during a pandemic – but less than it might at other times, 2020. https://www.theverge.com/2020/3/12/21177129/personal-privacy-pandemic-ethics-public-health-coronavirus/.