=Paper= {{Paper |id=Vol-2699/paper33 |storemode=property |title=Challenges in Combating COVID-19 Infodemic - Data, Tools, and Ethics |pdfUrl=https://ceur-ws.org/Vol-2699/paper33.pdf |volume=Vol-2699 |authors=Kaize Ding,Kai Shu,Yichuan Li,Amrita Bhattacharjee,Huan Liu |dblpUrl=https://dblp.org/rec/conf/cikm/DingSLB020 }} ==Challenges in Combating COVID-19 Infodemic - Data, Tools, and Ethics== https://ceur-ws.org/Vol-2699/paper33.pdf
           Challenges in Combating COVID-19 Infodemic
                     - Data, Tools, and Ethics

            Kaize Ding                                    Kai Shu                               Yichuan Li
   Arizona State University   Illinois Institute of Technology    Arizona State University
     kaize.ding@asu.edu                  kshu@iit.edu                yichuan1@asu.edu
                   Amrita Bhattacharjee                   Huan Liu
                         Arizona State University                   Arizona State University
                            abhatt43@asu.edu                           huan.liu@asu.edu



Abstract

While the COVID-19 pandemic continues its global devastation, numerous
accompanying challenges emerge. One important challenge we face is to
efficiently and effectively use recently gathered data and find computational
tools to combat the COVID-19 infodemic, a typical information overload
problem. The novel coronavirus presents many questions without ready answers;
its uncertainty and our eagerness in search of solutions offer a fertile
environment for an infodemic. It is thus necessary to combat the infodemic
and make a concerted effort to confront COVID-19 and mitigate its negative
impact in all walks of life while saving lives and maintaining normal order
during trying times. In this position paper on combating the COVID-19
infodemic, we illustrate its need by providing real-world examples of rampant
conspiracy theories, misinformation, and various types of scams that take
advantage of human kindness, fear, and ignorance. We present three key
challenges in this fight against the COVID-19 infodemic where researchers and
practitioners instinctively want to contribute and help. We demonstrate that
these challenges can and will be effectively addressed by collective wisdom,
crowdsourcing, and collaborative research.

Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops October
19-20, Galway, Ireland"
Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi

1     Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The World Health
Organization (WHO) recently declared the COVID-19 outbreak a Public Health
Emergency of International Concern (PHEIC) and a pandemic due to its high
morbidity and mortality rates. As of April 15, 2020, more than 2.04 million
cases have been reported across 210 countries and territories, resulting in
over 133,000 deaths^1. These numbers are continuing to rise, and health
systems in many countries are overwhelmed in their efforts to provide
treatment. Concomitant with the pandemic are many unknowns that create a
conducive environment for misinformation, fake news, political disinformation
campaigns, scams, etc. Such malicious content instigates fear or anger,
capitalizes on human vulnerability, and exploits human emotion, kindness,
and/or wishes for miracles.

^1 https://en.wikipedia.org/wiki/Coronavirus_disease_2019

As the coronavirus spreads like wildfire around the world, disinformation
machines also accelerate their campaigns on various fronts, creating a new
infodemic battlefield. Social media platforms such as Facebook/Instagram,
Twitter, and Google/YouTube have been abused to disseminate erroneous
content. While the whole world is scrambling to fight the COVID-19 pandemic,
governments and the WHO also have to combat an infodemic, which is defined as
“an overabundance of information – some accurate and some not – that makes it
hard for people to find trustworthy sources and reliable guidance when they
need it” [Don20]. The COVID-19 infodemic causes confusion, sows division,
incites hatred, promotes unproven cures, and provokes social panic, which
directly impacts emergency response, treatment, recovery, and financial and
mental health during the difficult time of self-isolation. Therefore,
combating the COVID-19 infodemic is a challenging yet imperative task.

In this paper, we first present some COVID-19 related examples to illustrate
the variety and range of infodemic cases in two representative categories:
conspiracy theories and misinformation, and scams and security attacks. These
examples reinforce the urgency of and need for addressing the COVID-19
infodemic via scalable and timely solutions. We then discuss the essential
challenges in designing and developing corresponding AI solutions from three
perspectives: data, computational tools, and ethics. The last challenge,
ethics, is particularly easy to overlook when we rush to confront the
immediate threats. Therefore, it is important to understand unintended
consequences when developing AI solutions to ensure sustainable and healthy
use and deployment. Last, we use some current efforts to demonstrate the
feasibility of addressing the three challenges in combating the COVID-19
infodemic. Meanwhile, by understanding the challenges and what we have, we
also appreciate the importance of collaborative research for effectively and
efficiently combating the COVID-19 infodemic.

2     Examples of COVID-19 Infodemic

To illustrate what the COVID-19 infodemic looks like, how expansive, active,
and devastating it is, and why it is important to thwart or mitigate its
present threats, we first present various examples regarding conspiracy
theories and misinformation, and scams and other security attacks.

2.1   Conspiracy theories and misinformation

With the spread of the COVID-19 pandemic, the World Health Organization (WHO)
recently warned of an “infodemic” of rampant conspiracy theories about the
coronavirus. Those conspiracy theories have appeared in both social media and
mainstream news outlets and are often intertwined with geopolitics. One
example is about how the new coronavirus originated: according to a Pew
Research Center survey, nearly three-in-ten Americans believe COVID-19 was a
bio-weapon made in the lab. The top 10 conspiracy theories include claims
that the SARS-CoV-2 virus was created as a biological weapon in a lab, that
GMOs are the culprit, that COVID-19 does not actually exist, and that the
coronavirus is a plot by Big Pharma [Lyn20].

Coronavirus misinformation is also flooding the internet through social media
and text messages, and is propagated by celebrities, politicians, or other
prominent public figures. According to the report in [KG20], “among outlets
that repeatedly share false content, eight of the top 10 most engaged-with
sites are running coronavirus stories.” For instance, there are plenty of
supposed “cures” on social media that will likely mislead people to risk
their lives for quick fixes. Despite the National Institutes of Health (NIH)
warning that many hearsay cures have no evidence of being effective, there
are endless claims that herbs, teas, or something of the sort can prevent the
coronavirus. Recently, some wireless towers were damaged in the UK due to a
false claim that radio waves sent by 5G technology are causing small changes
to people’s bodies that make them succumb to the virus.

2.2   Scam, spam, phishing, and malware

As more and more people start working or studying from home, cyber criminals
have recently shifted focus to target remote workers. Different attacks such
as scams, spam, phishing, and malware, which prey on people’s willingness to
help, fear of supply shortage, and moments of weakness, have become
increasingly active. Researchers have found that the volume of coronavirus
email scams nearly tripled in one week, with almost 3% of all global spam now
estimated to be COVID-19 related. During the coronavirus pandemic, as state
governments and hospitals have scrambled to obtain masks and other medical
supplies, scammers attempted to sell a fake stockpile of 39 million masks to
a California labor union. According to The Hill [Mil20], “Hackers are taking
advantage of the increased reliance on networks to target critical
organizations such as health care groups and members of the public, stealing
and profiting off sensitive information and putting lives at risk.”

3     Data, Tool, and Ethics Challenges

The scale, volume, and reach of the COVID-19 infodemic entail reliance on AI
and machine learning (ML) algorithms to react promptly and respond rapidly.
The success of AI and ML algorithms requires large amounts of multi-modal
data for their efficiency and effectiveness, which introduces a data
challenge. Data extraction and curation from multi-source data need different
computational tools to accurately categorize and sort out various types of
data, which presents a tool challenge. When we rush to deal with present
threats, we should be aware of potential side-effects, unexpected
consequences, and biases of our solutions, which suggests an ethics
challenge. In this section, we will discuss these challenges in detail.
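
To make the reliance on ML concrete, the following is a minimal sketch, not a
description of any deployed system, of a classifier that learns from
previously labeled URLs in order to score unseen ones (the idea of learning
from old malicious sites is discussed in Section 3.2). It uses multinomial
naive Bayes over URL tokens, a technique chosen here only for brevity. The
training URLs under the reserved .example domain, the labels, and the names
tokens, TokenNaiveBayes, TRAIN_URLS, and TRAIN_LABELS are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(url):
    """Split a URL into lower-case alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class TokenNaiveBayes:
    """Multinomial naive Bayes over URL tokens with add-one smoothing.

    Labels: 1 = malicious, 0 = benign. Both labels must occur in training.
    """

    def fit(self, urls, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for url, label in zip(urls, labels):
            self.counts[label].update(tokens(url))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_score(self, toks, label):
        # Log prior plus smoothed log likelihood of each known token.
        total = sum(self.counts[label].values()) + len(self.vocab)
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for t in toks:
            if t in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.counts[label][t] + 1) / total)
        return score

    def malicious_probability(self, url):
        toks = tokens(url)
        s0 = self._log_score(toks, 0)
        s1 = self._log_score(toks, 1)
        m = max(s0, s1)  # subtract the max for numerical stability
        e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
        return e1 / (e0 + e1)


# Hypothetical training data for illustration only. The "malicious" URLs are
# invented and use the reserved .example domain; the benign ones are URLs
# cited in this paper.
TRAIN_URLS = [
    "http://covid-cure-now.example/buy-miracle-cure",
    "http://free-masks.example/claim-free-masks-now",
    "http://corona-relief-fund.example/claim-payment",
    "https://en.wikipedia.org/wiki/Coronavirus_disease_2019",
    "https://allenai.org/data/cord-19",
    "https://coronavirus-disasterresponse.hub.arcgis.com/",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    model = TokenNaiveBayes().fit(TRAIN_URLS, TRAIN_LABELS)
    for candidate in [
        "http://miracle-cure-masks.example/claim-now",
        "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    ]:
        print(candidate, round(model.malicious_probability(candidate), 3))
```

A handful of URLs is far too little data for a usable model; the point is
only that the score for a new URL is driven by tokens learned from earlier
examples, so no manual list needs to contain the new site.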
3.1   Data challenge

Though numerous COVID-19 data sources are available online, their datasets
are scattered across various websites that serve different needs. With such
isolated data sources, the major data challenge is simply being aware that
they exist. Another related issue is that they are collected from different
sources or under different crawl settings. For example, the Allen Institute
for AI (AI2) released a scholarly articles dataset^2 collected from PMC,
medRxiv and bioRxiv; LitCovid [CAL20] collected scientific information from
PubMed. Combining different data sources leads to higher quality of data and
better coverage.

To address the data challenges, we need to overcome some shortcomings:
disorganization: most sites merely list all the collected datasets without
information summarizing the relationships among them; specificity: data are
collected for a specific topic, for example, the epidemic dataset that Amazon
provides on the cloud^3 and the COVID-19 GIS Hub^4 contain only academic
findings and geospatial-related datasets, respectively; and inconvenience:
most sites merely provide reference links to the source datasets and do not
provide data utility tools like covid19datahub [GA20] for easy access.

^2 https://allenai.org/data/cord-19
^3 https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/
^4 https://coronavirus-disasterresponse.hub.arcgis.com/

3.2   Computational tool challenge

There are existing resources that can help users identify malicious intent in
websites. Google’s Safe Browsing API, for instance, allows the user to enter
a URL and check it against Google’s constantly updated lists of unsafe web
resources. Similar resources include isitPhishing.org, malwareurl.com, and
antivirus software, among many others. Additionally, users can check
malicious domain lists through different sources such as phishtank.com or the
aforementioned Google Safe Browsing lists. As many malicious sites use URL
shorteners to disguise themselves, it would be safer to first use URL
expanders to reveal where the links lead before clicking them. Although these
computational tools are easy to access individually, they are not
conveniently available in a single place where different tools can be called
up whenever needed.

Awareness of these existing tools and efficient use of them for quick
response are vital for combating COVID-19. An associated issue is the
requirement for current and frequently updated blacklists [SLH17]. As we
know, it is infeasible to manually maintain a dynamically changing list of
malicious URLs, with new sites being generated every day. Therefore, it is
necessary to develop AI/ML identifiers that can learn from the old malicious
sites for estimating the threats of new ones.

3.3   Ethics challenge

The COVID-19 pandemic is ushering in a new era of digital surveillance since
governments are employing tools that track and monitor individuals. South
Korea and Israel, for instance, have demonstrated the effectiveness of
harnessing different digital surveillance tools. However, such a new practice
can breach data privacy in the meantime and may even remain in use after the
pandemic. In this section, we discuss the potential privacy concerns,
trade-offs between stringent disease monitoring and patient privacy, and
ethical issues behind the disruption of civil liberties.

Gauging the war-like severity of the coronavirus pandemic, academics,
researchers, companies and non-profits alike have come forward to contribute
in any possible way. However, given the rapid nature of such responses and
the subsequent lack of policy checks, these otherwise novel endeavors may
have ethical loopholes. In an attempt to provide a transparent view of the
degree of infection and prevent community spread of the virus, many counties
and states in the United States have decided to publicly release data
corresponding to cases, including the number of cases per zip code [Mal20].
Smartphone applications with geo-locating capabilities have come out for
users to log their symptoms. However, the use of such applications raises
significant privacy concerns^5 [Wet20]. Contact tracing has been identified
as an effective way to control the spread of the virus in communities where
the infection is not yet widespread or has slowed down significantly, and
companies including Google and Apple are currently developing applications to
make this possible. Only when a sufficient number of people use the
application and voluntarily report their cases can it be used as a reliable
tool of tracking. In this situation, there is an obvious trade-off between
user health privacy and data transparency, and it is challenging to identify
well-defined ethical boundaries when it comes to public health during a
pandemic. The success of such an app requires a majority of the population to
download and use it.

^5 https://privacyinternational.org/examples/apps-and-covid-19

4     Feasibility Discussion

In this section, we present some current efforts that address the
aforementioned challenges and show that the three challenges are solvable
with collaborative research.
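
As a rough picture of how a hierarchical grouping lets a researcher locate
datasets quickly (the idea behind the taxonomy in Figure 1), the sketch below
indexes a few datasets by category path and looks them up by prefix. The
dataset names and the two URLs are those mentioned in this paper; the
category path assigned to each dataset and the names CATALOG, build_index,
and locate are assumptions made for this illustration and do not describe the
actual repository.

```python
from collections import defaultdict

# Illustrative catalog. Dataset names come from the paper; the category
# paths assigned to them here are assumptions made for this sketch.
CATALOG = [
    {"name": "CORD-19", "path": ("Academic",),
     "url": "https://allenai.org/data/cord-19"},
    {"name": "LitCovid", "path": ("Academic", "Published"), "url": None},
    {"name": "COVID-19 Twitter chatter dataset",
     "path": ("Social Media", "Twitter"), "url": None},
    {"name": "First public coronavirus Twitter dataset",
     "path": ("Social Media", "Twitter"), "url": None},
    {"name": "COVID-19 GIS Hub", "path": ("Geo-Spatial",),
     "url": "https://coronavirus-disasterresponse.hub.arcgis.com/"},
]


def build_index(catalog):
    """Map every category prefix to the dataset names filed under it."""
    index = defaultdict(list)
    for entry in catalog:
        path = entry["path"]
        # File the dataset under each prefix of its path, so that a query
        # for a parent category also returns datasets in its subcategories.
        for depth in range(1, len(path) + 1):
            index[path[:depth]].append(entry["name"])
    return index


def locate(index, *path):
    """Return the sorted dataset names under a category path."""
    return sorted(index.get(tuple(path), []))


if __name__ == "__main__":
    idx = build_index(CATALOG)
    print(locate(idx, "Academic"))
    print(locate(idx, "Social Media", "Twitter"))
```

Filing each dataset under every prefix of its path keeps lookups to a single
dictionary access, at the cost of storing a name once per level.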
For the data challenge, we collect the publicly available COVID-19 datasets
and cluster them into several groups^6. Under each group, researchers can
reference complete datasets from different sources or settings. For example,
in social media data, we gather available tweet corpora on COVID-19
[BTW+20][CLF20] with different query keywords and time spans. The
hierarchical cluster structure in Figure 1 helps researchers quickly locate a
dataset. Lastly, our data repository includes areas in academics, news,
social media, and epidemic reports for multi-disciplinary research. For
example, if a researcher wants to analyze the influence of the news or
academic findings on social media like Twitter, s/he can use the data in
academic or news topics and social media.

[Figure 1: A taxonomy of collected datasets. COVID-19 Datasets branch into
Academic (Published, Pre-print), News (Rumor, Fact Checked), Social Media
(Twitter), Epidemic Report (Resource Report, Case Report), and Geo-Spatial
(Mobility).]

To help a researcher easily access the datasets in the repository, we build a
data-loader^7. It is a Python package that returns a pandas DataFrame [pdt20]
when calling data = DataLoader().download(url). This widely used data format
can help the downstream data analysis.

^6 https://github.com/bigheiniu/awesome-coronavirus19-dataset
^7 https://github.com/bigheiniu/COVID-19-Dataloaders

To tackle the tool challenge, we develop TellMe, a computational tool that
provides an estimate of whether a piece of news or text is disinformation.
Its input includes URLs and text, and its output is a score based on
different functions of TellMe as shown in Figure 2: URL Checker, Fake News
Classifier, Website Matcher, Credibility and Trusty. The Trusty [MYL09] and
Credibility [AL13] scores are based on the social engagements of the content,
building on the observation that malicious users share more similarity with
one another than general users do. The fake news score is returned from a
state-of-the-art fake news detector [SZL+20]. The website matcher compares
the input URL with websites that publish false information about the virus
found by NewsGuard [BC19]. In addition, we are also in the process of
developing and integrating more components (e.g., advertisement tracker,
source attributor) and algorithms [DLBL19, DLL19, DLD+19] into the TellMe
system.

[Figure 2: The current components of TellMe.]

Now, we use fake news as an example to illustrate our attempts to learn with
weak social supervision to detect COVID-19 disinformation more effectively
and with explainability. First, for effective fake news detection, we
consider the relationships among publishers, news pieces, and consumers,
which is motivated by existing sociological studies on journalism on the
correlation between the partisan bias of publishers, the credibility of
consumers, and the veracity degree of news content; and we explore various
auxiliary information from these relations to help detect fake news [SWL19].
Second, for explainable fake news detection, we aim to derive explanations of
prediction results to help decision makers and practitioners; we attempt to
explore user comments as a source and mine informative and relevant pieces to
help explain why a piece of news is predicted as fake, and simultaneously
pinpoint the more fictional parts of the news text [SCW+19].

To tackle the ethics challenge due to the increase in government surveillance
and prevalence of smartphone apps to collect and gather user/patient data, we
need to take into account legitimate concerns regarding privacy and the
degree to which such a regime of monitoring and enforcement will affect
democracy after the pandemic ends. It requires us to understand and
acknowledge the fact that there is a clear difference between standard
biomedical ethics and the privacy concerns and ethics that arise during a
public health crisis. Governments and public health officials may need to
take certain measures aimed at minimizing the damage caused by the virus and
for the common good during this trying time, which under normal circumstances
might have been inappropriate. Nevertheless, measures could be taken to avoid
potential misuse of data. One possible way to have better guarantees on user
privacy would be to make these contact tracing smartphone applications
communicate in an encrypted peer-to-peer way rather than storing all the data
in a central server. These technologies should also be deployed in a way that
is as transparent as possible, so that the user is fully aware of what and
how much personal information he/she permits the application to use.
Furthermore, there is significant ongoing discussion among experts,
researchers and policy-makers regarding a steady recovery into a normal
functioning society. For example, the ethics research group at Harvard
University makes efforts at finding solutions without compromising user
privacy to keep civil liberty and democracy at the forefront.

5     Looking Ahead

The significance of combating the COVID-19 infodemic lies in protecting
people from falling victim to the pandemic on this unexpected front and from
disrupting otherwise already inconvenient daily routines
so as to improve our resilience in our fight to contain the pandemic. In this
position paper, we show a good number of problems posed by the COVID-19
infodemic, the vast amounts of data generated in the world’s effort to
contain the pandemic, and the need for concerted efforts at various levels to
efficiently and effectively deal with current and future challenges on the
medical and information fronts.

It is evident that (1) we face both immediate and future challenges in this
unprecedented fight, (2) existing data will grow fast, and existing
computational tools are insufficient to contain and mitigate the COVID-19
infodemic, and (3) short-term solutions can have potential long-term impact.
Therefore, when we face hard choices, we need to resist the temptation to
make trade-offs, so as to minimize long-term negative impact; when we search
for solutions, we should consider those employing crowdsourcing and take long
views for fairness and responsibility; when we design methods, we should rely
on collective wisdom and diversity to aim for robustness; and when we form
teams, we should give priority to multi-disciplinary collaboration and
preemptively address hidden biases. Our future will always be uncertain, but
with the advancement in science and technology and with our preparedness
trained and tested in our concerted efforts to contain the pandemic on all
fronts, our future will surely be brighter and healthier.

Acknowledgements

This work is, in part, supported by Global Security Initiative (GSI) at ASU
and by NSF grants (2029044 and 1614576). We would like to thank Denis Liu for
helping develop earlier versions of TellMe and for carefully proofreading an
earlier version of this paper.

References

[AL13]      Mohammad-Ali Abbasi and Huan Liu.
            Measuring user credibility in social media.
            In SBP-BRiMS, 2013.

[BC19]      S Brille and G Crovitz. Newsguard now
            available on microsoft edge mobile apps for
            ios and android, 2019.

[BTW+20]    Juan M. Banda, Ramya Tekumalla,
            Guanyu Wang, Jingyuan Yu, Tuo Liu,
            Yuning Ding, and Gerardo Chowell. A
            large-scale covid-19 twitter chatter dataset
            for open scientific research – an
            international collaboration, 2020.

[CAL20]     Q. Chen, A. Allot, and Z. Lu. Keep up
            with the latest coronavirus research.
            Nature, 2020.

[CLF20]     Emily Chen, Kristina Lerman, and Emilio
            Ferrara. Covid-19: The first public
            coronavirus twitter dataset, 2020.

[DLBL19]    Kaize Ding, Jundong Li, Rohit
            Bhanushali, and Huan Liu. Deep anomaly
            detection on attributed networks. In
            SDM, 2019.

[DLD+19]    Kaize Ding, Jundong Li, Shivam Dhar,
            Shreyash Devan, and Huan Liu. Interspot:
            interactive spammer detection in social
            media. In Proceedings of the 28th
            International Joint Conference on Artificial
            Intelligence, pages 6509–6511. AAAI Press,
            2019.

[DLL19]     Kaize Ding, Jundong Li, and Huan
            Liu. Interactive anomaly detection on
            attributed networks. In Proceedings of the
            Twelfth ACM International Conference on
            Web Search and Data Mining, pages
            357–365, 2019.

[Don20]     Joan Donovan. Here’s how social media
            can combat the coronavirus ‘infodemic’, 2020.
            https://www.technologyreview.com/s/615368/facebook-twitter-social-media-infodemic-misinformation/.

[GA20]      Emanuele Guidotti and David Ardia.
            Covid-19 data hub, 04 2020.

[KG20]      Kornbluh and Ellen P. Goodman.
            Safeguarding digital democracy, 2020.
            http://www.gmfus.org/publications/safeguarding-democracy-against-disinformation.

[Lyn20]     Mark Lynas. Covid: Top 10 current
            conspiracy theories, 2020.
            https://allianceforscience.cornell.edu/blog/2020/04/covid-top-10-current-conspiracy-theories/.

[Mal20]     Laurel Mallory. Sc health officials update
            list of confirmed and estimated
            coronavirus cases by zip code, 2020.
            https://www.wtoc.com/2020/04/10/dhec-releases-number-confirmed-estimated-coronavirus-cases-by-zip-code/.

[Mil20]     Maggie Miller. Virtual army rising up to
            protect health care groups from hackers,
            2020.
            https://thehill.com/policy/cybersecurity/493997-virtual-army-rising-up-to-protect-healthcare-groups-from-hackers/.

[MYL09]     Sai T Moturu, Jian Yang, and Huan Liu.
            Quantifying utility and trustworthiness for
            advice shared on online social media. In
            CSE, 2009.
[pdt20]     The pandas development team.
            pandas-dev/pandas: Pandas, February 2020.

[SCW+19]    Kai Shu, Limeng Cui, Suhang Wang,
            Dongwon Lee, and Huan Liu. defend:
            Explainable fake news detection. In KDD,
            2019.

[SLH17]     Doyen Sahoo, Chenghao Liu, and
            Steven CH Hoi. Malicious url detection
            using machine learning: A survey. arXiv
            preprint arXiv:1701.07179, 2017.
[SWL19]     Kai Shu, Suhang Wang, and Huan Liu.
            Beyond news contents: The role of social
            context for fake news detection. In WSDM,
            2019.

[SZL+20]    Kai Shu, Guoqing Zheng, Yichuan Li,
            Subhabrata Mukherjee, Ahmed Hassan
            Awadallah, Scott Ruston, and Huan Liu.
            Leveraging multi-source weak social
            supervision for early detection of fake news.
            arXiv preprint arXiv:2004.01732, 2020.
[Wet20]     Nicole Wetsman.      Personal privacy
            matters during a pandemic — but less
            than it might at other times, 2020.
            https://www.theverge.com/2020/3/12/21177129/personal-privacy-pandemic-ethics-public-health-coronavirus/.