Proceedings of the
ACM SIGKDD Workshop on
Knowledge-infused Mining and Learning for Social Impact
Editors: Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy,
Sameena Shah, Amit Sheth & Biplav Srivastava




 KiML 2020
                        AUGUST 24
                            SAN DIEGO, CA




        First International Workshop on Advancing Decision Making
           in Health, Crisis Response, and Finance




  Co-located with
  26th ACM Conference on Knowledge Discovery and Data Mining
  KDD 2020, San Diego, California
  http://kiml2020.aiisc.ai/
Proceedings of the
ACM SIGKDD Workshop on
Knowledge-infused Mining and Learning (KiML)




Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee.

These proceedings are not included in the ACM Digital Library.

KiML’20, August 24, 2020, San Diego, California, USA.

Copyright © 2020 held by the author(s). In M. Gaur, A. Jaimes, F. Özcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the ACM SIGKDD 2020
Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).



ACM SIGKDD Workshop on
Knowledge-infused Mining and Learning (KiML)
Organizers:
Manas Gaur (AI Institute, University of South Carolina)
Alejandro (Alex) Jaimes (Dataminr Inc., NYC)
Fatma Özcan (IBM Research Almaden)
Srinivasan Parthasarathy (Ohio State University)
Sameena Shah (JP Morgan NYC)
Amit Sheth (AI Institute, University of South Carolina)
Biplav Srivastava (IBM Chief Analytics Office, NYC)

Program Committee:
Nitin Agarwal (University of Arkansas)
Amanuel Alambo (Kno.e.sis Center)
Shreyansh Bhatt (Amazon)
Vasilis Efthymiou (IBM Research)
Utkarshani Jaimini (AI Institute, University of South Carolina)
Ugur Kurşuncu (AI Institute, University of South Carolina)
Sarasi Lalithsena (IBM Watson)
Chuan Lei (IBM Research)
Quanzhi Li (Alibaba Group)
Xiaomo Liu (S&P Global Ratings)
Yong Liu (Outreach.io)
Raghava Mutharaju (IIIT Delhi)
Arindam Pal (Data61, CSIRO)
Sujan Perera (Amazon)
Hemant Purohit (George Mason University)
Kaushik Roy (AI Institute, University of South Carolina)
Valerie Shalin (Wright State University)
Kai Shu (Arizona State University)
Nikhita Vedula (Ohio State University)
Ruwan Wickramarachchi (AI Institute, University of South Carolina)
Ke Zhang (Dataminr Inc.)
Jinjin Zhao (Amazon)

Webmaster:
Vishal Pallagani (AI Institute, University of South Carolina)
Ibrahim Salman (AI Institute, University of South Carolina)
Preface
Research in artificial intelligence and data science is accelerating rapidly due to an unprecedented
explosion in the amount of information on the web. In parallel, we have seen immense growth in the
construction and use of knowledge networks at organizations such as Google, Netflix, NSF, and NIH.
However, current methods risk an unsatisfactory ceiling of applicability due to shortcomings in
integrating knowledge graphs with data mining and deep learning. In this changing world, retrospective
studies of state-of-the-art AI and data science systems have raised concerns about trust, traceability,
and interactivity for prospective applications in healthcare, finance, and crisis response. We believe the
paradigm of knowledge-infused mining and learning accounts both for the knowledge that accrues from
domain expertise and for the guidance that comes from physical models. Further, it will allow the
community to design new evaluation strategies that assess robustness and fairness across all
comparable state-of-the-art algorithms.

The Workshop on Knowledge-infused Mining and Learning for Social Impact was centered around the
following thematic components: (a) Data Management: resource management and resource discovery
across heterogeneous and inconsistent data sources. (b) Data Usage: methods and systems for
visualization, representation, reasoning, and interaction. (c) Evaluation: bringing together researchers
at the intersection of databases, the semantic web, information systems, and AI to create new
approaches and tools that benefit a broad range of policymakers and practitioners (e.g., mental health
professionals, education practitioners, emergency responders, and economists).

The workshop brought together researchers and practitioners from both academia and industry who
are interested in the creation and use of knowledge graphs for understanding online conversations on
crisis response (e.g., COVID-19), public health (e.g., social network analysis for mental health insights),
and finance (e.g., mining insights on the financial impact of COVID-19, such as recession and
unemployment, using Twitter or organizational data). Additionally, we encouraged participation from
researchers and practitioners in the areas of human-centered computing, interaction and reasoning,
statistical relational mining and learning, intelligent agent systems, semantic social network analysis,
deep graph learning, and recommendation systems.

The main program of KiML’20 consisted of seven papers, selected out of thirteen submissions, covering
topics related to knowledge-enabled feature elicitation, adversarial learning, crisis response, public
health, and COVID-19. We sincerely thank the authors of the submissions as well as the attendees of the
workshop. We wish to thank the members of our program committee for their help in selecting
high-quality papers. Furthermore, we are grateful to Manuela Veloso, Sriraam Natarajan, Jose Ambite, and
Pieter De Leenheer for giving keynote presentations on their recent work on Symbiotic Autonomy,
Human Allied Probabilistic Learning, Biomedical Data Science, and Data Intelligence.

                                               Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy,
                                                                   Sameena Shah, Amit Sheth, and Biplav Srivastava
                                                                                                       August 2020
Table of Contents
Invited Talks

Symbiotic Autonomy: Knowing When and What to Learn from Experience
Manuela M. Veloso .......................................................... 1

Human Allied Probabilistic Learning
Sriraam Natarajan .......................................................... 2

Data Intelligence in the 2020s
Pieter De Leenheer ......................................................... 3

Semantics in Biomedical Data Science
Jose Luis Ambite ........................................................... 4

Research Papers

Textual Evidence for the Perfunctoriness of Independent Medical Reviews
Adrian Brasoveanu, Megan Moodie and Rakshit Agrawal ........................ 5

Knowledge Intensive Learning of Generative Adversarial Networks
Devendra Dhami, Mayukh Das and Sriraam Natarajan .......................... 14

Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak
Amanuel Alambo, Manas Gaur and Krishnaprasad Thirunarayan ................. 20

Cost Aware Feature Elicitation
Srijita Das, Rishabh Iyer and Sriraam Natarajan ........................... 26

A New Delay Differential Equation Model for COVID-19
B Shayak, Mohit Manoj Sharma and Manas Gaur ............................... 32

Public Health Implications of a Delay Differential Equation Model for COVID-19
Mohit Manoj Sharma and B Shayak ........................................... 36

Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets
Jitin Krishnan, Hemant Purohit, Huzefa Rangwala ........................... 42
                                    Keynote Talk 1
     Symbiotic Autonomy: Knowing When and What to Learn from Experience

                                     Manuela M. Veloso
                                Head, JPMorgan AI Research
              Herbert A. Simon University Professor, School of Computer Science
                                 Carnegie Mellon University
                               manuela.veloso@jpmchase.com

Abstract:

The talk will present work on novel human-AI interaction, in which humans and AI complement
each other in their knowledge and learning. I will discuss examples in autonomous mobile
service robots and in the financial domain. I will conclude with a brief discussion of multiple
forms of available knowledge for AI systems that continuously learn from experience.


Bio:
Manuela M. Veloso is the Head of J.P. Morgan AI Research, which pursues fundamental research
in areas of core relevance to financial services, including data mining and cryptography, machine
learning, explainability, and human-AI interaction. J.P. Morgan AI Research partners with
applied data analytics teams across the firm as well as with leading academic institutions
globally. Professor Veloso is on leave from Carnegie Mellon University, where she is the Herbert
A. Simon University Professor in the School of Computer Science and the past Head of the
Machine Learning Department. With her students, she has led research in AI, with a focus on
robotics and machine learning, having concretely researched and developed a variety of
autonomous robots, including teams of soccer robots and mobile service robots. Her robot soccer
teams have been RoboCup world champions several times, and the CoBot mobile robots have
autonomously navigated for more than 1,000 km in university buildings. Professor Veloso is the
Past President of AAAI (the Association for the Advancement of Artificial Intelligence) and the
co-founder, Trustee, and Past President of RoboCup. She has been recognized with multiple
honors, including being a Fellow of the ACM, IEEE, AAAS, and AAAI. She is the recipient of
several best paper awards, the Einstein Chair of the Chinese Academy of Science, the
ACM/SIGART Autonomous Agents Research Award, an NSF CAREER Award, and the Allen
Newell Medal for Excellence in Research. Professor Veloso earned Bachelor and Master of
Science degrees in Electrical and Computer Engineering from Instituto Superior Tecnico in
Lisbon, Portugal, a Master of Arts in Computer Science from Boston University, and Master of
Science and Ph.D. degrees in Computer Science from Carnegie Mellon University. See
www.cs.cmu.edu/~mmv/Veloso.html for her scientific publications.
                                    Keynote Talk 2
                           Human Allied Probabilistic Learning

                                       Sriraam Natarajan
                             Director, Center for Machine Learning
                   Erik Jonsson School of Engineering and Computer Science
                                The University of Texas at Dallas
                                sriraam.natarajan@utdallas.edu

Abstract:

Historically, Artificial Intelligence has taken a symbolic route for representing and reasoning
about objects at a higher-level or a statistical route for learning complex models from large data.
To achieve true AI, it is necessary to make these different paths meet and enable seamless
human interaction. First, I will briefly introduce learning from rich, structured, complex, and
noisy data. Next, I will present the recent progress that allows for more reasonable human
interaction where the human input is taken as “advice” and the learning algorithm combines this
advice with data. The advice can be in the form of qualitative influences, preferences over
labels/actions, privileged information obtained during training, or simple precision-recall
trade-off. Finally, I will outline our recent work on "closing-the-loop" where information is
solicited from humans as needed that allows for seamless interactions with the human expert.
While I will discuss these methods primarily in the context of probabilistic and relational
learning, I will also present our results on reinforcement learning and inverse reinforcement
learning.

Bio:

Dr. Sriraam Natarajan is an Associate Professor and the Director of the Center for ML in the
Department of Computer Science at the University of Texas at Dallas. He was previously an
Associate Professor, and earlier an Assistant Professor, at Indiana University and Wake Forest
School of Medicine, was a post-doctoral research associate at the University of
Wisconsin-Madison, and received his Ph.D. from Oregon State University. His research interests
lie in the field of Artificial Intelligence, with emphasis on Machine Learning, Statistical Relational
Learning and AI, Reinforcement Learning, Graphical Models, and Biomedical Applications. He
has received the Young Investigator award from the US Army Research Office, an Amazon
Faculty Research Award, an Intel Faculty Award, a XEROX Faculty Award, a Verisk Faculty
Award, and the IU Trustees Teaching Award from Indiana University. He is the program co-chair
of the SDM 2020 and ACM CoDS-COMAD 2020 conferences. He is a specialty chief editor of the
Frontiers in ML and AI journal, an editorial board member of the MLJ, JAIR, and DAMI journals,
and the electronic publishing editor of JAIR.
                                    Keynote Talk 3
                      Data Intelligence in the Age of Accountability

                                     Pieter De Leenheer
                       Senior Research Fellow, Harvard Business School
                       Co-Founder and Chief Science Officer, Collibra Inc.
                                    pdeleenheer@hbs.edu


Abstract:

Knowledge graphs, machine learning and distributed ledgers are just a few of the emerging
intelligent technologies that unlock new options to innovate business models, augment scientific
knowledge and self-understanding, and enhance decision making. Data being a critical driver for
intelligent systems implies machine calculation may supplant human decision making in many
scenarios. The accessibility, quality and currency of data are necessary criteria to ensure these
systems produce viable innovation options that can be accounted for. But are these criteria
sufficient?

Bio:

Pieter is a senior research fellow at Harvard Business School and serves as adjunct faculty at
Columbia University. He is a cofounder and former Chief Science Officer of Collibra, a unicorn
venture in data intelligence that spun off from his PhD research on community-based ontology
management. Pieter writes, teaches and advises on computing and management aspects of data
innovation, accountability and citizenship. He serves as an expert to the European Commission
and several governments; and as board member of several startups such as Gluetech.com and
Yesse.tech. Prior to cofounding the company, Pieter was a professor at VU University of
Amsterdam. He lives in New York City with his family.
                                     Keynote Talk 4
                           Semantics in Biomedical Data Science

                                       Jose Luis Ambite
                     Research Team Leader, Information Sciences Institute
                 Associate Research Professor, University of Southern California
                                        ambite@isi.edu

Abstract:
There is an explosion of biomedical data that promises to enable novel discoveries, treatments,
and the ultimate goal of personalized medicine. These data are generated in a great variety of
forms, ranging from sensor data, to imaging, to genetics, and all types of clinical data. Moreover,
the data are often scattered across organizations, and even for the same data type are
represented in diverse structures. Thus, the need to provide a semantically consistent view, so
that the data can be meaningfully analyzed, is critical. I will describe core data integration and
knowledge graph construction techniques, namely entity linkage and formal schema mappings,
with illustrative biomedical data integration applications, highlighting some novel neural
semantic similarity methods and some surprising applications of record linkage techniques, such
as efficiently finding genetically related individuals. I will discuss architectures for large scale
data integration and analysis, including sensor data. Finally, I will discuss how we can analyze
distributed datasets when the data cannot be shared for privacy or security reasons, and thus
cannot be integrated. I will describe our recent work on Heterogeneous Federated Learning that
learns common neural models from siloed data.


Bio:
Dr. Jose Luis Ambite is an Associate Research Professor at the Computer Science Department,
and a Research Team Leader at the Information Sciences Institute, at the University of Southern
California. His core expertise is in information integration, including query rewriting under
constraints, learning schema mappings, and entity linkage. Dr. Ambite's research interests include
databases, knowledge representation, the semantic web, semantic similarity, scientific workflows,
and biomedical data science. He has published widely on these topics. He regularly serves as a
reviewer for funding organizations, journals and major conferences. In recent years, he has
focused on developing novel approaches for the integration, analysis, and dissemination of
biomedical and genetic data within several large NIH-funded projects, such as the PRISMS study,
the NIMH Repository and Genetics Resource, SchizConnect, Population Architecture using
Genomics and Epidemiology, and the Education Resource Discovery Index.
Textual Evidence for the Perfunctoriness of Independent Medical Reviews

Adrian Brasoveanu (abrsvn@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Megan Moodie (mmoodie@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Rakshit Agrawal (ragrawal@camio.com), Camio Inc., San Mateo, CA
ABSTRACT
We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers’ compensation; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.

CCS CONCEPTS
• Computing methodologies → Latent Dirichlet allocation; Neural networks.

KEYWORDS
AI for social good, state-managed medical review processes, language models, topic models, sentiment classification

ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual Evidence for the Perfunctoriness of Independent Medical Reviews. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20), 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML’20, August 24, 2020, San Diego, California, USA. © 2020 Copyright held by the author(s). https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
1.1 Origin and structure of IMRs
Independent Medical Review (IMR) processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance – either private insurance or the insurance that is part of their workers’ compensation. In this paper, we focus exclusively on privately insured patients. Laws requiring IMR processes were established in California and other states in the late 1990s because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services to maximize profit.¹

As aptly summarized in [1], IMR is regularly used to settle disputes between patients and their health insurers over what is medically necessary or experimental/investigational care. Medical necessity disputes occur between health plans and patients because the health plan disagrees with the patient’s doctor about the appropriate standard of care or course of treatment for a specific condition. Under the current system of managed care in the U.S., services rendered by a health care provider are reviewed to determine whether the services are medically necessary, a process referred to as utilization review (UR). UR is the oversight mechanism through which private insurers control costs by ensuring that only medically necessary care, covered under the contractual terms of a patient’s insurance plan, is provided. Services that are not deemed medically necessary or that fall outside a particular plan are not covered.

Procedures or treatment protocols are deemed experimental or investigational because the health plan – but not necessarily the patient’s doctor, who in many cases has enough clinical confidence in a treatment to order it – considers them non-routine medical care, or takes them to be scientifically unproven to treat the specific condition, illness, or diagnosis for which their use is proposed.

It is important to realize that the IMR process is usually the third and final stage in the medical review process. The typical progression is as follows. After in-person and possibly repeated examination of the patient, the doctor recommends a treatment,

¹ For California, see the Friedman-Knowles Act of 1996, requiring California health plans to provide external independent medical review (IMR) for coverage denials. As of late 2002, 41 states and the District of Columbia had passed legislation creating an IMR process. In 34 of these states, including California, the decision resulting from the IMR is binding on the health plan. See [1, 15] for summaries of the political and legal history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.
which is then submitted for approval to the patient’s health plan. If the treatment is denied in this first stage, both the doctor and the patient may file an appeal with the health plan, which triggers a second stage of reviews by the health-insurance provider, for which a patient can supply additional information and a doctor may engage in what is known as a “peer to peer” discussion with a health-insurance representative. If these second reviews uphold the initial denial, the only recourse the patient has is the state-regulated IMR process, and per California law, an IMR grievance form (and some additional information) is included with the denial letter.

An IMR review must be initiated by the patient and submitted to the California Department of Managed Health Care (DMHC), which manages IMRs for privately-insured patients. Motivated treating physicians may provide statements of support for inclusion in the documentation provided to DMHC by the patient, but in theory the IMR creates a new relationship of care between the reviewing physician(s) hired by a private contractor on behalf of DMHC, and the patient in question. The reviewing physicians’ decision is supposed to be made based on what is in the best interest of the patient, not on cost concerns. It is this relation of care that constitutes the consumer protection for which IMR processes were legislated. Understandably, given that the patients in question may be ill or disabled or simply discouraged by several layers of cumbersome bureaucratic processes, there is a very high attrition from the initial review to the final, IMR, stage. That is, only the few highly motivated and knowledgeable patients – or the extremely desperate – get as far as the IMR process.

The IMR process is regulated by the state, but it is actually conducted by a third party. At this time (2019), the provider in California and several other states across the US is MAXIMUS Federal Services, Inc.² The costs associated with the IMR review, at least in California, are covered by health insurers. It is DMHC’s and MAXIMUS’s responsibility to collect all the documentation from the patient, the patient’s doctor(s) and the health insurer. There are no independent checks that all the documentation has actually been collected, however, and patients do not see a final list of what has been provided to the reviewer prior to the IMR decision itself (a post facto list of file contents is mailed to patients along with the final, binding, decision; it is unclear what recourse a patient may have if they find pertinent information was missing from the review file). Once the documentation is assembled, MAXIMUS forwards it to anywhere from one to three reviewers, who remain anonymous, but are certified by MAXIMUS to be appropriately credentialed and knowledgeable about the treatment(s) and condition(s) under review. The reviewer submits a summary of the case, and also a rationale and evidence in support of their decision, which is a binary Upheld/Overturned decision about the medical service. IMR reviewers do not enter a consultative relationship with the patient, doctor or health plan – they must render an uphold/overturn decision based solely on the provided medical records. However, as noted above, they are in an implied relationship of care to the patient, a point to which we return in the Discussion section below (§4).

² https://www.maximus.com/capability/appeals-imr

While insurance carriers do not provide statistics about the percentage of requested treatments that are denied in the initial stage, looking at the process as a whole, a pattern of service denial aimed to maximize profit, rather than simply maintain cost effectiveness, seems to emerge. Typically, the argument for denial contends that the evidence for the beneficial effects of the treatment fails the prevailing standard of scientific evidence. This prevailing standard invoked by IMR reviewers is usually randomized control trials (RCTs), which are expensive, time-consuming trials that are run by large pharmaceutical companies only if the treatment is ultimately estimated to be profitable.

RCTs, however, have known limits: they “require minimal assumptions and can operate with little prior knowledge [which] is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded.” [3] Inflexibly applying the RCT “gold standard” in the IMR process is often a way to ignore the doctors’ knowledge and experience in a way that seems superficially well-reasoned and scientific. “RCTs can play a role in building scientific knowledge and useful predictions” – and we add, treatment recommendations – “only [. . . ] as part of a cumulative program, [in combination] with other methods.” [3]

Notably, the experimental/investigational category of treatments that get denied often includes promising treatments that have not been fully tested in clinical RCTs – because the treatment is new or the condition is rare in the population, so treatment development costs might not ultimately be recovered. Another common category of experimental/investigational denials involves “off-label” drug uses, that is, uses of FDA-approved pharmaceuticals for a purpose other than the narrow one for which the drug was approved.

1.2 Main argument and predictions
Recall that these ‘experimental’ treatments or off-label uses are recommended by the patient’s doctor, and therefore their potential benefits are taken to outweigh their possible negative effects. The recommending doctor is likely very familiar with the often lengthy, tortuous and highly specific medical history of the patient, and with the list of ‘less experimental’ treatments that have been proven unsuccessful or have been removed from consideration for patient-specific reasons. It is also important to remember that many rare conditions have no “on-label” treatment options available, since expensive RCTs and treatment approval processes are not undertaken if companies do not expect to recover their costs, which is likely if the potential ‘market’ is small (few people have the rare condition).

Therefore, our main line of argumentation is as follows.
   • Since IMRs are the final stage in a long bureaucratic process in which health insurance companies keep denying coverage for a treatment repeatedly recommended by a doctor as medically necessary, we expect that the issue of medical necessity is non-trivial when that specific patient and that specific treatment are carefully considered.
   • We should therefore expect the text of the IMRs, which justifies the final determination, to be highly individualized and argue for that final decision (whether congruent with the health plan’s decision or not) in a way that involves the particulars of the treatment and the particulars of the patient’s medical history and conditions.

Thus, we expect a reasoned, thoughtful IMR to not be highly generic and templatic / predictable in nature. For instance, legal
documents may be highly templatic as they discuss the application of the same law or policy across many different cases, but a response carefully considering the specifics of a medical case reaching the IMR stage is not likely to be similar to many other cases. We only expect high similarity and ‘templaticity’ for IMR reviews if they are reduced to a more or less automatic application of some prespecified set of rules (rubber-stamping).

1.3 Main results, and their limits
Concomitantly with this quantitative study, we conducted preliminary qualitative research with a focus on pain management and chronic conditions. We investigated the history of the IMR process, in addition to having direct experience with it. We had detailed conversations with doctors in Northern California and on private social media groups formed around chronic conditions and pain management. This preliminary research reliably points towards the possibility that IMR reviews are perfunctory, and that this crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. In this paper, we focus on the text of the IMR decisions and attempt to quantify the evidence for the perfunctoriness of the IMR process that they provide.

The text of the IMR findings does not provide unambiguous evidence about the quality and appropriateness of the IMR process. If we had access to the full, anonymized patient files submitted to the IMR reviewers (in addition to the final IMR decision and the associated text), we might have been able to provide much stronger evidence that IMRs should have a significantly higher percentage of overturns, and that the IMR process should be improved in various ways, e.g., (i) patients should be able to check that all the relevant documentation has been collected and will be reviewed, and (ii) the anonymous reviewers should be held to higher standards of doctor-patient care. At the very least, one would want to compare the reports/letters produced by the patient’s doctor(s) and the IMR texts. However, such information is not available and there are no visible signs suggesting potential availability in the near future. The information that is made available by DMHC constitutes the IMR decision – whether to uphold or overturn the health plan decision –, the anonymized decision letter, and information about the requested treatment category (also available in the letter). We, therefore, had to limit ourselves to the text of the DMHC-provided IMR findings in our empirical analysis.

A qualitative inspection of the corpus of IMR decisions made available by the California DMHC site as of June 2019 (a total of 26,361 cases spanning the years 2001-2019) indicates that the reviews – as documented in the text of the findings – focus more on the review procedure and associated legalese than on the actual medical history of the patient and the details of the case. For example, decisions for chronic pain management seem to mostly rubber-stamp the Medical Treatment Utilization Schedule (MTUS) guidelines, with very little consideration of the rarity of the underlying condition(s) (see our comments about RCTs above), or a thoughtful evaluation of the risk/benefit profile of the denied treatment relative to the specific medical history of the patient (assuming this history was adequately documented to begin with).

The goal in this paper is to investigate to what extent Natural Language Processing (NLP) / Machine Learning (ML) methods that are able to extract insights from large corpora point in the same direction, thus mitigating cherry-picking biases that are sometimes associated with qualitative investigations. In addition to the IMR text, we perform a comparative study with additional English-language datasets in an attempt to eliminate data-specific and problem-specific biases.
   • We analyze the text of the IMR reviews and compare them with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10].
   • As the size of data has significant consequences for language-model training, and NLP/ML models more generally, we expect models trained on the Yelp and IMDB corpora to outperform models trained on the IMR corpus, given that the IMDB corpus is twice as large as the IMR corpus, and the Yelp samples contain almost twice as many reviews.
   • In this paper, we instead demonstrate that we were able to construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation.
   • In addition, the model achieves a much lower perplexity (11.86) and a higher categorical accuracy (0.53) on unseen test data, compared to models trained on the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; categorical accuracy: 0.29 and 0.39).
   • We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews. In an attempt to mitigate confirmation bias, as well as potentially significant register differences between IMRs and movie or restaurant reviews, we examine four additional corpora: drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]. These specialized-register corpora are potentially more similar to IMRs than IMDB or Yelp: the texts are more likely to be highly similar, include boilerplate text and have a templatic/standardized structure. We find that the predictability of IMR texts, as measured by language-model perplexity and categorical accuracy, is higher than for all the comparison datasets by a good margin.

Based on these empirical comparisons, we conclude that we have strong evidence that the IMR reviews are perfunctory and, therefore, that a crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. The paper is structured as follows. In Section 2, we discuss the datasets in detail, with a focus on the nature and characteristics of the IMR data. In Section 3, we discuss the models we use to analyze the IMR, Yelp and IMDB datasets, as well as the four auxiliary corpora (drug reviews, data science jobs, legal cases and recipes). The section also compares and discusses the results of these models. Section 4 puts all the results together into an argument for the perfunctoriness of the IMRs. Section 5 concludes the paper and outlines directions for future work.
2 THE DATASETS

2.1 The IMR dataset
The IMR dataset was obtained from the DMHC website in June 2019³ and was minimally preprocessed. It contains 26,361 cases / observations and 14 variables, 4 of which are the most relevant:
   • TreatmentCategory: the main treatment category;
   • ReportYear: year the case was reported;
   • Determination: indicates if the determination was upheld or overturned;
   • Findings: a summary of the case findings.

The top 14 treatment categories (with percentages of total ≥ 2%), together with their raw counts and percentages, are provided in Table 1.

Table 1: Top 14 treatment categories

   TreatmentCategory        Case count    % of total
   Pharmacy                       6480           25%
   Diag Imag & Screen             4187           16%
   Mental Health                  2599           10%
   DME                            1714            7%
   Gen Surg Proc                  1227            5%
   Orthopedic Proc                1173            5%
   Rehab/ Svc - Outpt             1157            4%
   Cancer Care                    1029            4%
   Elect/Therm/Radfreq             828            3%
   Reconstr/Plast Proc             825            3%
   Autism Related Tx               767            3%
   Emergency/Urg Care              582            2%
   Diag/ MD Eval                   573            2%
   Pain Management                 527            2%

The breakdown of cases by patient gender (not recorded for all cases) is as follows: Female – 14823 (56%), Male – 10836 (41%), Other – 11 (0.04%).

The breakdown by determination (the outcome of the IMR) is: Upheld – 14309 (54%), Overturned – 12052 (46%).

The outcome counts and percentages by year are provided in Table 2. The number of cases for 2019 includes only the first 5 months of the year plus a subset of June 2019.

Table 2: Outcome counts and percentages by year

   ReportYear    Total # of cases    Overturned    Upheld
   2001                        28       7 (25%)        21
   2002                       695     243 (35%)       452
   2003                       738     280 (38%)       458
   2004                       788     305 (39%)       483
   2005                       959     313 (33%)       646
   2006                      1080     442 (41%)       638
   2007                      1342     571 (43%)       771
   2008                      1521     678 (45%)       843
   2009                      1432     641 (45%)       791
   2010                      1453     661 (45%)       792
   2011                      1435     684 (48%)       751
   2012                      1203     589 (49%)       614
   2013                      1197     487 (41%)       710
   2014                      1433     549 (38%)       884
   2015                      2079    1070 (51%)      1009
   2016                      3055    1714 (56%)      1341
   2017                      2953    1391 (47%)      1562
   2018                      2545    1218 (48%)      1327
   2019                       425     209 (49%)       216

Interestingly, the DMHC website featured a graphic in June 2019 (Figure 1) that reports the percentage of Overturned outcomes to be 64%, a figure that does not accord with any of our data summaries. We intend to follow up on this issue and see if the DMHC can share their data-analysis pipeline so that we can pinpoint the source(s) of this difference.

Figure 1: % Overturned claimed on DMHC site (June 2019)

Given that our main goal here is to investigate the text of the IMR findings and its predictiveness with respect to IMR outcomes, we provide some general properties of this corpus. The histogram of word counts for the IMR findings (the text associated with each case) is provided in Figure 2. There are 26,361 texts, with a total of 5,584,280 words. Words are identified by splitting texts on white space (sufficient for our purposes here). The mean length of a text is 211.84 words, with a standard deviation (SD) of 120.58.

³ https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations-trend

2.2 The comparison datasets
As comparison datasets, we use the IMDB movie-review dataset [10], which has 50,000 reviews and a binary positive/negative sentiment classification associated with each review. This dataset will be particularly useful as a baseline for our ULMFiT transfer-learning language models (and subsequent transfer-learning classification models), where we show that we obtain results for the IMDB dataset that are similar to the ones in the original ULMFiT paper [8].

There are 50,000 movie reviews in the IMDB dataset, evenly split into negative and positive reviews. The histogram of text lengths for IMDB reviews is provided in Figure 2. The reviews contain a total of 11,557,297 words. The mean length of a review is 231.15 words, with an SD of 171.32.

We select a sample of 50,000 Yelp (mainly restaurant) reviews [19], with associated binarized negative/positive evaluations, to provide a comparison corpus intermediate between our DMHC dataset and the IMDB dataset. From a total of 560,000 reviews (evenly split between negative and positive), we draw a weighted random sample with the weights provided by the histogram of text lengths for the IMR corpus. The resulting sample contains 25,809 (52%) negative reviews and 24,191 (48%) positive reviews. The histogram of text lengths for Yelp reviews is also provided in Figure 2. The reviews contain a total of 7,038,467 words. The mean length of a review is 140.77 words, with an SD of 71.09.
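The corpus statistics just given, and the length-weighted Yelp sample described above, can be reproduced along the following lines; a minimal sketch, with file names, column names, and the histogram bin width assumed for illustration rather than taken from the paper.

    # Hypothetical sketch: whitespace word counts and a length-weighted sample.
    import numpy as np
    import pandas as pd

    imr = pd.read_csv("imr_determinations.csv")   # assumed 'Findings' column
    yelp = pd.read_csv("yelp_reviews.csv")        # assumed 'text', 'label' columns

    # Words are identified by splitting texts on white space, as in the paper.
    imr_lens = imr["Findings"].str.split().str.len()
    print(len(imr_lens), imr_lens.sum(),
          round(imr_lens.mean(), 2), round(imr_lens.std(), 2))
    # Per the paper: 26,361 texts, 5,584,280 words, mean 211.84, SD 120.58.

    # Weighted random sample of 50,000 Yelp reviews: each review is weighted
    # by the (normalized) IMR histogram bin that its own length falls into,
    # so the sample's length distribution approximates the IMR one.
    bins = np.arange(0, 1300, 25)                 # assumed bin width
    imr_hist, _ = np.histogram(imr_lens, bins=bins, density=True)
    yelp_lens = yelp["text"].str.split().str.len()
    bin_idx = np.clip(np.digitize(yelp_lens, bins) - 1, 0, len(imr_hist) - 1)
    sample = yelp.sample(n=50_000, weights=imr_hist[bin_idx], random_state=0)
    print(sample["label"].value_counts(normalize=True))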




[Three panels: (a) IMR, (b) IMDB, (c) Yelp – normalized histograms of text lengths; y-axis: (normalized) # of texts of a given length]
                                                                 Figure 2: Histograms of text lengths (numbers of words per text) for the IMR, IMDB and Yelp corpora
                  (a) Drug Reviews        (b) DS Jobs        (c) Legal cases        (d) Recipes
                                                                                        Figure 3: Histograms of text lengths (numbers of words per text) for the auxiliary datasets


2.3    Four auxiliary datasets
We will also analyze four other specialized-register corpora: drug reviews [6], data science (DS) job postings [9], legal case reports [5] and cooking recipes [11]. The modeling results for these specialized-register corpora will enable us to better contextualize and evaluate the modeling results for the IMR, IMDB and Yelp corpora, since these four auxiliary datasets might be seen as more similar to the IMR corpus than movie or restaurant reviews. The drug-review corpus contains reviews of pharmaceutical products, which are closer in subject matter to IMRs than movie/restaurant reviews. The other three corpora are all highly specialized in register, just like the IMRs, with two of them (DS jobs and legal cases) particularly similar to the IMRs in that they involve templatic texts containing information aimed at a specific professional sub-community.
   These four corpora are very different from each other and from the IMR corpus in terms of (i) the number of texts that they contain and (ii) the average text length (number of words per text). Because of this, there was no obvious way to sample from them and from the IMR, IMDB and Yelp corpora in such a way that the resulting samples were both roughly comparable with respect to the total number of texts and average text length, and also large enough to obtain reliable model estimates. We therefore analyzed these four corpora as a whole.
   The drug-review corpus includes 132,300 drug reviews – more than double the number of texts in the IMDB and Yelp datasets, and more than 4 times the number of texts in the IMR dataset. From the original corpus of 215,063 reviews, we only retained the reviews associated with a rating of 10, which we label as positive reviews, and a rating of 1 through 5, which we label as negative reviews.⁴ The histogram of text lengths for drug reviews is provided in Figure 3. The reviews contain a total of 11,015,248 words, with a mean length of 83.26 words per review (significantly shorter than the IMR/IMDB/Yelp texts) and an SD of 45.73.
   The DS corpus includes 6,953 job postings (about a quarter of the texts in the IMR corpus), with a total of 3,731,051 words. The histogram of text lengths is provided in Figure 3. The mean length of a job posting is 536.61 words (more than twice as long as the IMR/IMDB/Yelp texts), with an SD of 254.06.
   There are 3,890 legal-case reports (even fewer than DS job postings), with a total of 25,954,650 words (about 5 times larger than the IMR corpus). The histogram of text lengths for the legal-case reports is provided in Figure 3. The mean length of a report is 6,672.15 words (an order of magnitude longer than IMR/IMDB/Yelp), with a very high SD of 11,997.98.
   Finally, the recipe corpus includes more than 1 million texts: there are 1,029,719 recipes, with a total of 117,563,275 words (very large compared to our other corpora). The histogram of text lengths for the recipes is provided in Figure 3. The mean length of a recipe is 114.17 words (close to the length of a drug review, and roughly half of an IMR), with an SD of 90.54.

⁴ We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and 64,295 negative reviews) to estimate classification models like the ones we report for the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review classification results on previously unseen test data are as follows: logistic regression accuracy: 77.89%; accuracy of multilayer perceptron with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.
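The descriptive statistics above are straightforward to reproduce. The following is a minimal sketch, assuming each corpus has already been loaded as a plain list of text strings; `load_corpus` is a hypothetical loader, not part of any released code for this paper.

```python
# Minimal sketch of the per-corpus statistics reported in this section:
# number of texts, total words, and mean/SD of words per text.
import statistics

def length_stats(texts):
    """Word-count statistics for one corpus, using whitespace tokenization."""
    lengths = [len(t.split()) for t in texts]
    return {
        "num_texts": len(lengths),
        "total_words": sum(lengths),
        "mean_length": statistics.mean(lengths),
        "sd_length": statistics.stdev(lengths),
    }

# Example (hypothetical loader):
# stats = length_stats(load_corpus("drug_reviews"))
# print(stats)
```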
3    THE MODELS
In this section, we analyze the text of the IMR findings and its predictiveness with respect to IMR outcomes. We systematically compare these results with the corresponding ones for the IMDB and Yelp corpora. The datasets were split into training (80%), validation (10%) and test (10%) sets. Test sets were only used for the final model evaluation.


We start with baseline classification models (logistic regressions and logistic multilayer perceptrons with one hidden layer) to establish that the reviews in all three datasets under consideration are highly predictive of the associated binary outcomes. Once the predictiveness, and hence the relevance, of the text is established, we turn to an in-depth analysis of the texts themselves by means of topic and language models. We see that the text of the IMR reviews is significantly different (more predictable, less diverse / contentful) when compared to movie and restaurant reviews. We then turn to a final set of classification models that leverage transfer learning from the language models to see how predictive the texts can really be with respect to the associated binary outcomes. Finally, we report the results of estimating language models for the 4 auxiliary datasets introduced in the previous section.
   The main conclusion of this extensive series of models is that the IMR corpus is an outlier, and it would be easy to make the IMR process fully automatic: it is straightforward to train models that generate high-quality, realistic IMR reviews and binary decisions that are very reliably associated with these reviews. In contrast, movie and restaurant reviews produced by unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more human-like depth, sophistication and attention to detail, so current NLP models do not perform as well on them.

3.1    Classification models
We regress outcomes (Upheld/Overturned for IMR, negative/positive sentiment for IMDB/Yelp) against the text of the corresponding findings / reviews. For the purposes of these basic classification models, as well as the topic models discussed in the following subsection, the texts were preprocessed as follows. First, we removed stop words; for the IMR dataset, we also removed the following high-frequency words: patient, treatment, reviewer, request, medical and medically, and for the IMDB dataset, we also removed the words film and movie. After part-of-speech tagging, we retained only nouns, adjectives, verbs and adverbs, since lexical meanings provide the most useful information for logistic (more generally, feed-forward) models and for topic models. The resulting dictionary for the IMR dataset had 23,188 unique words. We ensured that the dictionaries for the IMDB and Yelp datasets were also between 23,000 and 24,000 words by eliminating infrequent words. Bounding the dictionaries for each dataset to a similar range helps mitigate dataset-specific modeling biases: differently-sized vocabularies lead to differently-sized parameter spaces for the models.
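A minimal sketch of this preprocessing pipeline is given below. The paper does not name a specific toolkit, so NLTK (with its Penn Treebank tag set) is an assumption here, as is the function name.

```python
# Sketch of the preprocessing described above (NLTK is an assumption):
# remove stop words plus corpus-specific high-frequency words, POS-tag,
# and keep only nouns (NN*), adjectives (JJ*), verbs (VB*) and adverbs (RB*).
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

STOP = set(stopwords.words("english"))
IMR_EXTRA = {"patient", "treatment", "reviewer", "request", "medical", "medically"}
KEEP_TAGS = ("NN", "JJ", "VB", "RB")

def preprocess(text, extra_stop=IMR_EXTRA):
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    tokens = [w for w in tokens if w not in STOP and w not in extra_stop]
    return [w for w, tag in pos_tag(tokens) if tag.startswith(KEEP_TAGS)]
```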
   We extracted features by converting each text into sparse bag-of-words vectors of dictionary length, which recorded how many times each token occurred in the text. These feature representations were the input to all the classifier models we consider in this subsection. The multilayer perceptron model had a single hidden layer with 1,000 units and a ReLU non-linearity. The classification accuracies on the test data for all three datasets are provided in Table 3.
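Before turning to the results in Table 3, here is a sketch of these baselines; the paper does not name its implementation, so scikit-learn is an assumption, and the splits passed in are placeholders for the 80/10/10 splits described above.

```python
# Sketch of the baseline classifiers: sparse bag-of-words counts over a
# bounded dictionary, fed to a logistic regression and to an MLP with one
# 1,000-unit ReLU hidden layer (scikit-learn is an assumption).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def train_baselines(train_texts, y_train, test_texts, y_test):
    # Bound the dictionary size so vocabularies are comparable across corpora.
    vectorizer = CountVectorizer(max_features=24000)
    X_train = vectorizer.fit_transform(train_texts)  # sparse count matrix
    X_test = vectorizer.transform(test_texts)

    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mlp = MLPClassifier(hidden_layer_sizes=(1000,), activation="relu",
                        early_stopping=True).fit(X_train, y_train)
    return logreg.score(X_test, y_test), mlp.score(X_test, y_test)
```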
Table 3: Classification accuracy for basic models

                              IMR       IMDB      Yelp
    logistic regression       90.75%    86.30%    87.62%
    multilayer perceptron     90.94%    87.14%    88.92%

   We see that the text of the findings / reviews is highly predictive of the associated binary outcomes, with the highest accuracy for the IMR dataset, despite the fact that it contains half the observations of the other two datasets. We can therefore turn to a more in-depth analysis of the texts to understand what kind of textual justification is used to motivate the IMR binary decisions. To that end, we examine and compare the results of two unsupervised/self-supervised types of models: topic models and language models.

3.2    Topic models
Topic modeling [17] is an unsupervised method that distills semantic properties of the words and documents in a corpus in terms of probabilistic topics. The most widespread measure for topic-model evaluation is the coherence score [14]. Typically, as we increase the number of topics from very few (say, 4) to more of them, we see an increase in the coherence score that tends to level out after a certain number of topics. When modeling the IMDB and Yelp datasets, we see exactly this behavior, as shown in Figure 4.
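A sketch of this coherence sweep is given below, using gensim (an assumption; the paper cites [14] for the coherence measure, and the c_v measure used here is likewise our assumed choice). `docs` stands for a list of preprocessed token lists produced by the pipeline above.

```python
# Sketch of the coherence sweep plotted in Figure 4: fit an LDA model for
# each number of topics and score it with a coherence measure.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def coherence_sweep(docs, topic_range=range(4, 21, 2)):
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    scores = {}
    for k in topic_range:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=5, random_state=0)
        cm = CoherenceModel(model=lda, texts=docs,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()
    return scores
```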
   In contrast, the 4-topic model has the highest coherence score (0.56) for the IMR dataset, also shown in Figure 4. Furthermore, as we add more topics, the coherence score drops. As the word clouds for the 4-topic model in Figure 5 show, these 4 topics mostly reflect the legalese associated with the IMR review procedure and very little, if anything, of the treatments and conditions that were the main point of the review. In contrast, the corresponding high-scoring topic models for the IMDB and Yelp datasets reflect actual features of movies, e.g., family-life movies, westerns, musicals etc., or of breakfast/lunch places, restaurants, shops, bars, hotels etc.
   Recall that IMRs are the legally-mandated last resort for patients seeking treatments that are (usually) ordered by their doctors and that their health plan refuses to cover. The reviews are conducted exclusively based on documentation. Putting aside the fact that it is unclear how much effort is taken to ensure that the documentation is complete, especially for patients with extensive and complicated health records, we see that relatively little specific information about a patient's medical history, condition(s), or the recommended treatments is reflected in the text of these decisions. The text seems to consist largely of legalese about the IMR process, the health plan / providers, basic demographic information about the patient, and generalities about the medical service or therapy requested for the enrollee's condition.

3.3    Language models with transfer learning
Neural language models are usually recurrent-network or transformer-based architectures designed to learn textual distributional patterns in an unsupervised or self-supervised manner. Recurrent-network models – on which we focus here – commonly use Long Short-Term Memory (LSTM) [7] "cells," which are able to learn long-term dependencies in sequences. Representing text as a sequence of words, language models build rich representations of the words, sentences, and their relations within a certain language. We estimate a language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8]. As in [8], we use the AWD-LSTM model [12], a vanilla LSTM with 4 kinds of dropout regularization, an embedding size of 400, 3 LSTM layers (1,150 units per layer), and a BPTT window of size 70.



                                                (a) IMR        (b) IMDB        (c) Yelp

                                          Figure 4: Coherence scores for topic models (𝑥-axis: number of topics; 𝑦-axis: coherence score)


Figure 5: Word clouds for the 4-topic IMR model

   The AWD-LSTM model is pretrained on Wikitext-103 [13], consisting of 28,595 preprocessed Wikipedia articles, with a total of 103 million words. This pretrained model is fairly simple (no attention, skip connections etc.), and the pretraining corpus is of modest size.
   To obtain our final language models for the IMR, IMDB and Yelp corpora, we fine-tune the pretrained AWD-LSTM model using discriminative [18] and slanted triangular [8, 16] learning rates. We do the same kind of minimal text preprocessing as in [8].
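A sketch of this fine-tuning step in the fastai v1 API follows (the paper follows [8], but the exact training code and hyperparameters below are our assumptions; `imr_df`, with `text` and `is_valid` columns, is a hypothetical data frame of IMR findings).

```python
# Sketch of ULMFiT language-model fine-tuning with fastai v1 (assumed API).
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

data_lm = TextLMDataBunch.from_df(
    ".", train_df=imr_df[~imr_df.is_valid], valid_df=imr_df[imr_df.is_valid],
    text_cols="text")

# Wikitext-103-pretrained AWD-LSTM; fit_one_cycle yields slanted triangular
# learning rates, and slice(...) yields discriminative per-layer rates.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)                # train the new head first
learn.unfreeze()
learn.fit_one_cycle(10, slice(1e-4, 1e-3))  # fine-tune the whole model
learn.save_encoder("imr_lm_encoder")        # reused by the classifier below
```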
   The perplexity and categorical accuracy for the 3 language models are provided in Table 4. The perplexity for the IMR findings is much lower than for the IMDB / Yelp reviews, and the language model can correctly guess the next word more than half the time.

Table 4: Language-model perplexity and categorical accuracy

                              IMR      IMDB     Yelp
    perplexity                11.86    36.96    40.3
    categorical accuracy      53%      39%      29%
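Perplexity here has its standard interpretation as the exponential of the per-token cross-entropy loss; a minimal worked check of the relationship, under that standard definition:

```python
# Perplexity = exp(per-token cross-entropy in nats), so the IMR perplexity
# of 11.86 corresponds to a per-word cross-entropy of ln(11.86) ≈ 2.47 nats.
import math

def perplexity(avg_loss_nats: float) -> float:
    return math.exp(avg_loss_nats)

print(perplexity(2.47))  # ≈ 11.8
```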
   The IMR language model can generate high-quality and largely coherent text, unlike the IMDB / Yelp models. Two samples of generated text are provided below (the ‘seed’ text is boldfaced).
   • The issue in this case is whether the requested partial hospitalization program ( PHP ) services are medically necessary for treatment of the patient ’s behavioral health condition . The American Psychiatric Association ( APA ) treatment guidelines for patients with eating disorders also consider PHP acute care to be the most appropriate setting for treatment , and suggest that patients should be treated in the least restrictive setting which is likely to be safe and effective . The PHP was initially recommended for patients who were based on their own medical needs , but who were
   • The patient was admitted to a skilled nursing facility ( SNF ) on 12 / 10 / 04 . The submitted documentation states the patient was discharged from the hospital on 12 / 22 / 04 . The following day the patient ’s vital signs were stable . The patient had been ambulating to the community with assistance with transfers , but has not had any recent medical or rehabilitation therapy . The patient had no new medical problems and was discharged in stable condition . The patient has requested reimbursement for the inpatient acute rehabilitation services provided
   We see that the IMR language model is highly performant, despite the simple model architecture we used, the modest size of the pretraining corpus, and the small size of the IMR corpus. The quality of the generated text is also very high, particularly given all these limitations.

3.4    Classification with transfer learning
We further fine-tune the language models discussed in the previous subsection to train classifiers for the three datasets. Following [4, 8], we gradually unfreeze the classifier models to avoid catastrophic forgetting (see the sketch after Table 5).
   The results of evaluating the classifiers on the withheld test sets are provided in Table 5. Despite the fact that the IMR dataset contains half of the classification observations of the other two datasets, we obtain the highest level of accuracy when predicting binary Upheld/Overturned decisions based on the text of the IMR findings.

Table 5: Accuracy for transfer-learning classifiers

                                IMR       IMDB      Yelp
    classification accuracy     97.12%    94.18%    96.16%
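The gradual-unfreezing schedule mentioned above can be sketched as follows, again in the fastai v1 API (an assumption): the fine-tuned encoder saved earlier is reloaded, and the layer groups are unfrozen one at a time; the 2.6 per-layer factor for the discriminative learning rates follows [8], while `train_df`, `valid_df` and the `outcome` column are placeholders.

```python
# Sketch of the transfer-learning classifier with gradual unfreezing
# (fastai v1 API assumed; reuses data_lm and the saved encoder from above).
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

data_clas = TextClasDataBunch.from_df(
    ".", train_df=train_df, valid_df=valid_df,
    text_cols="text", label_cols="outcome", vocab=data_lm.vocab)

clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder("imr_lm_encoder")
clf.fit_one_cycle(1, 2e-2)                        # train only the new head
clf.freeze_to(-2)                                 # unfreeze the last layer group
clf.fit_one_cycle(1, slice(1e-2 / 2.6**4, 1e-2))
clf.freeze_to(-3)                                 # unfreeze one more group
clf.fit_one_cycle(1, slice(5e-3 / 2.6**4, 5e-3))
clf.unfreeze()                                    # finally train the full model
clf.fit_one_cycle(2, slice(1e-3 / 2.6**4, 1e-3))
```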


Table 6: Comparison of language models across all datasets. The best-performing metrics are those for the IMR reviews.

    Dataset            Perplexity    Categorical Accuracy
    IMR reviews        11.86         0.53
    Legal cases        18.17         0.43
    DS Jobs            22.14         0.41
    Drug reviews       25.06         0.36
    Recipes            29.56         0.39
    IMDB               36.96         0.39
    Yelp               40.3          0.29

Figure 6: Comparison of language-model perplexity and categorical accuracy across all the datasets.

3.5    Models for auxiliary corpora
We also estimated topic and language models for the 4 auxiliary corpora (drug reviews, DS jobs, legal cases and cooking recipes). The associations between coherence scores and numbers of topics for these 4 corpora were similar to the ones plotted in Figure 4 above for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best topic models had at least 14 topics, often more, with coherence scores above 0.5. The quality of the topics was also high, with intuitively coherent and contentful topics (just like IMDB / Yelp).
   The perplexity and accuracy of the ULMFiT language models on previously-withheld test data are provided in Table 6, which contains the results for all 7 datasets under consideration in this paper. We see that the predictability of the IMR corpus, as reflected in its perplexity and categorical-accuracy scores, is still clearly higher than that of the 4 auxiliary corpora. The perplexity of the legal-case corpus (18.17) is somewhat close to the IMR perplexity (11.86), but we should remember that the legal-case corpus is about 5 times larger than the IMR corpus. Furthermore, the legal-case categorical accuracy of 43% is still substantially lower than the IMR accuracy of 53%. Notably, even the recipe corpus, which is about 20 times larger than the IMR corpus (≈117.5 vs. ≈5.5 million words), does not have test-set scores similar to the IMR scores.
   The results for these 4 auxiliary corpora indicate that the IMR corpus is an outlier, with very highly templatic and generic texts.

4    DISCUSSION
The models discussed in the previous section show that language-model learning is significantly easier for IMRs compared to the other 6 corpora. As can be seen in Table 6, perplexity in the language model for IMR reviews is clearly lower than even that for legal cases, for which we expect highly templatic language and high similarity between texts. This pattern can be clearly observed in Figure 6, with the IMR corpus clearly at the very end of the high-to-low predictability spectrum.
   One would not expect such highly predictable texts in an ideal scenario, where each medical review is thorough, and each decision is accompanied by strong medical reasoning relying on the specifics of the case at hand, and based on an objective physician's, or team of physicians', opinion as to what is in the patient's best interest. Arguably, these medically complex cases are as diverse as Hollywood blockbusters or fashionable restaurants – the patients themselves certainly experience them as unique and meaningful – and their reviews should be similarly diverse, or at most as templatic as a job posting or a cooking recipe. We would not expect these medical reviews to be so much more predictable and generic than less socially consequential reviews of movies and restaurants.
   What are the ethical and potentially legal consequences of these findings? First, while state legislators assume we have strong health-insurance related consumer protections in place – an image the DMHC goes to great lengths to promote – we find the reviews to be upholding insurance-plan denials at rates that exceed what one might expect, given that the treatments in question are frequently ordered by a treating physician, and that the IMR process is the last stage in a bureaucratically laborious (hence high-attrition) process of appealing health-plan denials.
   Second, given that the IMR process creates an implied relation of care between the reviewers hired by MAXIMUS and the patient – since reviewers are, after all, being entrusted with the best interests of the patient without regard to cost – one can hardly say that they are fulfilling their obligations as doctors to their patient with such seemingly rote, perfunctory reviews.
   Third, if IMR processes were designed to make sure that (i) treatment decisions are being made by doctors, not by profit-driven businesses, and (ii) insurance companies cannot welch on their responsibilities to plan members, one must wonder whether prescribing physicians are wrong more than half the time. Do American doctors really order so many erroneous, medically unnecessary treatments and medications? If so, how is it possible that they are so committed to and confident in them that they are willing to escalate the appeal process all the way to the state-managed IMR stage? Or is it that IMRs often serve as a final rubber stamp for health-insurance plan denials, failing their stated mission of protecting a vulnerable population?
   We end this discussion section by briefly reflecting on the way we used ML/NLP methods for social-good problems in this paper. Overwhelmingly, the social-good applications of these methods and models seem to be predictive in nature: their goal is to improve the outcomes of a decision-making process, and the improvement is evaluated according to various performance-related metrics. An important class of metrics that are currently being developed has to do with ethical, or 'safe,' uses of ML/AI models.
   In contrast, our use of ML models in this paper was analytical, with the goal of extracting insights from large datasets that enable us to empirically evaluate how well an established decision-making


process with high social impact functions. Data analysis of this kind, more akin to hypothesis testing than to predictive modeling, is in fact one of the original uses of statistical models and methods.
   Unfortunately, using ML models in this way does not straightforwardly lead to plots showing how ML models obviously improve metrics like the efficiency or cost of a process. We think, however, that there are as many socially beneficial opportunities for this kind of data-analysis use of ML modeling as there are for its predictive uses. The main difference between them seems to be that the data-analysis uses do not lead to more-or-less immediately measurable products. Instead, they are meant to become part of a larger argument about, and evaluation of, a socially and politically relevant issue, e.g., the ethical status of the health-insurance practices and consumer protections discussed here. What counts as 'success' when ML models are deployed in this way is less immediate, but could provide at least as much social good in the long run.

5    CONCLUSION AND FUTURE WORK
We examined a database of 26,361 IMRs handled by the California DMHC through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance.
   We found that, in a majority of cases, IMRs uphold the health-insurance denial, despite DMHC's claim to the contrary. In addition, we analyzed the text of the reviews and compared them with a sample of 50,000 Yelp reviews and the IMDB movie review corpus. Although these corpora are roughly twice the size of the IMR corpus, we can construct a very good language model for the IMR corpus, as measured by the quality of text generation as well as its low perplexity and high categorical accuracy on unseen test data. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail than IMR reviews, which seem highly templatic and perfunctory in comparison. We see similar trends in topic models and in classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.
   These results were further confirmed by topic and language models for four other specialized-register corpora (drug reviews, data science job postings, legal-case reports, and cooking recipes).
   We are in the process of extending our datasets with (i) workers' comp cases from California and (ii) private insurance cases from other states. This will enable us to investigate whether the reviews for workers' comp cases are substantially different from the DMHC IMR data (the percentage of upheld decisions is much higher for workers' comp: ≈ 90%), as well as whether the reviews vary substantially across states.
   Another direction for future work is to follow up on our preliminary qualitative research with a survey of patients who have experienced the IMR process, to see if these patients agree with the DMHC-promoted message that the IMR process provides strong consumer protection against unjustified health-plan denials. This could also enable us to verify whether the medical documentation collected during the IMR process is complete and actually taken into account when the decision is made.
   The ultimate upshot of this project would be a list of recommendations for the improvement of the IMR process, including but not limited to (i) adding ways for patients to check that all the relevant documentation has been collected and will be reviewed, and (ii) identifying ways to hold the anonymous reviewers to higher standards of doctor-patient care.

ACKNOWLEDGMENTS
We are grateful to four KDD-KiML anonymous reviewers for their comments on an earlier version of this paper. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of two Titan V GPUs used for this research, as well as the UCSC Office of Research and The Humanities Institute for a matching grant to purchase additional hardware. The usual disclaimers apply.
Knowledge Intensive Learning of Generative Adversarial Networks

Devendra Singh Dhami (The University of Texas at Dallas), devendra.dhami@utdallas.edu
Mayukh Das (Samsung Research India), mayukh.das@samsung.com
Sriraam Natarajan (The University of Texas at Dallas), sriraam.natarajan@utdallas.edu

ABSTRACT
While Generative Adversarial Networks (GANs) have accelerated the use of generative modelling within the machine learning community, most applications of GANs are restricted to images. The use of GANs to generate clinical data has been rare due to the inability of GANs to faithfully capture the intrinsic relationships between features. We hypothesize and verify that this challenge can be mitigated by incorporating domain knowledge in the generative process. Specifically, we propose human-allied GANs that use correlation advice from humans to create synthetic clinical data. Our empirical evaluation demonstrates the superiority of our approach over other GAN models.

CCS CONCEPTS
• Deep Learning → Generative Adversarial Networks; • Application → Healthcare; • Learning → Knowledge Intensive Learning.

KEYWORDS
generative adversarial networks, human in the loop, healthcare

ACM Reference Format:
Devendra Singh Dhami, Mayukh Das, and Sriraam Natarajan. 2020. Knowledge Intensive Learning of Generative Adversarial Networks. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA.
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1    INTRODUCTION
Deep learning models have reshaped the machine learning landscape over the past decade [16, 29]. Specifically, Generative Adversarial Networks (GANs) [17] have found tremendous success in generating examples for images [34, 37, 45], photographs of human faces [1, 25, 52], image-to-image translation [30, 33, 55], and 3D object generation [44, 51, 53], to name a few. Despite such success, several key factors limit the widespread adoption of GANs for a broader range of tasks: the widely acknowledged data-hungry nature of such methods, potential access issues with real medical data, and, finally, their restricted usage, mainly in the context of images. These factors have limited the use of these arguably successful techniques in medical (or similar) domains. However, synthetic data generation has recently become a centerpiece of research in medical AI due to the diverse difficulties in the collection, persistence, sharing, and analysis of real clinical data.
   We aim to address the above limitations. Inspired by Mitchell's argument of "The Need for Biases in Learning Generalizations" [38], we mitigate the challenges of existing data-hungry methods via inductive bias while learning GANs. We show that effective inductive bias can be provided by humans in the form of domain knowledge [14, 27, 41, 50]. Rich human advice can effectively balance the impact of the quality (sparsity) of training data. Data quality also contributes to the well-studied modal instability of GANs. This problem is especially critical in domains such as medical/clinical analytics, which, unlike images, do not typically exhibit 'spatial homophily' [21] and are prone to distributional diversity among feature clusters as well. Our human-guided framework proposes a robust strategy to address this challenge. Note that in our setting the human is an ally and not an adversary.
   The second limitation, access, is crucial for medical data generation. Access to existing medical databases [10, 18] is hard due to cost and access concerns, and thus synthetic data generation holds tremendous promise [6, 13, 19, 35, 48]. While previous methods generated synthetic images, we go beyond images and generate clinical data. Building on this body of work, we present a synthetic data generation framework that effectively exploits domain expertise to handle data quality.
   We make a few key contributions:
   (1) We demonstrate how effective human advice can be provided to a GAN as an inductive bias.
   (2) We present a method for generating data given this advice.
   (3) We demonstrate the effectiveness and efficacy of our approach on two de-identified clinical data sets. Our method is generalizable to multiple modalities of data and is not necessarily restricted to images.
   (4) Yet another feature of this approach is that training occurs from very few data samples (< 50 in one domain), thus providing human guidance as a data generation alternative.

2    RELATED WORK
The key principle behind GANs [17] is a zero-sum game [26] from game theory, a mathematical representation where each participant's gain or loss is exactly balanced by the losses or gains of the other participants, and which is generally solved by a minimax algorithm. The generator learns the data distribution 𝑝_data(𝒙) over the given data 𝒙 by sampling 𝒛 from a random distribution 𝑝_𝒛(𝒛) (initially a uniform distribution was proposed, but Gaussians have proven superior [2]). While GANs have proven to be a powerful framework for estimating generative distributions, the convergence dynamics of the naive minimax algorithm have been shown to be unstable.
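For reference, the minimax objective alluded to here is the standard GAN value function of [17]; this restatement (not spelled out in the text above) uses the usual notation, with discriminator D and generator G:

   min_G max_D V(D, G) = E_{𝒙∼𝑝_data(𝒙)}[log D(𝒙)] + E_{𝒛∼𝑝_𝒛(𝒛)}[log(1 − D(G(𝒛)))]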
Some recent approaches, among many others, augment learning either via statistical relationships between the true and learned generative distributions, such as the Wasserstein-1 distance [3] or MMD [32], or via spectral normalization of the parameter space of the generator [39], which keeps the generator distribution from drifting too far. Although these approaches have improved GAN learning in some cases, there is room for improvement.
   Guidance via human knowledge is a provably effective way to control learning in the presence of systematic noise (which leads to instability). One typical strategy for incorporating such guidance is to provide rules over training examples and features. Some of the earliest approaches are explanation-based learning (EBL-NN [49]) and ANNs augmented with symbolic rules (KBANN [50]). Widely studied techniques for leveraging domain knowledge for optimal model generalization include polyhedral constraints in knowledge-based SVMs [9, 14, 28, 47], preference rules [5, 27, 41, 42], and qualitative constraints (e.g., monotonicities/synergies [54] or quantitative relationships [15]). Notably, whereas these models exhibit considerable improvement with the incorporation of human knowledge, there is only limited use of such knowledge in training GANs. Our approach resembles the qualitative constraints framework in spirit.
   While widely successful in building optimally generalized models in the presence of systematic noise (or sample biases), knowledge-based approaches have mostly been explored in the context of discriminative modeling. In the generative setting, a recent work extends the principle of posterior regularization from Bayesian modeling to deep generative models in order to incorporate structured domain knowledge [22]. Traditionally, knowledge-based generative learning has been studied as a part of learning probabilistic graphical models with structure/parameter priors [36]. We aim to extend the use of knowledge to the generative model setting.

3    KNOWLEDGE INTENSIVE LEARNING OF GENERATIVE ADVERSARIAL NETWORKS
A notable disadvantage of the adversarial training formulation is that training is slow and unstable, leading to mode collapse [2], where the generator starts generating data of only a single modality. This has resulted in GANs not being exploited to their full potential in generating synthetic non-image clinical data. Human advice can encourage exploration in diverse areas of the feature space and helps learn more stable models [43]. Hence, we propose a human-allied GAN architecture (HA-GAN) (figure 1). The architecture incorporates human advice in the form of feature correlations. Such intrinsic relationships between the features are crucial in medical data sets and thus become a natural candidate as additional knowledge/advice in guided model learning for faithful data generation.

Figure 1: Human-Allied GAN. Correlation advice moves the generated distribution closer to the real distribution.

   Our approach builds upon a GAN architecture [17] in which a random noise vector is provided to the generator, which tries to generate examples as close to the real distribution as possible. The discriminator tries to distinguish between real examples and ones generated by the generator. The generator tries to maximize the probability that the discriminator makes a mistake, and the discriminator tries to minimize its mistakes, resulting in a min-max optimization problem that can be solved by a minimax algorithm. We adopt the Wasserstein GAN (WGAN) architecture¹ [3, 20], which focuses on defining a distance/divergence (the Wasserstein or earth mover's distance) to measure the closeness between the real distribution and the model distribution.
¹ We use 'GAN' to indicate 'W-GAN'.

3.1    Human input as inductive bias
Historically, two approaches have been studied for using guidance as bias. The first is to provide advice on the labels as constraints or preferences that control the search space. Example advice rules on the labels include: (3 ≤ feature1 ≤ 5) ⇒ label = 1 and (0.6 ≤ feature2 ≤ 0.8) ∧ (4 ≤ feature3 ≤ 5) ⇒ label = 0. Such advice is more relevant in a discriminative setting but is not ideal for GANs: since GANs are known to be sensitive to the training data and here the labels are themselves being generated, the labels should not be altered during training. The second approach is via correlations between features as preferences (our approach), which allows for a faithful representation of diverse modality.
   Advice injection: After every fixed number of iterations N, we calculate the correlation matrix of the generated data G1 and provide a set of advice 𝜓 on the correlations between different features. Consider the following motivating example for the use of correlations as a form of advice.
Example: Consider predicting heart attack with 3 features: cholesterol, blood pressure (BP), and income. The values of the given features can vary (sometimes widely) between different patients due to several latent factors (e.g., smoking habits). It is difficult to assume any specific distribution. In other words, it is difficult to deduce whether the values for the features come from the same distribution (even though the feature values in the data set are similar).
   We modify the correlation coefficients (for both positive and negative correlations) between the features by increasing them if the human advice suggests that two features are highly correlated, and decreasing them if the advice suggests otherwise.
Example: Continuing the above example, since a rise in the cholesterol level can lead to a rise in BP and vice versa, expert advice here can suggest that cholesterol and BP should be highly correlated. Also, as income may not contribute directly to BP and cholesterol levels, another piece of advice here can be to de-correlate cholesterol/BP and income level.
   The example advice rules ∈ 𝜓 are: 1. Correlation("cholesterol level", "BP")↑, 2. Correlation("cholesterol level", "income level")↓ and 3. Correlation("BP", "income level")↓, where ↑ and ↓ indicate increase and decrease respectively. Based on the 1st advice, we need to increase the correlation coefficient between cholesterol level and BP. Then

   C = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1]    A = [1 𝜆 1; 𝜆 1 1; 1 1 1]    (1)

Here C is the correlation matrix, A is the advice matrix, and 𝜆 is the factor by which the correlation value is to be augmented. In the case where we need to increase the value of the correlation coefficient, 𝜆 should be > 1. We keep 𝜆 = 1/max(|C|) (the maximum absolute off-diagonal correlation; 0.3 in the running example). Since −1.0 ≤ 𝑐 ≤ 1.0 for all 𝑐 ∈ C, the value of 𝜆 ≥ 1.0 in this case, leading to enhanced correlation via the Hadamard product.
Thus the new correlation matrix Ĉ is

   Ĉ = C ⊙ A = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1] ⊙ [1 1/0.3 1; 1/0.3 1 1; 1 1 1]
             = [1 0.667 0.3; 0.667 1 0.07; 0.3 0.07 1]    (2)

If the advice says that features have low correlations (the 2nd rule in the example), we decrease the correlation coefficient. Now 𝜆 must be < 1, and we set 𝜆 = max(|C|). Since −1 ≤ 𝑐 ≤ 1.0 for all 𝑐 ∈ C, the value of 𝜆 ≤ 1.0. Thus multiplying by 𝜆 will decrease the correlation value, and the new correlation matrix is

   Ĉ1 = Ĉ ⊙ A = [1 0.667 0.3; 0.667 1 0.07; 0.3 0.07 1] ⊙ [1 1 0.3; 1 1 0.3; 0.3 0.3 1]
              = [1 0.667 0.09; 0.667 1 0.021; 0.09 0.021 1]    (3)

This is used to create the new generated data G̃1. For negative correlations, the process is unchanged.
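To make the advice-injection step concrete, here is a minimal numpy sketch of the Hadamard-product update in Eqs. (1)–(3). The advice encoding as (i, j, direction) tuples, the function names, and the final clipping safeguard are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def build_advice_matrix(C, advice):
    """Build the advice matrix A from (i, j, direction) rules.

    direction = 'up' scales entry (i, j) by 1/max|off-diagonal of C|
    (> 1, strengthening the correlation); 'down' scales it by
    max|off-diagonal of C| (< 1, weakening it), as in Eqs. (1)-(3).
    """
    d = C.shape[0]
    off_diag = np.abs(C[~np.eye(d, dtype=bool)])
    lam_up, lam_down = 1.0 / off_diag.max(), off_diag.max()
    A = np.ones_like(C)
    for i, j, direction in advice:
        lam = lam_up if direction == 'up' else lam_down
        A[i, j] = A[j, i] = lam            # keep A symmetric
    return A

def inject_advice(C, advice):
    """Hadamard-product update of the correlation matrix (C-hat)."""
    C_hat = C * build_advice_matrix(C, advice)
    return np.clip(C_hat, -1.0, 1.0)       # safeguard: stay in [-1, 1]

# Running example from the text: cholesterol (0), BP (1), income (2).
C = np.array([[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1.0]])
advice = [(0, 1, 'up'), (0, 2, 'down'), (1, 2, 'down')]
print(inject_advice(C, advice))
```

With the example C, this reproduces the entries 0.667, 0.09, and 0.021 of Eqs. (2)–(3).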
3.2    Advice-guided data generation
After Ĉ1 is constructed, we next generate data satisfying the constraints. To this effect, we employ the Iman-Conover method [23], a distribution-free method for defining dependencies between distributional variables based on rank correlations such as Spearman or Kendall Tau correlations. Since we deal with linear relationships between the features and assume a normal distribution, and since the Pearson coefficient has been shown to perform equally well with the Iman-Conover method [40] due to the close relationship between Pearson and Spearman correlations, we use Pearson correlations. Further, we assume that the features are Gaussian, justified by the fact that most lab test data is continuous. The Iman-Conover method consists of the following steps:
   [Step 1]: Create a random standardized matrix M with values 𝑥 ∈ M ∼ Gaussian distribution. This is obtained by the process of inverse transform sampling described next. Let V be a uniformly distributed random variable and CDF be the cumulative distribution function. For a sampled point 𝑣, CDF(𝑣) = P(𝑉 ≤ 𝑣). Thus, to generate samples, the values 𝑣 ∼ V are passed through CDF⁻¹ to obtain the desired values 𝑥 [CDF⁻¹(𝑣) = {𝑥 | CDF(𝑥) ≤ 𝑣, 𝑣 ∈ [0, 1]}]. Thus, for the Gaussian,

   CDF(𝑥) = (1/√(2𝜋)) ∫₋∞ˣ exp(−𝑥²/2) d𝑥 = (1/√(2𝜋)) ∫₀ˣ exp(−𝑥²/2) d𝑥 = [−exp(−𝑥²/2)]₀ˣ    (4)

The inverse CDF can thus be written as CDF⁻¹(𝑣) = {𝑥 : 1 − exp(−𝑥²/2) ≤ 𝑣}, and the desired values 𝑥 ∈ M can be obtained as 𝑥 = √(−2 ln(1 − 𝑣)).
   [Step 2]: Calculate the correlation matrix E of M.
   [Step 3]: Calculate the Cholesky decomposition F of the correlation matrix E. The Cholesky decomposition [46] of a positive-definite matrix is given as the product of a lower triangular matrix and its conjugate transpose. Note that for the Cholesky decomposition to be unique, the target matrix should be positive definite (such as the covariance matrix), whereas the correlation matrix used in our algorithm is only positive semi-definite. We enforce positive-definiteness by repeatedly adding very small values to the diagonal of the correlation matrix until positive-definiteness is ensured. Given a symmetric and positive definite matrix E, its Cholesky decomposition F is such that E = F · F⊤.
   [Step 4]: Calculate the Cholesky decomposition Q of the correlation matrix obtained after modifications based on human advice, Ĉ. As above, the Cholesky decomposition is such that Ĉ = Q · Q⊤.
   [Step 5]: Calculate the reference matrix T by transforming the sampled matrix M from Step 1 to have the desired correlations of Ĉ, using their Cholesky decompositions.
   [Step 6]: Rearrange the values in the columns of the generated data G1 to have the same ordering as the corresponding columns of the reference matrix T, to obtain the final generated data G̃1.
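Below is a compact numpy sketch of Steps 1–6 under our reading of the method. The explicit Step 5 transform T = M·F⁻ᵀ·Qᵀ (which maps data whose correlation is E = FFᵀ to data whose correlation is Ĉ = QQᵀ) is our reconstruction, and we draw M directly from numpy's Gaussian sampler rather than via the inverse-CDF construction above; the helper names are ours.

```python
import numpy as np

def nearest_pd(C, eps=1e-8):
    """Nudge a (semi-definite) correlation matrix to positive definite
    by repeatedly adding small values to the diagonal (Step 3)."""
    C = C.copy()
    while np.any(np.linalg.eigvalsh(C) <= 0):
        C += eps * np.eye(C.shape[0])
    return C

def iman_conover(G1, C_hat, rng=np.random.default_rng(0)):
    """Reorder the columns of G1 (n samples x d features) so that they
    follow the advised correlation matrix C_hat."""
    n, d = G1.shape
    M = rng.standard_normal((n, d))              # Step 1: Gaussian scores
    E = np.corrcoef(M, rowvar=False)             # Step 2: correlation of M
    F = np.linalg.cholesky(nearest_pd(E))        # Step 3: E = F F^T
    Q = np.linalg.cholesky(nearest_pd(C_hat))    # Step 4: C_hat = Q Q^T
    T = M @ np.linalg.inv(F).T @ Q.T             # Step 5: reference matrix
    G1_tilde = np.empty_like(G1)
    for j in range(d):                           # Step 6: rank reordering
        ranks = T[:, j].argsort().argsort()
        G1_tilde[:, j] = np.sort(G1[:, j])[ranks]
    return G1_tilde
```

Note that Step 6 only permutes values within each column, so the marginal distribution of every feature in G1 is preserved exactly while the rank-correlation structure of Ĉ is imposed.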
   Cholesky decomposition to model correlations: Given a randomly generated data set with no correlations P, a correlation matrix C, and its Cholesky decomposition Q, data that faithfully follows the given correlations ∈ C can be generated as the product of the obtained lower triangular matrix with the original uncorrelated data, i.e., P̂ = QP. The correlation of the newly obtained data P̂ is

   Corr(P̂) = Cov(P̂)/𝜎_P̂ = (E[P̂P̂⊤] − E[P̂]E[P̂]⊤)/𝜎_P̂    (5)

Since we consider data P̂ from a Gaussian distribution with zero mean and unit variance,

   Corr(P̂) = (E[P̂P̂⊤] − E[P̂]E[P̂]⊤)/𝜎_P̂ = E[P̂P̂⊤] = E[(QP)(QP)⊤] = E[QPP⊤Q⊤] = QE[PP⊤]Q⊤ = QQ⊤ = C    (6)

Thus the Cholesky decomposition can capture the desired correlations faithfully and can be used for generating correlated data. Since we already have a normally sampled matrix M and a calculated correlation matrix E of M, we need to calculate a reference matrix (Step 5).
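Equation (6) is easy to verify numerically. The snippet below is our illustration, using the running-example C; it checks that the rows of QP have correlation ≈ C:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# Target correlation matrix C (running example) and its Cholesky factor Q.
C = np.array([[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1.0]])
Q = np.linalg.cholesky(C)

# P: uncorrelated, zero-mean, unit-variance data, one feature per row.
P = rng.standard_normal((d, n))
P_hat = Q @ P                             # Eq. (6): Corr(P_hat) = Q Q^T = C

print(np.round(np.corrcoef(P_hat), 2))    # ~ C, up to sampling noise
```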
3.3    Human-Allied GAN training
Since the human expert advice is provided independently of the GAN architecture, our method is agnostic to the underlying GAN architecture. We make use of the Wasserstein GAN (WGAN) architecture since it has been shown to be more stable during training and can handle mode collapse [3]. Only the error backpropagation values differ depending on whether we are using the data generated by the underlying GAN or the data generated by the Iman-Conover method. Our algorithm starts with the general process of training a GAN, where the generator takes random noise as input and generates data, which is then passed, along with the real data, to the discriminator. The discriminator tries to identify the real and generated data, and the error is backpropagated to the generator. After every specified number of iterations, the correlation matrix C of the features in the generated data is obtained, and a new correlation matrix Ĉ is obtained with respect to the expert advice (Section 3.1). A new data set is generated with respect to Ĉ using the Iman-Conover method (Section 3.2) and then passed to the discriminator along with the real data set.
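Putting the pieces together, the loop of Section 3.3 can be sketched as below. This is a schematic PyTorch implementation of our own: the network sizes, optimizer, and weight-clipping constant are illustrative WGAN defaults rather than the paper's exact settings, and inject_advice / iman_conover refer to the earlier sketches.

```python
import numpy as np
import torch
import torch.nn as nn

def train_ha_gan(real_data, advice, epochs=10_000, advice_every=1_000,
                 z_dim=16, clip=0.01, lr=5e-5):
    """Minimal HA-GAN-style loop: a WGAN (weight clipping) whose generated
    batches are periodically replaced by advice-corrected batches.
    real_data: (n, d) float32 tensor; advice: list of (i, j, 'up'/'down')."""
    n, d = real_data.shape
    G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, d))
    D = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_g = torch.optim.RMSprop(G.parameters(), lr=lr)
    opt_d = torch.optim.RMSprop(D.parameters(), lr=lr)
    for step in range(1, epochs + 1):
        fake = G(torch.randn(n, z_dim))
        if step % advice_every == 0:          # advice-injection point
            C_hat = inject_advice(
                np.corrcoef(fake.detach().numpy(), rowvar=False), advice)
            fake = torch.tensor(iman_conover(fake.detach().numpy(), C_hat),
                                dtype=torch.float32)
        # critic update: maximize D(real) - D(fake)
        loss_d = -(D(real_data).mean() - D(fake.detach()).mean())
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        for p in D.parameters():              # WGAN weight clipping
            p.data.clamp_(-clip, clip)
        # generator update: maximize D(G(z))
        loss_g = -D(G(torch.randn(n, z_dim))).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
    return G
```

Because the advice-corrected batch is detached, only the critic sees it directly; the generator is pulled toward it indirectly through the critic, which is one consistent reading of "only the error backpropagation values differ".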
4    EXPERIMENTAL EVALUATION
We aim to answer the following questions:
   Q1: Does providing advice to GANs help in generating better quality data?
   Q2: Are GANs with advice effective for data sets that have few examples?
   Q3: How does bad advice affect the quality of generated data?
   Q4: How well does human advice handle class imbalance?
   Q5: How does our method compare to state-of-the-art GAN architectures?
   We consider two real clinical data sets.
   (1) Nephrotic Syndrome is a novel data set of symptoms that indicate kidney damage. It consists of 50 kidney biopsy images along with clinical reports sourced from Dr Lal PathLabs, India². We use the clinical reports, which contain the values for kidney tissue diagnosis; these can confirm the clinical diagnosis, help to identify high-risk patients, influence treatment decisions, and help medical practitioners to plan and prognosticate treatments. The data consists of 19 features with 44 positive and 6 negative examples.
   (2) The MIMIC database [24] consists of de-identified information about patients admitted to critical care units at a large tertiary care hospital. The features included are predominantly time-window aggregations of physiological measurements from the medical records. We selected relevant lab results, vital sign observations, and feature aggregations. The data consists of 18 features with 5813 positive and 40707 negative examples.
   Advice Acquisition: Here we compile the sources from which we obtain the advice.
   (1) Nephrotic Syndrome: This is a novel real data set, and the advice is obtained from a nephrologist in India. According to the problem statement from the expert, nephrotic syndrome involves the loss of a lot of protein, and nephritic syndrome involves the loss of a lot of blood through urine. A kidney biopsy is often required to diagnose the underlying pathology in patients with suspected glomerular disease. The goal of the project is to build a clinical support system that predicts the disease using clinical features, thus reducing the need for kidney biopsy. Since data collection is scarce, a synthetic data set can help in better understanding of the disease from the clinical features.
   (2) MIMIC: The feature set and the expected correlations are obtained in consultation with trauma experts at a Dallas hospital.
   All experiments were run on a 64-bit Intel(R) Xeon(R) CPU E5-2630 v3 server. Both the generator and discriminator are neural networks with 4 hidden layers. To measure the quality of the generated data, we make use of the train on synthetic, test on real (TSTR) method proposed in [12]. We use gradient boosting with 100 estimators and a learning rate of 0.01 as the underlying model. We train the GAN for 10K epochs and provide correlation advice every 1K iterations.
² https://www.lalpathlabs.com/
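The TSTR protocol, as we understand it from [12], is: train a discriminative model on synthetic data only, then measure its performance on held-out real data. A minimal scikit-learn sketch with the gradient-boosting settings quoted above (the function name is ours):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (average_precision_score, f1_score,
                             recall_score, roc_auc_score)

def tstr(X_syn, y_syn, X_real, y_real):
    """Train on synthetic, test on real (TSTR)."""
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.01)
    clf.fit(X_syn, y_syn)                   # fit on generated data only
    pred = clf.predict(X_real)              # evaluate on real data
    score = clf.predict_proba(X_real)[:, 1]
    return {"Recall": recall_score(y_real, pred),
            "F1": f1_score(y_real, pred),
            "AUC-ROC": roc_auc_score(y_real, score),
            "AUC-PR": average_precision_score(y_real, score)}
```

The four returned metrics correspond to the columns of Table 1.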
Table 1: TSTR results (rounded to 3 decimals). N/A for Nephrotic Syndrome denotes that all generated labels were of a single class (0 in our case), so we were not able to run the discriminative algorithm in the TSTR method. 𝐺𝐴 and 𝐵𝐴 denote good and bad advice to our HA-GAN model, respectively.

   Data set   Methods     Recall   F1      AUC-ROC   AUC-PR
   NS         GAN         0.584    0.666   0.509     0.911
   NS         HA-GAN𝐵𝐴    0.42     0.511   0.518     0.886
   NS         medGAN      N/A      N/A     N/A       N/A
   NS         medWGAN     N/A      N/A     N/A       N/A
   NS         medBGAN     N/A      N/A     N/A       N/A
   NS         HA-GAN𝐺𝐴    1.0      0.943   0.566     0.947
   MIMIC      GAN         0.122    0.119   0.495     0.174
   MIMIC      HA-GAN𝐵𝐴    0.285    0.143   0.459     0.235
   MIMIC      medGAN      0.374    0.163   0.478     0.279
   MIMIC      medWGAN     0.0      0.0     0.5       0.562
   MIMIC      medBGAN     0.0      0.0     0.5       0.562
   MIMIC      HA-GAN𝐺𝐴    0.979    0.263   0.598     0.567


Table 1 shows the results of the TSTR method with data generated with advice (HA-GAN𝐺𝐴) and without advice (GAN). It shows that the data generated with advice has higher TSTR performance than the data generated without advice across all data sets and all metrics. Thus, to answer Q1: providing advice to generative adversarial networks captures the relationships between features better and thus allows them to generate better quality synthetic data.
   Learning with less data: GANs with advice are especially impressive, across all metrics, on the nephrotic syndrome data, which consists of only 50 examples and is thus very small compared to the number of samples typically required to train a GAN model. Thus we observe an important property of incorporating human guidance in the GAN model and can answer Q2 affirmatively. The use of advice opens up the potential of using GANs in the presence of sparse data samples.
   Effect of bad advice: Table 1 also shows the results for data generated with bad advice (HA-GAN𝐵𝐴). To simulate bad advice, we follow a simple process: if the advice says that the correlation between features should be high, we set the corresponding correlations in Ĉ to 0, and if the advice says that the correlation should be low, we set the correlations in Ĉ to either 1 or −1 based on whether the original correlation is positive or negative. Thus, given a correlation matrix

   C = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1]    (7)

suppose the advice says that we need to increase the correlation coefficient between feature 1 and feature 2. Then the new correlation matrix after bad advice can be calculated as

   C = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1]    A = [1 𝜆 1; 𝜆 1 1; 1 1 1]    (8)

   Ĉ = C ⊙ A = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1] ⊙ [1 𝜆 1; 𝜆 1 1; 1 1 1]    (9)

where 𝜆 is the factor by which the correlation value is to be augmented. Since the advice asks to increase the correlation, we set 𝜆 = 0 to simulate bad advice. Thus,

   Ĉ = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1] ⊙ [1 0 1; 0 1 1; 1 1 1] = [1 0.0 0.3; 0.0 1 0.07; 0.3 0.07 1]    (10)

Similarly, if the advice says that we need to decrease the correlation coefficient between feature 1 and feature 3, we set 𝜆 = 1/feat_val, the reciprocal of the current correlation value (here 1/0.3):

   Ĉ = [1 0.2 0.3; 0.2 1 0.07; 0.3 0.07 1] ⊙ [1 1 1/0.3; 1 1 1; 1/0.3 1 1] = [1 0.2 1.0; 0.2 1 0.07; 1.0 0.07 1]    (11)

As the results in Table 1 show, giving bad advice adversely affects the performance, thereby answering Q3.
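The bad-advice simulation is a one-line variant of the earlier advice matrix: the scale factor is chosen to push the targeted entry to 0 (for 'up' advice) or to ±1 (for 'down' advice). A minimal sketch, reusing the conventions of the earlier snippets:

```python
import numpy as np

def inject_bad_advice(C, advice):
    """Flip each advice rule: 'up' zeroes the entry, 'down' saturates it."""
    C_hat = C.copy()
    for i, j, direction in advice:
        if direction == 'up':
            C_hat[i, j] = C_hat[j, i] = 0.0             # lambda = 0, Eq. (10)
        else:
            sign = np.sign(C[i, j])                     # lambda = 1/|c_ij|, Eq. (11)
            C_hat[i, j] = C_hat[j, i] = sign * 1.0
    return C_hat
```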
   The nephrotic syndrome and MIMIC data sets are relatively unbalanced, with positive-to-negative ratios of ≈ 8:1 and 1:7, respectively. Most medical data sets, except highly curated ones, are unbalanced, so a data generator model should be able to handle this imbalance. Since our method explicitly focuses on the correlations between features and generates better quality data based on such relationships, it is quite robust to imbalance in the underlying data. This can be seen in the results in Table 1, where advice-based data generation outperforms both the no-advice and the bad-advice data generation. Thus, we can answer Q4 affirmatively.
   To answer Q5, we compare our method to 3 GAN architectures: medGAN [8], which uses an encoder-decoder framework for EHR data generation, and its 2 variants, medBGAN and medWGAN [4]; the results are shown in Table 1. Our method, with good advice, outperforms the baselines in both domains, showing the effectiveness of our method.

5    CONCLUSION
We presented a new GAN formulation that employs correlation information between features as advice to generate new correlated data and train the underlying GAN model. We tested our model on real clinical data sets and showed that incorporating advice helps generate good quality synthetic medical data. We employed the TSTR method to test the quality of the generated data and demonstrated that the data generated with advice is more aligned with the real data. There are several interesting future directions. First, providing advice only when required, in an active fashion, can allow for a significant reduction in the amount of effort on the human side. Second, there can be multiple advice options, such as posterior regularization [15], that can be used to capture feature relationships explicitly. Third, although we do not have identifiers in the data, thereby eliminating the need for differential privacy [11], a general framework that can uphold the privacy of patient data along the lines of using Cholesky decomposition [7, 31] is a natural next step.

ACKNOWLEDGMENTS
DSD and SN gratefully acknowledge DARPA Minerva award FA9550-19-1-0391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

REFERENCES
[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay. 2017. Face aging with conditional generative adversarial networks. In ICIP.
[2] Martin Arjovsky and Leon Bottou. 2017. Towards principled methods for training generative adversarial networks. In ICLR.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. ICML (2017).
[4] Mrinal Kanti Baowaly, Chia-Ching Lin, Chao-Lin Liu, and Kuan-Ta Chen. 2019. Synthesizing electronic health records using improved generative adversarial networks. JAMA (2019).
[5] Darius Braziunas and Craig Boutilier. 2006. Preference elicitation and generalized additive utility. In AAAI.
[6] Anna L Buczak, Steven Babin, and Linda Moniz. 2010. Data-driven approach for creating synthetic electronic medical records. BMC medical informatics and decision making (2010).
[7] Jim Burridge. 2003. Information preserving statistical obfuscation. Statistics and Computing (2003).
[8] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. 2017. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In MLHC.
[9] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[10] Ivo D Dinov. 2016. Volume and value of big healthcare data. Journal of medical statistics and informatics (2016).
[11] Cynthia Dwork. 2008. Differential privacy: A survey of results. In TAMS.
[12] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. 2017. Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633 (2017).
[13] Maayan Frid-Adar, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. 2018. Synthetic data augmentation using GAN for improved liver lesion classification. In ISBI.
[14] Glenn M Fung, Olvi L Mangasarian, and Jude W Shavlik. 2003. Knowledge-based support vector machine classifiers. In NIPS.
[15] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. 2010. Posterior regularization for structured latent variable models. JMLR (2010).
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
[18] Peter Groves, Basel Kayyali, David Knott, and Steve Van Kuiken. 2016. The 'big data' revolution in healthcare: Accelerating value and innovation. (2016).
[19] John T Guibas, Tejpal S Virdi, and Peter S Li. 2017. Synthetic medical images from dual generative adversarial networks. arXiv preprint arXiv:1709.01872 (2017).
[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In NIPS.
[21] Haroun Habeeb, Ankit Anand, Mausam Mausam, and Parag Singla. 2017. Coarse-to-fine lifted MAP inference in computer vision. In IJCAI.
[22] Zhiting Hu, Zichao Yang, Russ R Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P Xing. 2018. Deep Generative Models with Learnable Knowledge Constraints. In NeurIPS.
[23] Ronald L Iman and William-Jay Conover. 1982. A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics - Simulation and Computation (1982).
[24] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data (2016).
[25] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
[26] Harold William Kuhn and Albert William Tucker. 1953. Contributions to the Theory of Games.
[27] Gautam Kunapuli, Phillip Odom, Jude W Shavlik, and Sriraam Natarajan. 2013. Guiding autonomous agents to better behaviors through human advice. In ICDM.
[28] Quoc V Le, Alex J Smola, and Thomas Gärtner. 2006. Simpler knowledge-based support vector machines. In ICML.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature (2015).
[30] Minjun Li, Haozhi Huang, Lin Ma, Wei Liu, Tong Zhang, and Yugang Jiang. 2018. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In ECCV.
[31] Yaping Li, Minghua Chen, Qiwei Li, and Wei Zhang. 2011. Enabling multilevel trust in privacy preserving data mining. TKDE (2011).
[32] Yujia Li, Kevin Swersky, and Rich Zemel. 2015. Generative moment matching networks. In ICML.
[33] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. In NIPS.
[34] Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In NIPS.
[35] Faisal Mahmood, Richard Chen, and Nicholas J Durr. 2018. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE transactions on medical imaging (2018).
[36] V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. 2006. Structured Priors for Structure Learning. In UAI.
[37] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In ICCV.
[38] Tom M Mitchell. 1980. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey.
[39] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative adversarial networks. ICLR (2018).
[40] Klemen Naveršnik and Klemen Rojnik. 2012. Handling input correlations in pharmacoeconomic models. Value in Health (2012).
[41] P. Odom, T. Khot, R. Porter, and S. Natarajan. 2015. Knowledge-Based Probabilistic Logic Learning. In AAAI.
[42] Phillip Odom and Sriraam Natarajan. 2015. Active advice seeking for inverse reinforcement learning. In AAAI.
[43] Phillip Odom and Sriraam Natarajan. 2018. Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI (2018).
[44] Michela Paganini, Luke de Oliveira, and Benjamin Nachman. 2018. CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D (2018).
[45] Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR (2016).
[46] Ernest M Scheuer and David S Stoller. 1962. On the generation of normal random vectors. Technometrics (1962).
[47] Bernhard Schölkopf, Patrice Simard, Alex J Smola, and Vladimir Vapnik. 1998. Prior knowledge in support vector kernels. In Advances in neural information processing systems. 640–646.
[48] Rittika Shamsuddin, Barbara M Maweu, Ming Li, and Balakrishnan Prabhakaran. 2018. Virtual patient model: an approach for generating synthetic healthcare time series data. In ICHI.
[49] Jude W Shavlik and Geoffrey G Towell. 1989. Combining explanation-based learning and artificial neural networks. In Proceedings of the sixth international workshop on Machine learning. Elsevier.
[50] Geoffrey G Towell and Jude W Shavlik. 1994. Knowledge-based artificial neural networks. Artificial intelligence (1994).
[51] Yan Wang, Biting Yu, Lei Wang, Chen Zu, David S Lalush, Weili Lin, Xi Wu, Jiliu Zhou, Dinggang Shen, and Luping Zhou. 2018. 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. NeuroImage (2018).
[52] Zongwei Wang, Xu Tang, Weixin Luo, and Shenghua Gao. 2018. Face aging with identity-preserved conditional generative adversarial networks. In CVPR.
[53] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS.
[54] S. Yang and S. Natarajan. 2013. Knowledge Intensive Learning: Combining Qualitative Constraints with Causal Independence for Parameter Learning in Probabilistic Models. In ECMLPKDD.
[55] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak

Amanuel Alambo (Knoesis Center, Dayton, Ohio), amanuel@knoesis.org
Manas Gaur (AI Institute, University of South Carolina, Columbia, South Carolina), mgaur@email.sc.edu
Krishnaprasad Thirunarayan (Knoesis Center, Dayton, Ohio), tkprasad@knoesis.org

ABSTRACT
The COVID-19 pandemic is having a serious adverse impact on the lives of people across the world. COVID-19 has exacerbated community-wide depression and has led to increased drug abuse brought about by the isolation of individuals as a result of lockdown. Further, apart from providing informative content to the public, the incessant media coverage of the COVID-19 crisis, in terms of news broadcasts, published articles, and sharing of information on social media, has had an undesired snowballing effect on stress levels (further elevating depression and drug use) due to an uncertain future. In this position paper, we propose a novel framework for assessing the spatio-temporal-thematic progression of depression, drug abuse, and informativeness of the underlying news content across the different states in the United States. Our framework employs an attention-based transfer learning technique to apply knowledge learned on a social media domain to a target domain of media exposure. To extract news articles that are related to COVID-19 communications from the streaming news content on the web, we use neural semantic parsing and background knowledge bases in a sequence of steps called semantic filtering. We achieve promising preliminary results on three variations of the Bidirectional Encoder Representations from Transformers (BERT) model. We compare our findings against a report from Mental Health America, and the results show that our fine-tuned BERT models perform better than vanilla BERT. Our study can benefit epidemiologists by offering actionable insights on COVID-19 and its regional impact. Further, our solution can be integrated into end-user applications to tailor news for users based on their emotional tone, measured on the scale of depressiveness, drug abusiveness, and informativeness.

KEYWORDS
COVID-19; Spatio-Temporal-Thematic; Depressiveness; Drug Abuse; Informativeness; Transfer Learning

ACM Reference Format:
Amanuel Alambo, Manas Gaur, and Krishnaprasad Thirunarayan. 2020. Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
The COVID-19 pandemic has changed our societal dynamics in different ways due to the varying impact of news articles and broadcasts on a diverse population. Thus, it is important to place news articles in their spatio-temporal-thematic contexts (Nagarajan et al., 2009; Andrienko et al., 2013; Harbelot et al., 2015) to offer appropriate and timely response and intervention. To limit the scope of this research agenda, we propose to focus on identifying regions that are exposed to depressive and drug abusive news articles and to determine/recommend ways for timely interventions by epidemiologists.

The impact of COVID-19 on mental health has been investigated in recent studies (Garfin et al., 2020; Holmes et al., 2020; Qiu et al., 2020). [4] studied the impact of repeated media exposure on the mental well-being of individuals and its ripple effects. [8] underscore the importance of a multidisciplinary study to better understand COVID-19; specifically, that study explores its psychological, social, and neuroscientific impacts. [12] studied the psychological impact the COVID-19 lockdown had on the Chinese population. These studies, however, do not adequately explore a technique to computationally analyze the regional repercussions associated with media exposure to COVID-19, which may provide a better basis for local grassroots-level action.

We propose an approach to measure depressiveness, drug abusiveness, and informativeness as a result of media exposure for the various states in the US in the months from January 2020 to March 2020. Our study is focused on the first quarter of 2020, as this period was critical in the spread of COVID-19 and its ominous impact; this was a period when the public faced major changes to lifestyle, including lockdown, social distancing, closure of businesses, unemployment, and, broadly speaking, a complete lack of control over the unfolding situation, precipitating severe uncertainty about the impending future. In consequence, this continued media exposure progressively worsened the mental health of individuals across the board. We analyze and score news content on three orthogonal dimensions: spatial, temporal, and thematic. For spatial, we use state boundaries. For temporal, we use monthly data analysis. For thematic, we score news content on the categories of depression, drug abuse, and informativeness (relevant to COVID-19 but not directly connected to either depression or drug abuse).
Figure 1: Spatio-Temporal-Thematic Dimensions

Our study hinges on the use of domain-specific language modeling and transfer learning to better understand how the depressiveness, drug abusiveness, and informativeness of news articles evolve in response to media exposure. We conduct the transfer of knowledge learned on a social media platform to the domain of news exposure using variations of the attention-based BERT model (Devlin et al., 2018), also called vanilla BERT. Thus, in addition to vanilla BERT, we fine-tune BERT models on corpora that are representative of depression and drug abuse, and we compare the results obtained using the three variants of the BERT model. For scoring the depressiveness, drug abusiveness, and informativeness of news articles, we utilize entities from structured domain knowledge from the Patient Health Questionnaire (PHQ-9) lexicon (Yazdavar et al., 2017), the Drug Abuse Ontology (DAO) (Cameron et al., 2013), and DBpedia (Lehmann et al., 2015). The PHQ-9 lexicon is a knowledge base developed specifically for assessing depression, and DAO is built to study drug abuse. Similarly, we use DBpedia, a generic and comprehensive knowledge base, for assessing the informativeness of news content.

Having determined the scores for depressiveness, drug abusiveness, and informativeness of news articles for each state during the three months, we computed the aggregate score for each thematic category by summing the scores of its news articles. We finally assigned the category with the highest score as a label for a state. For instance, if the aggregate score of depressiveness for the state of Iowa in the month of January 2020 is the highest of the three thematic categories, then the state of Iowa is assigned a label of depression for that month, meaning that Iowa was most exposed to depressive news content. Identifying which states are consistently exposed to depressive or drug abusive news content thus enables policy makers and epidemiologists to devise appropriate intervention strategies.

2 DATA COLLECTION
We collected 1.2 million news articles from the Web and GDELT¹ (a resource that stores world news on significant events from different countries) using semantic filtering (Sheth and Kapanipathi, 2016), spanning the period from January 01, 2020, to March 29, 2020. We filtered out news articles that did not originate from within the US and grouped the remaining articles by their state of origination. The state-level grouped news articles had a total of over 150K entities identified using the DBpedia Spotlight service². However, since using a coarse filtering service such as DBpedia Spotlight over entire news articles is inefficient and brings in irrelevant entities, and thus noisy news articles, we use (i) a neural parsing approach with self-attention (Wu et al., 2019) to extract relevant entities. After extracting relevant entities and news articles, we use (ii) the DBpedia Spotlight service to identify news articles that are related to online communications about COVID-19.

Figure 2: Knowledge-based entity extraction using Semantic Filtering

For this task, we explored 780 DBpedia categories that are relevant to COVID-19 communications to create the most relevant set of entities and news articles. Further, upon inspection of the news articles, we discovered medical terms that were not available in DBpedia. As a result, we used (iii) the MeSH terms hierarchy in the Unified Medical Language System (UMLS), the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) lexicon (Gaur et al., 2018), and the Drug Abuse Ontology (DAO), collectively referred to as the Mental Health and Drug Abuse Knowledgebase (MHDA-Kb), to spot additional entities. Thus, from 700K unique news articles (extracted from the total of 1.2 million news articles by removing duplicates), we created a set of 120K unique entities that are described by the 780 DBpedia categories and 225 concepts in MHDA-Kb. Figures 3 and 4 show two examples that illustrate entities spotted during entity extraction on a sample news article. A news article that has entities identified using this sequence of steps is selected for our study.

Figure 3: Example entity extraction-I using Semantic Filtering

Figure 4: Example entity extraction-II using Semantic Filtering

¹ https://www.gdeltproject.org/
² https://www.dbpedia-spotlight.org/
3 METHODS
We propose to use three variations of the BERT model for representing news articles. In its basic form, we use vanilla BERT for encoding news articles. For the remaining two variations, we fine-tune BERT on a binary sequence classification task by independently training on two corpora, using the masked language modeling (MLM) and next sentence prediction (NSP) objectives. The two corpora are: 1) Subreddit Depression (Gkotsis et al., 2017; Gaur et al., 2018); and 2) a combination of the subreddits Crippling Alcoholism, Opiates, Opiates Recovery, and Addiction (abbreviated COOA), each consisting of Reddit posts about drug abuse. Subreddit Depression has 760,049 posts across 121,795 Redditors, and COOA has 1,416,765 posts from 46,183 users, both consisting of posts from the years 2005-2016. Reddit posts belonging to subreddit Depression or COOA are considered positive classes, and the 380,444 posts from a control group (~10K subreddits unrelated to mental health) are negative classes. We use the following settings for training our BERT model for sequence classification: a training batch size of 16, a maximum sequence length of 256, the Adam optimizer with a learning rate of 2e-5, 10 training epochs, and a warmup proportion of 0.1. We used a 40%-60% train-test split for creating the BERT models and achieved a test accuracy of 89% for Depression-BERT and 78% for Drug Abuse-BERT. We set the training set smaller than the testing set to assess the generalizability of our models. In this manuscript, we refer to the BERT model fine-tuned on subreddit Depression as Depression-BERT or DPR-BERT, and the one fine-tuned on COOA as Drug Abuse-BERT or DA-BERT.
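As a rough illustration of the fine-tuning configuration above (batch size 16, maximum sequence length 256, Adam at 2e-5, 10 epochs, warmup proportion 0.1), the following sketch uses the Hugging Face transformers API; `load_reddit_corpus` is a placeholder for the subreddit/control data, and the domain-adaptive MLM/NSP step is omitted for brevity.

```python
# A minimal sketch of the binary sequence-classification fine-tuning
# described above, under the stated hyperparameters.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          get_linear_schedule_with_warmup)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts, labels = load_reddit_corpus()  # placeholder: subreddit posts vs. control posts
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=256, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"],
                                  torch.tensor(labels)),
                    batch_size=16, shuffle=True)

epochs = 10
total_steps = len(loader) * epochs
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps)

model.train()
for _ in range(epochs):
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()   # cross-entropy over the two classes
        optimizer.step()
        scheduler.step()
```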
In addition to using BERT for encoding news content, we also use it to represent the entities in the background knowledge bases (i.e., PHQ-9, DAO, and DBpedia). Once we have encoded the news articles and the entities in the knowledge bases using vanilla BERT or a fine-tuned BERT model, we generate the depressiveness, drug abusiveness, and informativeness scores corresponding to the entities in PHQ-9, DAO, and DBpedia, respectively. The equation below gives the score of a news article for a category given one of the BERT models:

$$Score^{m}_{c}(\text{news}) = \frac{1}{|E_{KB}|} \sum_{e=1}^{|E_{KB}|} cos_{sim}(\text{news}, e) \qquad (1)$$

where m ∈ {vanilla-BERT, DPR-BERT, DA-BERT}; c ∈ {informativeness, depressiveness, drug abuse}; $cos_{sim}(\text{news}, e)$ is the cosine similarity between a news content and an entity in the KB; and KB is a collection of entities present in PHQ-9, DBpedia, or DAO.
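A minimal sketch of Equation (1): the category score is the average cosine similarity between a news article's encoding and each entity encoding from the corresponding knowledge base. The function names and the assumption that encodings arrive as precomputed vectors are ours.

```python
# Equation (1) as code: average cosine similarity between a news vector
# and all entity vectors of one knowledge base (PHQ-9, DAO, or DBpedia).
import numpy as np

def category_score(news_vec, entity_vecs):
    """Score_c^m(news) = (1/|E_KB|) * sum_e cos_sim(news, e)."""
    news = news_vec / np.linalg.norm(news_vec)
    ents = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    return float(np.mean(ents @ news))   # mean of per-entity cosines

# One score per thematic category, given encoded KB entities:
# scores = {c: category_score(encode(article), kb_vecs[c])
#           for c in ("depressiveness", "drug abuse", "informativeness")}
```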
We used the base variant of the BERT model, with 12 layers, 768 hidden units, and 12 attention heads, and PyTorch 1.5.0+cu101 for fine-tuning our BERT models. All our programs were run on Google Colab's NVIDIA Tesla P100 PCI-E GPU.

4 PRELIMINARY RESULTS AND DISCUSSION
In this section, we report the state-wise labels (i.e., depressive, drug abusive, or informative) for each month, obtained after summing the scores of news articles as described. The category with the highest cumulative score is set as the label for a state.

Using vanilla BERT (Figure 5), we can see that no state shows exposure to news content on drug abuse in January. Going from February to March, we see depressive news content move from inner states such as Missouri, Kansas, and Colorado to border states such as California, Montana, North Dakota, and Louisiana, making way for informative news content. Further, there are fewer states exposed to drug-related news content than to depressive or informative news content in February or March. In particular, Arizona and Virginia show consistent exposure to drug-related news content in February and March.

Using Depression-BERT (Figure 6), we see that states such as Texas and Kansas are exposed to depressive news content in January and February, while states such as California, Montana, Alaska, and Michigan show higher consumption of depressive news content in February and March. With regard to informativeness, we see an overall even distribution of informative news content across the nation in February and March. Further, a few midwestern states show relatively more news content that is informative than depressive in February and March. It is interesting to see a few southern states, such as Oklahoma, Texas, and Arkansas, transition from exposure to depressive news content in February to drug use related news content in March.

Using the Drug Abuse-BERT model (Figure 7), states such as Texas and Wisconsin shift from exposure to depressive news content in January to drug-related news content in February, while states such as California and Oklahoma transition from exposure to depressive news content in February to drug-related news content in March. Further, we see the informativeness of news content sweeping from the east to the midwest, to parts of the south, and to some parts of the west from February to March.

Our results show that a fine-tuned BERT model cleanly separates the thematic categorical scores for a state. For instance, using DA-BERT for the month of March, the drug abuse score for the state of California is much higher than its depressiveness or informativeness score. With the vanilla BERT model, however, the three scores computed for the various states and months are only marginally different. Moreover, the results using DPR-BERT or DA-BERT capture the state-level ranking of mental disorders by Mental Health America³ better than vanilla BERT; for a few states, the fine-tuned BERT models identify more months as having media exposure to depressive or drug abuse news content.

³ https://www.mhanational.org/issues/ranking-states
Figure 5: Vanilla BERT modeling of Depressiveness, Drug Abuse, and Informativeness in US states.

Figure 6: Depression-BERT (DPR-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states.

Figure 7: Drug Abuse-BERT (DA-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states.
As indicated in Table 1, we report the months showing predominant media exposure to either depressive or drug abuse news articles using the three variants of the BERT model. We use 10 of the 13 states recognized as showing high prevalence of mental disorders according to a report by Mental Health America on overall mental disorder ranking. The 3 states not included in this table are Washington, Wyoming, and Idaho; we did not consider them because they were not in our dataset cohort. For the Mental Health America (MHA) report, we make the practical assumption that each of the three months is either depressive or drug abusive for each state. Thus, our objective is to maximize the number of months with exposure to depressive/drug abuse news content for each of the 10 states. We can see in Table 1 that the fine-tuned BERT models identify more months as having exposure to depressive or drug abuse news content than vanilla BERT does for the 10 states. For example, using DA-BERT, five states are identified as having at least two months showing exposure to depressive/drug abuse news content, while DPR-BERT identifies six states as having been exposed to depressive/drug abuse news content for two months. On the other hand, vanilla BERT identifies only two states with depressive/drug abuse news content for two months. To compare the models with one another and against the report by Mental Health America (MHA), we compute a Jaccard index between each pair of models and between each model and the MHA report:

$$J(m_1, m_2) = \sum_{i \in S} \frac{|m_1^M \cap m_2^M|}{|m_1^M \cup m_2^M|} \qquad (2)$$

where $m_1, m_2$ ∈ {vanilla-BERT, DPR-BERT, DA-BERT, MHA}; S is the set of states in the US (Table 1); and $m_1^M, m_2^M$ are the numbers of depressive, drug abusive, or informative months for a state i. We report the inter-model and model-to-MHA Jaccard similarity scores computed using equation (2) in Figure 8.

As shown in Figure 8, DA-BERT gives the best results against the MHA report in Jaccard similarity (0.53), which means DA-BERT identifies over half of the state-to-month instances in MHA. On the other hand, vanilla BERT has a Jaccard similarity of 0.37 with MHA, which can be interpreted as vanilla BERT identifying a little over one-third of the state-to-month instances in MHA. The best Jaccard similarity is achieved between DPR-BERT and vanilla BERT (0.7); thus, 70% of state-to-month mappings are shared between DPR-BERT and vanilla BERT. It is interesting to see that DA-BERT has the same Jaccard similarity with vanilla BERT and DPR-BERT, subsuming the former and being subsumed by the latter in terms of depressive/drug abusive months.
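A small sketch of the comparison in Equation (2), where each model is summarized as a mapping from a state to its set of flagged months. As written, the equation sums per-state overlaps; since the Figure 8 values lie in [0, 1], the sketch below averages over states, which is one plausible reading, not a confirmed detail of the authors' computation.

```python
# A hedged sketch of the state-to-month Jaccard comparison (Eq. 2).
def jaccard(m1, m2, states):
    """Average per-state Jaccard overlap of flagged months.
    m1, m2 map a state name to a set like {"Feb", "Mar"}."""
    scores = []
    for s in states:
        a, b = m1.get(s, set()), m2.get(s, set())
        scores.append(len(a & b) / len(a | b) if (a | b) else 1.0)
    return sum(scores) / len(scores)

# Illustrative use: per the paper's assumption, MHA flags every month
# for each high-prevalence state.
da_bert = {"Tennessee": {"Feb", "Mar"}, "Utah": {"Mar"}}
mha = {s: {"Jan", "Feb", "Mar"} for s in ("Tennessee", "Utah")}
print(jaccard(da_bert, mha, ["Tennessee", "Utah"]))
```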
Table 1: Evaluation of base and domain-specific BERT models for MHA states over the period of three months (January, February, and March). These three months showed high dynamicity in COVID-19 spread. Each cell lists the months with depression/drug abuse exposure.

MHA states with high DPR and DA | vanilla-BERT | DA-BERT       | DPR-BERT
Tennessee                       | Feb, Mar     | Feb, Mar      | Feb, Mar
Alabama                         | Feb          | Feb, Mar      | Feb
Oklahoma                        | Mar          | Feb, Mar      | Feb, Mar
Kansas                          | Feb          | Jan, Feb      | Jan, Feb
Montana                         | Mar          | Feb           | Feb, Mar
South Carolina                  | Mar          | Mar           | Feb, Mar
Alaska                          | Feb, Mar     | Jan, Feb, Mar | Feb, Mar
Utah                            | Mar          | Mar           | Mar
Oregon                          | None         | Feb           | None
Nevada                          | Feb          | Feb           | None

Figure 8: Inter-BERT-model and BERT-model-to-MHA Jaccard similarity scores as a measure of the closeness of a model's predictions to an extensive survey by Mental Health America (MHA).

5 CONCLUSION
In this paper, we model the depressiveness, drug abusiveness, and informativeness of news articles to assess the dominant category characterizing each US state during each of the three months (Jan 2020 to Mar 2020). We demonstrate the power of transfer learning by fine-tuning an attention-based deep learning model on a different domain and using the domain-tuned model for gleaning the nature of media exposure. Specifically, we use background knowledge bases for measuring the depressiveness, drug abusiveness, and informativeness of news articles. We found that DA-BERT identifies the largest number of state-to-month instances as being exposed to depressive or drug abuse news content according to the report from Mental Health America. In the future, we plan to incorporate background knowledge bases into our attention-based transfer learning framework to further investigate knowledge-infused learning (Kursuncu et al., 2019).

REFERENCES
[1] Gennady Andrienko, Natalia Andrienko, Harald Bosch, Thomas Ertl, Georg Fuchs, Piotr Jankowski, and Dennis Thom. 2013. Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science & Engineering 15, 3 (2013), 72-82.
[2] Delroy Cameron, Gary A Smith, Raminta Daniulaityte, Amit P Sheth, Drashti Dave, Lu Chen, Gaurish Anand, Robert Carlson, Kera Z Watkins, and Russel Falck. 2013. PREDOSE: a semantic web platform for drug abuse epidemiology using social media. Journal of Biomedical Informatics 46, 6 (2013), 985-997.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Dana Rose Garfin, Roxane Cohen Silver, and E Alison Holman. 2020. The novel coronavirus (COVID-2019) outbreak: Amplification of public health consequences by media exposure. Health Psychology (2020).
[5] Manas Gaur, Ugur Kursuncu, Amanuel Alambo, Amit Sheth, Raminta Daniulaityte, Krishnaprasad Thirunarayan, and Jyotishman Pathak. 2018. "Let Me Tell You About Your Mental Health!" Contextualized Classification of Reddit Posts to DSM-5 for Web-based Intervention. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 753-762.
[6] George Gkotsis, Anika Oellrich, Sumithra Velupillai, Maria Liakata, Tim JP Hubbard, Richard JB Dobson, and Rina Dutta. 2017. Characterisation of mental health conditions in social media using Informed Deep Learning. Scientific Reports 7 (2017), 45141.
[7] Benjamin Harbelot, Helbert Arenas, and Christophe Cruz. 2015. LC3: A spatio-temporal and semantic model for knowledge discovery from geospatial datasets. Journal of Web Semantics 35 (2015), 3-24.
[8] Emily A Holmes, Rory C O'Connor, V Hugh Perry, Irene Tracey, Simon Wessely, Louise Arseneault, Clive Ballard, Helen Christensen, Roxane Cohen Silver, Ian Everall, et al. 2020. Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry (2020).
[9] Ugur Kursuncu, Manas Gaur, and Amit Sheth. 2019. Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning. arXiv preprint arXiv:1912.00512 (2019).
[10] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167-195.
[11] Meenakshi Nagarajan, Karthik Gomadam, Amit P Sheth, Ajith Ranabahu, Raghava Mutharaju, and Ashutosh Jadhav. 2009. Spatio-temporal-thematic analysis of citizen sensor data: Challenges and experiences. In International Conference on Web Information Systems Engineering. Springer, 539-553.
[12] Jianyin Qiu, Bin Shen, Min Zhao, Zhen Wang, Bin Xie, and Yifeng Xu. 2020. A nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic: implications and policy recommendations. General Psychiatry 33, 2 (2020).
[13] Amit Sheth and Pavan Kapanipathi. 2016. Semantic filtering for social data. IEEE Internet Computing 20, 4 (2016), 74-78.
[14] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2576-2584.
[15] Amir Hossein Yazdavar, Hussein S Al-Olimat, Monireh Ebrahimi, Goonmeet Bajaj, Tanvi Banerjee, Krishnaprasad Thirunarayan, Jyotishman Pathak, and Amit Sheth. 2017. Semi-supervised approach to monitoring clinical depressive symptoms in social media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 1191-1198.
Cost Aware Feature Elicitation

Srijita Das (The University of Texas at Dallas; Srijita.Das@utdallas.edu)
Rishabh Iyer (The University of Texas at Dallas; Rishabh.Iyer@utdallas.edu)
Sriraam Natarajan (The University of Texas at Dallas; Sriraam.Natarajan@utdallas.edu)
ABSTRACT
Motivated by clinical tasks where acquiring certain features, such as fMRI or blood tests, can be expensive, we address the problem of test-time elicitation of features. We formulate the problem of cost-aware feature elicitation as an optimization problem with a trade-off between performance and feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the efficacy and effectiveness of our proposed approach in minimizing costs and maximizing performance.

CCS CONCEPTS
• Supervised learning → Budgeted learning; Feature selection; • Applications → Healthcare.

KEYWORDS
cost sensitive learning, supervised learning, classification

ACM Reference Format:
Srijita Das, Rishabh Iyer, and Sriraam Natarajan. 2020. Cost Aware Feature Elicitation. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
In a supervised classification setting, every instance has a fixed feature vector, and a discriminative function is learnt on this fixed-length feature vector and its corresponding class variable. However, many practical problems, such as healthcare, network domains, and designing survey questionnaires [19, 20], have an associated feature acquisition cost. In such domains, there is a cost budget, and acquiring all the features of an instance can be very costly. As a result, many cost-sensitive classifier models [2, 8, 24] have been proposed in the literature to incorporate the cost of acquisition into the model objective during training and prediction.

Our problem is motivated by such a cost-aware setting, where the assumption is that prediction-time features have an acquisition cost and adhere to a strict budget. Consider a patient visiting a doctor for potential diagnosis of a disease. For such a patient, information like age, gender, ethnicity, and other demographic features is easily available at zero cost. However, the various lab tests that the patient needs to undergo incur cost. A training model should therefore be able to identify the most relevant (i.e., most informative, yet least costly) lab tests required for each specific patient. The intuition of this work is that different patients, depending on their history, ethnicity, age, and gender, may require different tests for reasonably accurate prediction. We build on the intuition that, given certain observed features such as one's demographic details, the most important features for a patient depend on the important features for similar patients. Based on this intuition, we find similar data points in the observed feature space and identify the important feature subsets of these similar instances by employing a greedy information-theoretic feature selector objective.

Our contributions in this work are as follows: (1) we formalize the problem as a joint optimization problem of selecting the best feature subset for similar data points and optimizing the loss function using the important feature subsets; (2) we account for acquisition cost in both the feature selector objective and the classifier objective to balance the trade-off between acquisition cost and model performance; (3) we empirically demonstrate the effectiveness of the proposed approach on three real-world medical data sets.

2 RELATED WORK
The related work on cost-sensitive feature selection and learning can be categorized into the following four broad approaches.
Tree based budgeted learning: Prediction-time elicitation of features under a cost budget has been widely studied in the literature. A lot of work has been done in tree-based models [5, 16, 17, 26-28] by adding a cost term to the tree objective function in either decision trees or ensemble methods like gradient boosted trees. All these methods aim to build an adaptive and complex decision boundary by considering the trade-off between performance and test-time feature acquisition cost. While we are similar in motivation to these approaches, our methodology differs in that we do not consider tree-based models. Instead, our approach aims to find local feature subsets using an information-theoretic feature selector for different clusters of training instances built in a lower-dimensional space.
Adaptive classification and dynamic feature discovery: Our work also draws inspiration from Nan et al.'s work [15], where they learn a high-performance costly model and approximate the model's performance adaptively by building a low-cost model and a gating function which decides which model to use for specific training instances. This adaptive switching between low- and high-cost models takes care of the trade-off between cost and performance. Our method is different from theirs because we do not maintain a high-cost model, which is costly to build and difficult to decide. We refine the parameters of a single low-cost model by incorporating a cost penalty in the feature selector and model objectives. Our work is also along the direction of Nan et al.'s work [18], where they select varying feature subsets for test instances using neighbourhood information of the training data. While calculating the neighborhood information from training data is similar to building clusters in our approach, the training neighborhood in our method is on just the observed feature space. Moreover, we incorporate the neighbourhood information in the training algorithm, whereas Nan et al.'s work is a prediction-time algorithm.
Ma et al. [10] also address this problem of dynamic discovery of features, based on generative modelling and Bayesian experimental design.
Feature elicitation using Reinforcement learning: There is another line of work in the sequential decision-making literature [4, 9, 22] that models the test-time elicitation of features by learning the optimal policy of test feature acquisition. In this direction, our work aligns with the work of Shim et al. [25], where they jointly train a classifier and an RL agent together. Their classifier objective function is similar to our method with a cost penalty; however, they use a deep RL agent to figure out the policy. We, on the other hand, use a localised feature selector to find the important feature subsets for the underlying training clusters in the observed feature space.
Active Feature Acquisition: Our problem set-up is also inspired by work on active feature acquisition [13, 14, 19, 23, 29], where certain feature subsets are observed and the rest are acquired at a cost. While all the above-mentioned works follow this problem set-up during training time and typically use active learning to seek informative instances at every iteration, we use this particular setting for test instances. Unlike their work, all the training instances in our work are fully observed, and the assumption is that the feature acquisition cost has already been paid during training. Also, we address a supervised classification problem instead of an active learning set-up. Our problem set-up is similar to Kanani et al. [6], as they also have partial test instances; however, their problem is that of instance acquisition, where the acquired feature subset is fixed. Our method aims at discovering variable-length feature subsets for the various underlying clusters.
Our contributions: Although the problem of prediction-time feature elicitation has been explored in the literature from various directions and with various assumptions, we come up with an intuitive solution to this problem and formulate it in a two-step optimization framework. We incorporate acquisition cost in both the feature selector and model objectives to balance the performance and cost trade-off. The problem set-up is naturally applicable in real-world health care and other domains, where the knowledge of the observed features also needs to be accounted for while selecting the elicitable features.

3 COST AWARE FEATURE ELICITATION

3.1 Problem setup
Given: A dataset {(x_1, y_1), ..., (x_n, y_n)} with each x_i ∈ R^d as the feature set. Each feature has an associated cost r_i.
Objective: Learn a discriminative model that is aware of the feature costs and can balance the trade-off between feature acquisition cost and model performance.
We make an additional assumption that there is a subset of features which have zero cost. These could be, for example, demographic information (e.g., age, gender, etc.) in a medical domain, which is easily available or less cumbersome to obtain compared to other features. In other words, we can partition the feature set F = O ∪ E, where O are the zero-cost observed features and E are the elicitable features, which can be acquired at a cost. We also assume that the training data is completely available with all features (i.e., the cost for all the features has already been paid). The goal is to use these observed features to find similar instances in the training set and to identify the important feature subsets for each of these clusters, based on a feature selector objective function which balances the trade-off between choosing the important features and the cost at which these features are acquired.
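A brief sketch of this setup: the feature set is partitioned into zero-cost observed columns O and costly elicitable columns E, and training instances are grouped on O alone (K-means here, matching the Cluster step in Algorithm 1 below). The matrix, column indices, cluster count, and cost values are illustrative assumptions.

```python
# A minimal sketch of the problem setup: O/E feature partition plus
# clustering of training instances on the observed features only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))          # placeholder training matrix, d = 20
observed_idx = list(range(5))            # O: e.g. demographic columns, zero cost
elicitable_idx = list(range(5, 20))      # E: lab tests etc., acquired at a cost
M = np.ones(20)                          # per-feature acquisition costs r_i ...
M[observed_idx] = 0.0                    # ... with zero cost for the observed set

kmeans = KMeans(n_clusters=5, random_state=0).fit(X_train[:, observed_idx])
cluster_of = kmeans.labels_              # cluster membership used downstream
```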
3.2 Proposed solution
As a first step, we cluster the training instances based on just the observed zero-cost feature set O. The intuition is that instances with similar features will also have similar characteristics in terms of which elicitable features to order. For example, in a medical application, whether to request a blood test or a CT scan will depend on factors such as age, gender, and ethnicity, and on whether patients with similar demographic features had requested these tests. Also, since the feature set O comes at zero cost, we assume that it is observed for unseen test instances.

Figure 1: Optimization framework for the proposed problem

We propose a model that consists of a parameterized feature selector module F(X, E_{c_i}, α), which takes a set of input instances E_{c_i} belonging to a cluster c_i formed on the feature set O and produces a subset X of the most important features for the classification task. The feature selection model is based on an information-theoretic objective function and is augmented with the feature cost to account for the trade-off between model performance and acquisition cost at test time. The output feature subset from the feature selector module is used to update the parameters of the classifier. The optimization framework is shown in Figure 1.

Information theoretic Feature selector model: The feature selector module selects the best subset of features for each cluster of training data based on an information-theoretic objective score. At test time, we do not know the elicitable feature subset E (since the goal of feature selection is, in the first place, to find the truly necessary features for learning); hence, we propose to use the set of instances in the training data closest to the current instance. Since we assume that the training data has already been elicited, all features are observed in the training data. We compute this distance based on just the observed feature set O. We cluster the training data based on the observed features into m clusters c_1, c_2, ..., c_m. Next, we use the Minimum-Redundancy-Maximum-Relevance (MRMR) feature selection paradigm [1, 21]. We denote [α¹_{c_i}, α²_{c_i}, α³_{c_i}, α⁴_{c_i}] as the parameters of a particular cluster c_i. The feature selection module is a function of the parameters of the cluster to which a set of instances belong and is defined as:
$$F(X, E_{c_i}, \alpha_{c_i}) = \underbrace{\alpha^{1}_{c_i} \sum_{E_p \in X} I(E_p; Y)}_{\text{max. relevance}} \;-\; \underbrace{\sum_{E_p \in X} \Big( \alpha^{2}_{c_i} \sum_{E_j \in X} I(E_j; E_p) \;-\; \alpha^{3}_{c_i} \sum_{E_j \in X} I(E_p; E_j \mid Y) \Big)}_{\text{min. redundancy}} \;-\; \underbrace{\alpha^{4}_{c_i} \sum_{E_p \in X} c(E_p)}_{\text{cost penalty}} \qquad (1)$$

where I(E_p; Y) is the mutual information between the random variable E_p (feature) and Y (target). In the above equation, the feature subset X is grown greedily, maximizing the objective function. In equation 1, E_p denotes a single feature from the elicitable set E that is considered for evaluation given the subset X grown so far. The first term is the mutual information between each feature and the class variable Y; in a discriminative task, this value should be maximized. The second term is the pairwise mutual information between each feature to be evaluated and the features already added to the feature subset X; this value needs to be minimized to select informative features. The third term is called the conditional redundancy [1], and this term needs to be maximized. The last term adds a penalty for the cost of every feature and ensures the right trade-off between cost, relevance, and redundancy. In this work, we do not learn the parameters α_{c_i} for each cluster; instead, we fix these parameters to 1. We leave the learning of these parameters to future work.
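A sketch of the greedy selection implied by Equation (1), with all α weights fixed to 1 as stated above. The mutual-information tables (mi_y, mi_pair, mi_cond) are assumed to be precomputed by any estimator (e.g., sklearn's mutual_info_classif for I(E_p; Y)); the stopping rule mirrors the budget/negative-score termination described later in Section 3.3. All names are ours, not the authors' implementation.

```python
# A minimal sketch of greedy, cost-penalized MRMR selection (Eq. 1),
# with alpha weights fixed to 1. Inputs are plain dicts:
#   mi_y[p]        ~ I(E_p; Y)
#   mi_pair[p][j]  ~ I(E_j; E_p)
#   mi_cond[p][j]  ~ I(E_p; E_j | Y)   (stubbed per-class estimate)
#   cost[p]        ~ acquisition cost of feature p
def feature_selector(candidates, observed, budget, cost, mi_y, mi_pair, mi_cond):
    X, spent = list(observed), 0.0      # observed features seed the subset
    while True:
        best, best_gain = None, 0.0
        for p in candidates:
            if p in X or spent + cost[p] > budget:
                continue
            gain = (mi_y[p]                              # max. relevance
                    - sum(mi_pair[p][j] for j in X)      # min. redundancy
                    + sum(mi_cond[p][j] for j in X)      # conditional redundancy
                    - cost[p])                           # cost penalty
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:        # budget exhausted or score non-positive
            return X
        X.append(best)
        spent += cost[best]
```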
In the problem setup, since the zero-cost feature subset is always present, we always consider the observed feature subset O in addition to the most important feature subset returned by the feature selector objective. We also account for the knowledge of the observed features while growing the informative feature subset through greedy optimization. Specifically, while calculating the pairwise mutual information between the features and the conditional redundancy term (the second and third terms of equation 1), we also evaluate the mutual information of the features with these observed features. It is to be noted that in cases where the observed features are not discriminative enough of the target, the feature selector module ensures that the elicitable features with maximum relevance to the target variable are picked.

Optimization Problem: The cost-aware feature selector F(X, E_{c_i}, α), for a given set of instances E_{c_i} belonging to a specific cluster c_i, solves the following optimization problem:

$$X^{i}_{\alpha} = \operatorname{argmax}_{X \subseteq E} F(X, E_{c_i}, \alpha) \qquad (2)$$

For a given instance (x, y), we denote by L(x, y, X, θ) the loss function using a subset X of the features, as obtained from the feature selector optimization problem. The optimization problem for learning the parameters of a classifier can then be posed as:

$$\min_{\theta} \sum_{i=1}^{n} L(x_i, y_i, X^{i}_{\alpha}, \theta) + \lambda_1 c(X^{i}_{\alpha}) + \lambda_2 \lVert \theta \rVert^2 \qquad (3)$$

where λ1 and λ2 are hyper-parameters. In the above equation, θ is the parameter of the model and can be updated by standard gradient-based techniques. This loss function takes into account the important feature subset for each cluster and updates the parameters accordingly. The classifier objective also contains a cost term, denoted c(X^i_α), to account for the cost of the selected feature subset. Under a hard budget on the elicited features, the cost component in the model objective can be considered; under a cost budget, this component can be ignored, because the elicited feature subset adheres to a fixed cost and this term is hence constant.

3.3 Algorithm
We present the algorithm for Cost Aware Feature Elicitation (CAFE) in Algorithm 1. CAFE takes as input a set of training examples E, the zero-cost feature set O, the elicitable feature subset E, a cost vector M ∈ R^d, and a budget B. Each element of the training set E consists of a tuple (x, y), where x ∈ R^d is the feature vector and y is the label.

The training instances E are clustered based on just the observed feature set O using K-means clustering (Cluster). For every cluster c_i, the training instances belonging to the cluster are assigned to the set E_{c_i} and passed to the feature selector module (lines 6-8). The FeatureSelector function takes E_{c_i}, the parameters α, the feature subsets O and E, the cost vector M, and a predefined budget B as input, and returns the most important feature subset X^α_{c_i} corresponding to cluster c_i. A greedy optimization technique is used to grow the feature subset X of every cluster based on the feature selector objective function defined in Equation 1. FeatureSelector terminates once the budget B is exhausted or the mutual information score becomes negative. Once the important feature subsets are obtained for all |C| clusters, the model objective function is optimized as in Equation 3 for all the training instances, using the important feature subsets of the clusters to which the training instances belong (lines 12-18). All the remaining features are imputed with 0 or any other imputation model before training. The final training model G(E_{O∪X_α}, α, θ) is a unified model used to make predictions for a test instance consisting of just the observed feature subset O.
  KiML’20, August 24, 2020, San Diego, California, USA,
                                                                                                              Srijita Das, Rishabh Iyer, and Sriraam Natarajan
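To make the greedy selector concrete, here is a minimal sketch of the budget-constrained subset growth described above. This is our illustration, not the authors' released code: the plug-in mutual-information estimator, the helper names (mi, feature_selector) and the discretized-feature assumption are all ours.

    import numpy as np

    def mi(a, b):
        # Plug-in mutual information I(a; b) for discrete-valued vectors (nats).
        n = len(a)
        joint = {}
        for u, v in zip(a, b):
            joint[(u, v)] = joint.get((u, v), 0) + 1
        pa = {u: np.mean(a == u) for u in set(a)}
        pb = {v: np.mean(b == v) for v in set(b)}
        return sum((c / n) * np.log((c / n) / (pa[u] * pb[v]))
                   for (u, v), c in joint.items())

    def feature_selector(E, y, observed, elicitable, cost, alpha=1.0, budget=5):
        # Greedy growth of X (Equation 2): relevance minus redundancy plus
        # conditional redundancy minus the cost penalty, scored against the
        # already-chosen AND the observed features; stops once the budget is
        # exhausted or no candidate has a positive score.
        chosen = []
        while len(chosen) < budget:
            best, best_gain = None, 0.0
            for f in elicitable:
                if f in chosen:
                    continue
                context = chosen + list(observed)
                relevance = mi(E[:, f], y)
                redundancy = sum(mi(E[:, f], E[:, g]) for g in context)
                cond_red = sum(sum(np.mean(y == c) * mi(E[y == c, f], E[y == c, g])
                                   for c in set(y))
                               for g in context)  # class-conditional I(f; g | y)
                gain = relevance - redundancy + cond_red - alpha * cost[f]
                if gain > best_gain:
                    best, best_gain = f, gain
            if best is None:
                break
            chosen.append(best)
        return chosen

Here budget counts features, matching the hard-budget experiments below; in the cost-sensitive variant it would instead track the cumulative cost of the chosen features.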


The FeatureSelector routine (line 8 of Algorithm 1) returns the most important feature subset X_α^{c_i} corresponding to each cluster c_i. A greedy optimization technique is used to grow the feature subset X of every cluster based on the feature selector objective function defined in Equation 1. The FeatureSelector terminates once the budget B is exhausted or the mutual information score becomes negative. Once the important feature subsets are obtained for all |C| clusters, the model objective function in Equation 3 is optimized over all training instances, using the important feature subset of the cluster to which each training instance belongs (lines 12-18). All the remaining features are imputed, using either 0 or any other imputation model, before training. The final training model G(E_{O∪X_α}, α, θ) is a unified model used to make predictions for a test instance consisting of just the observed feature subset O.

Algorithm 1 Cost Aware Feature Elicitation
  1: function CAFE(E, O, E, M, B)
  2:   E = E_{O∪E}    ⊲ E consists of the 0-cost features O and the costly features E
  3:   C = Cluster(E_O)    ⊲ Clustering based on the observed features O
  4:   X = {∅}    ⊲ Stores the best feature subset of each cluster
  5:   for i = 1 to |C| do    ⊲ Repeat for every cluster
  6:     E_{c_i} = GetClusterMember(E, C, i)
  7:        ⊲ Get the data points belonging to cluster c_i
  8:     X_α^{c_i} = FeatureSelector(E_{c_i}, α, O, E, M, B)
  9:        ⊲ Parameterized feature selector for each cluster
 10:     X = X ∪ {X_α^{c_i} ∪ O}
 11:   end for
 12:   for i = 1 to |C| do    ⊲ Repeat for every cluster
 13:     X_α^{c_i} = GetFeatureSubset(X, i)
 14:        ⊲ Get the feature subset for cluster c_i
 15:     for j = 1 to |E_{c_i}| do    ⊲ Repeat for every data point in cluster c_i
 16:       Optimize J(x_j, y_j, X_α^{c_i}, θ, M)
 17:          ⊲ Optimize the objective function in Equation 3
 18:       Update θ    ⊲ Update the model parameter θ
 19:     end for
 20:   end for
       return G(E_{O∪X_α}, α, θ)    ⊲ G is the training model built on E
 21: end function

Figure 2: Recall vs. number of clusters on the Rare disease data set for CAFE-I.

4 EMPIRICAL EVALUATION
We experimented with 3 real-world medical data sets. The intuition behind CAFE is most natural in medical domains, hence our choice of data sets; however, the idea applies to other domains ranging from logistics to resource allocation. Table 2 summarizes the data sets used in our experiments; their details are as follows.

1. Parkinson's disease prediction: The Parkinson's Progression Marker Initiative (PPMI) [12] is an observational study whose aim is to identify Parkinson's disease progression from various types of features. The PPMI data set consists of features related to various motor functions as well as non-motor behavioral and psychological tests. We consider certain motor assessment features, such as rising from a chair, gait, freezing of gait, posture and postural stability, as observed features, and all remaining features as elicitable features which must be acquired at a cost.

2. Alzheimer's disease prediction: The Alzheimer's Disease NeuroInitiative (ADNI, www.loni.ucla.edu/ADNI) is a study that aims to test whether various clinical, fMRI and biomarker features can be used to predict the early onset of Alzheimer's disease. In this data set, we consider the demographics of the patients as observed, zero-cost features, and the fMRI image data and cognitive score data as unobserved, elicitable features.

3. Rare disease prediction: This data set is created from survey questionnaires [11], and the task is to predict whether a person has a rare disease or not. The demographic features are observed, while other sensitive survey questions regarding technology use, health and disease-related meta-information are considered elicitable.

Evaluation Methodology: All the data sets were partitioned into an 80:20 train-test split. Hyperparameters such as the number of clusters on the observed features were picked by 5-fold cross-validation on each data set. The optimal numbers of clusters were 6 for ADNI, 9 for the Rare disease data set and 7 for the PPMI data set. For the results reported in Table 1, we considered a hard budget on the number of elicitable features and set it to half of the total number of features in the respective data set. We use K-means as the underlying clustering algorithm. For all the reported results, we use an underlying Support Vector Machine [3] classifier with a radial basis kernel function. Since all the data sets are highly imbalanced, we report recall, F1, AUC-ROC and AUC-PR. For the feature selector module, we used the existing implementation of Li et al. [7] and built upon it. We consider two variants of CAFE: (1) CAFE, in which we replace the missing and unimportant features of every cluster with 0 and then update the classifier parameters, and (2) CAFE-I, in which we replace the missing and unimportant features using an imputation model learnt from the already acquired feature values of other data points. A simple imputation model is used: missing features are replaced with the mode for categorical features and the mean for numeric features.

Baselines: We consider 3 baselines for evaluating CAFE and CAFE-I: (1) using the observed, zero-cost features to update the training model, denoted OBS; (2) using a random subset of a fixed number of elicitable features together with all the observed features, denoted RANDOM (results for this baseline are averaged over 10 runs); and (3) using the information-theoretic feature selector score defined in Equation 1 to select the 'k' best elicitable features on the entire data, without any cluster consideration, along with the observed features, denoted KBEST. We keep the value of 'k' the same as that used by CAFE. Although some existing methods could be potential baselines, none of them matches the exact setting of our problem, hence we do not compare against them.

Results: We aim to answer the following questions:
  Q1: How do CAFE and CAFE-I with a hard budget on features compare against the standard baselines?
  Q2: How does the cost-sensitive version of CAFE and CAFE-I fare against the cost-sensitive baseline KBEST?

The results reported in Table 1 suggest that both CAFE and CAFE-I significantly outperform the other baselines on almost all metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and CAFE-I outperform the other baselines on the clinically relevant recall metric, while KBEST performs best on the other metrics. The reason is that the elicitable features in ADNI are image features, which we discretize in order to calculate the information gain for the feature selector module; the granular feature information lost in this discretization accounts for the drop in performance. For the experiments in Table 1, we keep the budget at approximately half of the total number of features for all the methods.
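For concreteness, the following is a minimal end-to-end sketch of this training procedure (the CAFE-I variant under a hard budget), reusing the illustrative feature_selector above. The scikit-learn components and the mean-based imputation are stand-ins chosen to mirror the description, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def train_cafe(E, y, observed, elicitable, cost, n_clusters, budget):
        # Algorithm 1: cluster on the observed features, select a per-cluster
        # feature subset, impute the rest, and fit one unified model.
        clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(E[:, observed])

        subsets = {c: feature_selector(E[clusters == c], y[clusters == c],
                                       observed, elicitable, cost, budget=budget)
                   for c in range(n_clusters)}

        col_mean = E.mean(axis=0)                 # stand-in for mean/mode imputation
        X_train = np.zeros_like(E, dtype=float)
        X_train[:, observed] = E[:, observed]     # observed features kept
        for c in range(n_clusters):
            rows = np.where(clusters == c)[0]
            for f in elicitable:
                if f in subsets[c]:
                    X_train[rows, f] = E[rows, f]      # elicited features kept
                else:
                    X_train[rows, f] = col_mean[f]     # the rest imputed (CAFE-I)

        model = SVC(kernel="rbf")                 # RBF-kernel SVM, as in the paper
        model.fit(X_train, y)
        return model, subsets

At test time the same unified model is applied after eliciting the test instance's cluster subset, with the remaining features imputed in the same way.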


Data set       Algorithm   Recall          F1              AUC-ROC         AUC-PR
Rare disease   OBS         0.647           0.488           0.642           0.347
               RANDOM      0.57 ± 0.064    0.549 ± 0.059   0.693 ± 0.042   0.421 ± 0.051
               KBEST       0.47            0.457           0.628           0.349
               CAFE        0.647           0.628           0.749           0.489
               CAFE-I      0.647           0.647           0.759           0.512
PPMI           OBS         0.765           0.685           0.741           0.563
               RANDOM      0.857 ± 0.023   0.809 ± 0.015   0.85 ± 0.013    0.712 ± 0.020
               KBEST       0.828           0.807           0.846           0.716
               CAFE        0.846           0.817           0.855           0.726
               CAFE-I      0.855           0.829           0.865           0.743
ADNI           OBS         0.5             0.44            0.553           0.365
               RANDOM      0.711 ± 0.043   0.697 ± 0.082   0.767 ± 0.064   0.592 ± 0.098
               KBEST       0.73            0.745           0.806           0.646
               CAFE        0.807           0.711           0.786           0.578
               CAFE-I      0.769           0.701           0.776           0.574

Table 1: Comparison of CAFE against the baseline methods on the 3 real data sets.

Dataset        # Pos   # Neg   # Observed   # Elicitable
PPMI            554     919        5             31
ADNI             94     287        6             69
Rare Disease     87     232        6             63

Table 2: Details of the 3 real data sets used. # Pos is the number of positive examples, # Neg the number of negative examples, # Observed the number of observed features, and # Elicitable the maximum number of features that can be acquired.

On average, CAFE-I performs better than CAFE across all the data sets because its underlying imputation model treats the missing values better than replacing them all with 0. This answers Q1 affirmatively.

In Figure 3, we compare the cost versions of CAFE and CAFE-I against KBEST. The cost version takes the cost of individual features into account as a penalty in the feature selector module; hence, in this version of CAFE, a cost budget is used instead of a hard budget on the number of elicitable features. We generate the cost vector by sampling each cost component uniformly from (0,1). For PPMI and Rare disease, cost-sensitive CAFE performs consistently better than KBEST with increasing cost budget. On the PPMI data set, greedy optimization of the feature selector objective over the entire data (as done by KBEST) led to the elicitation of just 1 feature, beyond which the information gain was negative; hence KBEST's performance on PPMI remains the same across cost budgets. CAFE, on the other hand, was able to select important feature subsets for the various clusters based on the observed features related to gait and posture. On the ADNI data set, CAFE performs better than KBEST only in recall; the reason is the same as mentioned above. This answers Q2 affirmatively.

Lastly, Figure 2 shows the effect of an increasing number of clusters on the validation recall for the Rare disease data set. For small numbers of clusters the recall is very low, and it increases to an optimum at 9 clusters. This confirms that forming clusters based on the observed features helps CAFE select different feature subsets for different clusters, thus helping the learning procedure.

Figure 3: Recall (left), F1 (middle) and AUC-PR (right) for (from top to bottom) Rare disease, PPMI and ADNI. The x-axis is the cost budget used, which leads to the elicitation of different numbers of features.

5 CONCLUSION
In this paper, we pose the prediction-time feature elicitation problem as an optimization problem, employing a cluster-specific feature selector to choose the best feature subset and then optimizing the training loss. We show the effectiveness of our approach on real data sets where the problem setup is intuitive. Future work includes learning the parameters of the feature selector module, jointly optimizing the feature selector and model parameters for a more robust framework, and adding more constraints to the optimization.

ACKNOWLEDGEMENTS
SN & SD gratefully acknowledge the support of NSF grant IIS-1836565. Any opinions, findings and conclusions or recommendations are those of the authors and do not necessarily reflect the view of the US government.

REFERENCES
[1] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR (2012).
[2] Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X Ling. 2004. Test-cost sensitive naive bayes classification. In ICDM.
[3] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[4] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datum-wise classification: a sequential approach to sparsity. In ECML PKDD. 375–390.
[5] Tianshi Gao and Daphne Koller. 2011. Active classification based on value of classifier. In NIPS.
[6] P. Kanani and P. Melville. 2008. Prediction-time active feature-value acquisition for cost-effective customer targeting. Workshop on Cost Sensitive Learning at NIPS (2008).
[7] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. 2018. Feature selection: A data perspective. ACM Computing Surveys (CSUR) (2018).
[8] Charles X Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. 2004. Decision trees with minimal costs. In ICML.
[9] D. J. Lizotte, O. Madani, and R. Greiner. 2003. Budgeted learning of Naive-Bayes classifiers. In UAI. 378–385.
[10] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. 2019. EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE. In ICML.
[11] H. MacLeod, S. Yang, et al. 2016. Identifying rare diseases from behavioural data: a machine learning approach. In CHASE. 130–139.
[12] K. Marek, D. Jennings, et al. 2011. The Parkinson Progression Marker Initiative (PPMI). Prog Neurobiol 95, 4 (2011), 629–635.
[13] P. Melville, M. Saar-Tsechansky, et al. 2004. Active feature-value acquisition for classifier induction. In ICDM. 483–486.
[14] P. Melville, M. Saar-Tsechansky, et al. 2005. An expected utility approach to active feature-value acquisition. In ICDM. 745–748.
[15] Feng Nan and Venkatesh Saligrama. 2017. Adaptive classification for prediction under a budget. In NIPS.
[16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2015. Feature-budgeted random forest. In ICML.
[17] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2016. Pruning random forests for prediction on a budget. In NIPS.
[18] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2014. Fast margin-based cost-sensitive classification. In ICASSP.
[19] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag Radivojac. 2018. On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach. In IJCAI.
[20] S. Natarajan, A. Prabhakar, et al. 2017. Boosting for postpartum depression prediction. In CHASE. 232–240.
[21] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226–1238.
[22] Thomas Rückstieß, Christian Osendorfer, and Patrick van der Smagt. 2011. Sequential feature selection for classification. In Australasian Joint Conference on Artificial Intelligence. Springer, 132–141.
[23] M. Saar-Tsechansky, P. Melville, and F. Provost. 2009. Active feature-value acquisition. Manag Sci 55, 4 (2009).
[24] Victor S Sheng and Charles X Ling. 2006. Feature value acquisition in testing: a sequential batch test algorithm. In ICML.
[25] Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint active feature acquisition and classification with variable-size set encoding. In NIPS.
[26] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learning by directed acyclic graph for resource constrained prediction. In NIPS.
[27] Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. 2013. Cost-sensitive tree of classifiers. In ICML.
[28] Zhixiang Xu, Kilian Q Weinberger, and Olivier Chapelle. 2012. The greedy miser: learning under test-time budgets. In ICML.
[29] Z. Zheng and B. Padmanabhan. 2002. On active learning for data acquisition. In ICDM. 562–569.
A New Delay Differential Equation Model for COVID-19: Retarded Logistic Equation

B Shayak† (Mechanical and Aerospace Engg, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)
Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
Manas Gaur (AI Institute, University of South Carolina, USA; mgaur@email.sc.edu)




ABSTRACT
In this work we give a delay differential equation, the retarded logistic equation, as a mathematical model for the global transmission of COVID-19. This model accounts for asymptomatic carriers and pre-symptomatic or latent transmission, as well as contact tracing and quarantine of suspected cases. We find that the equation admits varied classes of solutions, including self-burnout, progression to herd immunity, and multiple states in between. We use the term "partial herd immunity" to refer to these states, where the disease ends at an infection fraction which is not negligible but is significantly lower than the conventional herd immunity threshold. We believe that the spread of COVID-19 in every localized area can be explained by one of our solution classes.

CCS CONCEPTS
• Applied computing – mathematics and statistics

KEYWORDS
Retarded logistic equation, Asymptomatic carriers, Latent transmission, Contact tracing, Reproduction number calculation, Partial herd immunity

1 Introduction
Three kinds of models to study COVID-19 are currently in vogue – lumped parameter or compartmental models (ordinary differential equations), agent-based models, and stochastic differential equation models. The first option affords maximum conceptual clarity at the expense of some simplifying assumptions (homogeneous mixing etc.). The second option affords maximum potential versatility at the cost of huge computational complexity and variability in the network structure. The third option combines features of the previous two – whether the features being synergized are the positive or the negative ones depends to a large extent on the modeler.

In this work we use a delay differential equation (DDE) to propose a simple, single-variable, lumped parameter model for the spread of Coronavirus. Jahedi and Yorke [1] make a strong case for simpler models relative to complex and elaborate ones. In the literature, DDEs have been used for modeling COVID-19, for example in Refs. [2]–[4]. These authors however ignore features such as contact tracing, asymptomatic carriers and latent transmission; our results too have a richer structure.

2 Derivation of the model
We measure time t in days and use as our basic variable y(t), the cumulative number of corona cases, including active cases, recovered cases and deaths, in the region of interest. The following "word-equation" summarizes the approach:

    [Rate of emergence of new cases] = [Interaction rate of each existing case] × [Probability of transmission] × [Number of existing cases]    (0)

The left hand side (LHS) here is just dy/dt, whereas the right hand side (RHS) needs a detailed derivation.

Equation (0) assumes that the disease is transmitted from infected to susceptible people via interaction, and not via airborne transmission. Due to asymptomatic and pre-symptomatic carriers, there are always cases moving about in society who are oblivious to their infectivity. Each such case interacts with other people at a different rate. For example, a working-from-home professor might venture outside once every three days and interact with one person on each trip, while a grocer might go to work and interact with 10 customers every day. The professor has an interaction rate of 1/3 persons/day while the grocer has an interaction rate of 10 persons/day.

†Presenting author, corresponding author. ORCID: 0000-0003-2502-2268

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, San Diego, California, USA
© 2020, Copyright held by the author(s).

For a compartmental model, one must average over the professor, the grocer and all the other un-quarantined cases to generate an effective per-case interaction rate q0.

Every interaction of course does not result in a transmission – there is a probability strictly less than unity that the virus jumps from the infected person to the person with whom s/he is interacting. This probability has two components. The first is that the healthy person must be susceptible to begin with. While we ignore intrinsic insusceptibles, there will be people who have recovered from the disease and are therefore not susceptible again. In this Article, we assume that one bout of infection brings permanent immunity. The assumption is valid so long as the immunity period exceeds the total epidemic duration. To date, there is little credible evidence of re-infection [5]–[7]; contrarily, a very recent and thorough study [8], based on the monitoring of a huge patient cohort, has found significant evidence of long-lasting and effective antibodies. If N is the initial number of susceptible people (recall that y is the case count), then the probability that a random person is a recovered case is approximately y/N, and the probability that s/he is susceptible is (approximately) 1−y/N. This expression is approximate because the true number of recovered cases at any time is less than y; the error however is small, since the recovery period is much shorter than the overall course of the epidemic. Note that 1−y/N is a logistic term, and a herd immunity effect.

Given susceptibility, the next probability is that the virus actually does jump from the un-quarantined case to the susceptible person. This probability depends on the level of precaution, such as face covering or masks, handwashing and disinfection, adopted by the case as well as by the susceptible person. For a compartmental model, the probability must be averaged over all the un-quarantined cases. If this average probability is P0, then q0(1−y/N)P0 gives the per-case spreading rate. Since q0 and P0 are both dependent on public health measures, and are both difficult to measure independently, we club the two together into a single parameter which we call m0.

So far we have accounted for the rate at which each case spreads the disease; now we have to count the number of cases out of quarantine. Let us start with an asymptomatic carrier, who remains in open society throughout. S/he typically transmits the disease for 7 days, which is called the infection period. Then, new healthy people can only be infected by those asymptomatic cases who have fallen sick within the last 7 days, and not by those who fell sick earlier. The number of such people is the number of asymptomatic sick people today minus that number 7 days earlier. Mathematically, let μ1 (between 0 and 1) denote the fraction of asymptomatic carriers and τ1 the asymptomatic infection period. Then the number of asymptomatic transmitters today is μ1(y(t)−y(t−τ1)). Here we can see the emergence of the delay term.

The remaining fraction 1−μ1 of cases are symptomatic. Let τ2 be the latency period during which these cases remain transmissible prior to displaying symptoms; it is assumed that they isolate themselves thereafter. We also assume that the incubation period is equal to the latency period. Finally, the contact tracing drive conducted by the public health department is taken into account. We assume that this drive is instantaneous and proceeds in the forward direction, starting from freshly arriving symptomatic cases. The contact trace captures patients who were exposed to the new case τ2 days ago, as well as patients who were exposed immediately before the new case manifested symptoms. The average duration for which these secondary patients have remained at large is τ2/2, be they symptomatic or asymptomatic. The assumption of instantaneous contact tracing, which decreases the average time that contact-traced cases spend out of quarantine, opposes the error arising from the assumption of a zero non-transmissible incubation period, which increases the average time for which contact-traced cases transmit before quarantine. These two effects are assumed here to cancel. Let μ3 (between 0 and 1) denote the fraction of all cases who escape the contact tracing drives – the complementary fraction 1−μ3 get caught. Thus, we have three classes of un-quarantined cases: (a) the fraction 1−μ3 are contact-traced cases who remain in society for a time τ2/2, (b) the fraction μ3(1−μ1) are untraced symptomatic cases who go into isolation only after a time τ2, and (c) the fraction μ3μ1 are undetected asymptomatic cases who transmit for the entire infection period τ1. Arguments similar to those of the previous paragraph yield the total number of un-quarantined cases as

    n = (1 − μ3)[y(t) − y(t − τ2/2)] + (1 − μ1)μ3 [y(t) − y(t − τ2)] + μ1μ3 [y(t) − y(t − τ1)].    (1)

The preceding arguments now yield the mathematical form of (0) as

    dy/dt = m0 (1 − y/N) [ y(t) − (1 − μ3) y(t − τ2/2) − (1 − μ1)μ3 y(t − τ2) − μ1μ3 y(t − τ1) ],    (2)

which is the retarded logistic equation.

3 Solutions of the model
Due to the complexity of equation (2), an analytical solution using perturbation theory etc. has not been attempted in this case. Instead we have used numerical integration to obtain the solutions of (2). Before giving the solutions, however, we present the calculation of the reproduction number R. To find R at any stage of evolution of the disease, we first treat y in the logistic term as constant and then carry out the steps described in Ref. [9]. This yields the expression

    R = m0 (1 − y/N) [ ((1 + μ3 − 2μ1μ3)/2) τ2 + μ1μ3 τ1 ].    (3)

The ease of calculating R, compared with ordinary differential equation based models [10], is noteworthy.

Solution classes of the logistic DDE (2) are now demonstrated. The numerical integration routine used is second-order Runge-Kutta with a time step of 1/1000 day. As the testbed for the simulations, we consider a Notional City having N=300000, μ1=0.8 (the maximum value as per our knowledge [11]–[13]), τ1=7 days and τ2=3 days [14]. The initial condition needs to be a function spanning the maximum delay involved in the problem, which is seven days; we take this function to be zero cases to start with, followed by a constant increase of 100 cases/day for a week.
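As an unofficial illustration (not the authors' code), the following sketch evaluates R from equation (3) and integrates the retarded logistic equation (2) with second-order Runge-Kutta at a 1/1000-day step, using the stated Notional City parameters and the one-week ramp initial history. The function names are ours; the printed R0 values should match the cities discussed next.

    import numpy as np

    # Notional City parameters from the text
    N, MU1, TAU1, TAU2 = 300_000, 0.8, 7.0, 3.0
    DT = 1.0 / 1000.0                       # time step: 1/1000 day

    def R(y, m0, mu3):
        # Reproduction number, equation (3).
        return m0 * (1 - y / N) * ((1 + mu3 - 2 * MU1 * mu3) / 2 * TAU2
                                   + MU1 * mu3 * TAU1)

    def rhs(y_at, t, m0, mu3):
        # Right-hand side of equation (2); y_at(s) returns the case count at time s.
        y = y_at(t)
        return m0 * (1 - y / N) * (y
                                   - (1 - mu3) * y_at(t - TAU2 / 2)
                                   - (1 - MU1) * mu3 * y_at(t - TAU2)
                                   - MU1 * mu3 * y_at(t - TAU1))

    def integrate(m0, mu3, days=300):
        # RK2 (Heun) integration with the paper's ramp initial history:
        # zero cases before t=0, then +100 cases/day over the first week.
        hist_steps = int(TAU1 / DT)                        # seven days of history
        y = list(100.0 * DT * np.arange(hist_steps + 1))   # ramp up to 700 cases

        def y_at(s):                                       # clamped history lookup
            i = int(round(s / DT))
            return y[max(0, min(i, len(y) - 1))]

        for n in range(hist_steps, hist_steps + int(days / DT)):
            t = n * DT
            k1 = rhs(y_at, t, m0, mu3)
            y.append(y[-1] + DT * k1)                # Euler predictor
            k2 = rhs(y_at, t + DT, m0, mu3)          # slope at predicted point
            y[-1] = y[n] + DT * (k1 + k2) / 2        # RK2 corrector
        return np.array(y)

    print(round(R(0, m0=0.23, mu3=0.5), 3))    # City A: R0 ≈ 0.886
    print(round(R(0, m0=0.23, mu3=0.75), 2))   # City B: R0 ≈ 1.16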
Notional City A has m0=0.23 and μ3=1/2, which describes a hard lockdown [15] accompanied by good contact tracing. R0 (i.e. (3) evaluated at y=0) is 0.886. The epidemic ends with a negligible fraction of infected people, as shown below. This and the next five plots are three-way – each plot shows y as a blue line, its derivative ẏ as a green line, and the weekly increments in cases, or epidemiological curve, as a grey bar chart. The bar charts have been reduced by a factor of 7 for clarity of presentation. We report the rates on the left-hand y-axis and the cumulative cases on the right-hand y-axis.

Figure 1: City A extinguishes the epidemic in time.

This is exactly what has happened in New Zealand – that truly most fortunate country has indeed quashed the epidemic completely, with the final case count being a negligible fraction of its total (tiny and sparsely distributed) population.

The parameter values for Notional City B are the same as those for A except that μ3=0.75; a greater fraction of cases escapes the contact tracing drive. R0 is 1.16, and R becomes 1 at y=40,500 cases.

Figure 2: City B grows at first before reaching burnout. The symbol 'k' denotes thousand.

The outbreak enters an exponential regime right after being released. As y increases, R gradually reduces, so the growth slows down until it peaks when the case count is about 39,000 [compare with the value of 40,500 at which R=1 as per (3)]. Thereafter, the disease progresses to extinction in time. The overall progression is very long, but one hopes that the relatively small size of the peak can prevent the overstressing of medical care facilities and thus avoid unnecessary deaths. Delhi and Mumbai in India and Los Angeles in the USA are in all probability cities of this type, since the disease there spiraled out of control despite hard lockdowns being imposed at an early stage.

City B also enables us to explain partial herd immunity. Even though the initial conditions were unfavourable for containment of the epidemic, herd immunity started activating as the disease proliferated. A stable zone (R<1) was entered when only 13.5 percent of the total susceptible population was infected, and a similar percentage again got infected before the epidemic ended. Thus, herd immunity worked in synergy with non-pharmaceutical interventions to stop the epidemic at only a 26 percent infection level, significantly less than the conventional 70-90 percent threshold [16]. This is what we call partial herd immunity. Our findings agree with, and act as an explanation for, the results obtained by Britton et al. [17] and Peterson et al. [18].

We now consider Notional City C, which differs from City B in that m0=0.5; the lockdown is replaced by a much more permissive state. R0 is above 2.5; 180,000 infections are required to bring it below unity.

Figure 3: City C goes to herd immunity – total, not partial. The symbol 'k' denotes thousand and 'L' hundred thousand.

Need one mention that this is a public health disaster? Notional City D combines features of B and C. This city begins with m0=0.5 like City C but reduces to m0=0.23 like City B when the case count reaches 40,000 (the R=1 threshold for B's parameters).

Figure 4: As the input, so the output – D's response combines features of B and C. The symbol 'k' denotes thousand and 'L' hundred thousand.

We can see a case count as well as a total duration intermediate between those of B and C; the epidemic is over in 70 days, but the peak rate of 12,920 cases/day is still very high and likely to load hospital facilities beyond their carrying capacity.

Cities E and F demonstrate the issues faced in reopening. In both these cities, the parameters and case trajectory are identical to those of City A for the first 80 days. Then, E and F reopen on the 80th day by increasing m0 from 0.23 to 0.5 and simultaneously decreasing μ3, i.e. deploying a more effective contact tracing program which had been built up during the lockdown. The post-reopening μ3 values for E and F are 0.1 and 0.2 respectively.
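Such a staged reopening can be simulated by making m0 and μ3 piecewise-constant in time. Below is a minimal sketch reusing our illustrative integrator above (params_at, integrate_staged and the breakpoint format are our inventions; the one-week seeding history is ignored in the day count).

    def params_at(schedule, t):
        # Return the (m0, mu3) pair in force at time t; schedule is an
        # ascending list of (start_day, m0, mu3) breakpoints.
        m0, mu3 = schedule[0][1], schedule[0][2]
        for start, m, mu in schedule:
            if t >= start:
                m0, mu3 = m, mu
        return m0, mu3

    def integrate_staged(schedule, days=300):
        # Same RK2 scheme as integrate(), with time-varying parameters.
        hist_steps = int(TAU1 / DT)
        y = list(100.0 * DT * np.arange(hist_steps + 1))

        def y_at(s):
            i = int(round(s / DT))
            return y[max(0, min(i, len(y) - 1))]

        for n in range(hist_steps, hist_steps + int(days / DT)):
            t = n * DT
            k1 = rhs(y_at, t, *params_at(schedule, t))
            y.append(y[-1] + DT * k1)
            k2 = rhs(y_at, t + DT, *params_at(schedule, t + DT))
            y[-1] = y[n] + DT * (k1 + k2) / 2
        return np.array(y)

    # City E reopens on day 80 with strong tracing; City F with weaker tracing.
    city_e = integrate_staged([(0.0, 0.23, 0.50), (80.0, 0.50, 0.10)])
    city_f = integrate_staged([(0.0, 0.23, 0.50), (80.0, 0.50, 0.20)])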

Figure 5: City E, like City A, is a success story.

Figure 6: Unlike City E, F is a failure story. The symbol 'k' denotes thousand and 'L' hundred thousand.

The difference between Cities E and F is dramatic. Mathematically, R remained less than unity throughout in E; its value after reopening was 0.985. We can see that the case rate decreases monotonically all the time. In F, the post-reopening R became 1.22 and sent the trajectory haywire. In practice however, the incipient increase in the case rate after the 80th day acts as an advance warning of what has happened – the reopening steps should be reversed if it is at all possible to do so while satisfying economic and other external constraints.

Conclusion
In this Article we have presented a new mathematical model for COVID-19 which is simple and elegant in structure but can generate a variety of realistic solution classes. We hope that our work may be of use to mathematicians and data scientists who are trying to understand the spread of the disease in a quantitative manner. The public health implications of these results are reserved for another study.

REFERENCES
[1] S. Jahedi and J. A. Yorke, "When the best pandemic models are the simplest," medRxiv, pp. 1–22, 2020, doi: 10.1101/2020.06.23.20132522.
[2] L. Dell'Anna, "Solvable delay model for epidemic spreading: the case of Covid-19 in Italy," 2020. Available: http://arxiv.org/abs/2003.13571.
[3] A. K. Gupta, N. Sharma, and A. K. Verma, "Spatial Network based model forecasting transmission and control of COVID-19," medRxiv, 2020, doi: 10.1101/2020.05.06.20092858.
[4] J. Mendenez, "Elementary time-delay dynamics of COVID-19 disease," medRxiv, pp. 1–4, 2020, doi: 10.1101/2020.03.27.20045328.
[5] D. C. Ackerly, "Getting COVID-19 twice," VOX. Available: https://www.vox.com/2020/7/12/21321653/getting-covid-19-twice-reinfection-antibody-herd-immunity.
[6] S. McCamon, "13 USS Roosevelt Sailors Test Positive For COVID-19, Again."
[7] Y. Saplakoglu, "Coronavirus reinfections were false positives." Available: https://www.livescience.com/coronavirus-reinfections-were-false-positives.html.
[8] A. Wajnberg et al., "SARS-CoV-2 infection induces robust, neutralizing antibody responses that are stable for at least three months," medRxiv, 2020, doi: 10.1101/2020.07.14.20151126.
[9] B. Shayak and R. H. Rand, "Self-burnout – A New Path to the End of COVID-19," medRxiv, pp. 1–14, 2020, doi: 10.1101/2020.04.17.20069443.
[10] O. Diekmann, J. A. P. Heesterbeek, and M. G. Roberts, "The construction of next-generation matrices for compartmental epidemic models," J. R. Soc. Interface, vol. 7, no. 47, pp. 873–885, 2010, doi: 10.1098/rsif.2009.0386.
[11] "71 percent of patients in Maharashtra are asymptomatic," Mumbai Mirror. Available: https://mumbaimirror.indiatimes.com/coronavirus/news/covid-19-71-of-patients-in-maharashtra-are-asymptomatic-mumbai-cases-at-16579/articleshow/75754328.cms.
[12] "Taking over hospital beds, conducting survey," New Indian Express. Available: https://www.newindianexpress.com/nation/2020/may/30/taking-over-hospital-beds-conducting-survey-uddhav-government-goes-after-covid-19-as-state-tally-c-2149989.html.
[13] "Delhi CM says COVID-19 deaths very less," Times of India. Available: https://timesofindia.indiatimes.com/city/delhi/delhi-cm-says-covid-19-deaths-very-less-but-75pc-cases-asymptomatic-or-showing-mild-symptoms/articleshow/75658636.cms.
[14] M. L. Childs et al., "The impact of long-term non-pharmaceutical interventions on COVID-19 epidemic dynamics and control," medRxiv, 2020, doi: 10.1101/2020.05.03.20089078.
[15] B. Shayak and M. M. Sharma, "Retarded Logistic Equation as a Universal Dynamic Model for the Spread of COVID-19," medRxiv, pp. 1–27, 2020, doi: 10.1101/2020.06.09.20126573.
[16] G. A. D'Souza and D. Dowdy, "What is herd immunity and how can we achieve it with COVID-19?" Available: https://www.jhsph.edu/covid-19/articles/achieving-herd-immunity-with-covid19.html.
[17] T. Britton, F. Ball, and P. Trapman, "The disease-induced herd immunity level for Covid-19 is substantially lower than the classical herd immunity level," pp. 1–15, 2020. Available: http://arxiv.org/abs/2005.03085.
[18] A. A. Peterson, C. F. Goldsmith, C. Rose, A. J. Medford, and T. Vegge, "Should the rate term in the basic epidemiology models be second-order?," 2020. Available: http://arxiv.org/abs/2005.04704.
Public Health Implications of a Delay Differential Equation Model for COVID-19

Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
B Shayak (Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)




ABSTRACT
This paper describes the strategies derived from a novel delay differential equation model [1], signifying a practical extension of our recent work. COVID-19 is an extremely ferocious and unpredictable pandemic which poses unique challenges for public health authorities, on account of which "case races" among various countries and states serve no purpose and present delusive appearances while ignoring significant determinants. We aim to propose comprehensive planning guidelines as a direct implication of our model. Our first consideration is reopening, followed by effective contact tracing and ensuring public compliance. We then discuss the implications of the mathematical results on people's behavior and eventually provide conclusive points aimed at strengthening the arsenal of resources that are helpful in framing public health policies. The knowledge about the pandemic and its association with public health interventions is documented in various literature-based sources. In this study, we explore those resources to explain the findings inferred from the delay differential equation model of COVID-19.

KEYWORDS
Delay differential equation, Contact tracing, Socio-behavioral theories, Lockdown, Reopening

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, San Diego, California, USA
© 2020, Copyright held by the author(s).

1 INTRODUCTION
The national (USA) and global spread of Coronavirus Disease 2019 (COVID-19), following its origins in Wuhan, China in at least December 2019 and possibly earlier still [2], has been alarmingly rapid and deadly. Based on the 25 individual national forecasts received by the CDC, the total reported COVID-19 deaths are predicted to be between 160,000 and 175,000 by August 15th, 2020 [3]. Some features however, both nationally and globally, have proved counterintuitive. For example, a 76-day lockdown resulted in the outbreak's containment in Wuhan, and a similar measure has produced similar results in New Zealand. However, lockdown appeared only marginally effective in New York State, USA, where the case and death counts decreased only after reaching horrifying peak levels [4]. It was contended that the stay-at-home order in New York came too late. This apparent delay was not present in California, USA; the case counts there went up all the same, and the rate is high even today. We would like to mention that such spatiotemporal anomalies are present not just in the US but also in other countries such as Canada, Russia and India [5], which witnessed high case growth despite being in lockdown. In order to better understand the epidemiology of the transmission of COVID-19, we have constructed a delay differential equation model. Here we present its practical implications, which try to encapsulate a myriad of factors associated with the current scenario.

2 MATHEMATICAL MODELING TO UNDERSTAND THE EPIDEMIOLOGY
For many decades, mathematical modelling has been an integral tool for recognizing the trend of disease progression during pandemics. For example, using a simple model explaining the transmission dynamics of an infectious disease between the susceptible, infected and recovered populations (SIR epidemic models), Kermack and McKendrick proposed, and later established, a principle: the level of susceptibility in the population should be adequately high in order for an epidemic to unfold in that population. Such mathematical models can give valuable insights in explaining the epidemiological status of the population, and can predict or calculate the transmissibility of the pathogen and the potential impact of public health preventive practices [6].



             practices [6]. However, a significant body of evidence
         suggests that decisions should be made regarding the parameters
         to be included, being contingent on the impact of the precision of
         predictions. Several policy questions about the containment of
         this outbreak have been considered in our recently proposed
         simple non-linear model [1]This paper delves into the practical
         solutions that can be devised utilizing the directions of our
         models’ outcome.

             In generating interpretable results gathered from
         epidemiological models, we have used the examples of six types
         of cities [1]:

                       1)     City A – Moderately effective contact
                   tracing in a hard lockdown. This city has R
                   (reproductive number) <1 and drives epidemic to
                   extinction in time.
                       2)     City B – Less effective contact tracing in a
                   hard lockdown. It starts off R >1, but reached R =1 at
                   15% infection level. The epidemic ends at 30%
                   infection rate and takes a very long time to get there.
                       3)     City C – Less effective contact tracing (Like
                   City B) with milder restriction on mobility. It proceeds
                   rapidly to herd immunity.
                       4)     City D – Combination of City B and City C.
                   Starts with mild restriction on mobility and progresses
                   towards restriction. The duration of the epidemic as
                   well as of the final case count is between CITIES B
                   and C.
                       5)     City E - Starts off like City A, it reopens with
                   very effective contact tracing and drive the epidemic to
                   extinction in time.
                       6)     City F – Starts off like CITY A, it reopens
                   with less effective contact tracing and suffers a second
                   wave.




                                                                                 Pragmatic implications of our work are as follows:
3 REOPENING CONSIDERATIONS, ROLE OF TESTING

The unemployment generated by the lockdowns is currently forcing countries and states to partially reopen their economies even though many of them have not yet got the virus under control. Reopening is easiest in City A regions, where cases have slowed down to a trickle. With every new case being detected, swift isolation of all potential secondary, tertiary and maybe even quaternary cases, both forward and backward, should prove possible while the rest of the economy functions in a relatively uninhibited way. Even one mass transmission event can restart an exponential growth regime and force a rollback to a fully locked down state. Reopening beyond a skeletal level is impossible in City B regions, which are still in the ascending phase. The ascent implies that contact tracing is already inadequate; if, on top of that, mobility increases, then the region might turn into City C, overstress healthcare systems, and suffer mass fatalities. An ascending B-city has little option other than to contact trace as hard as possible and wait for partial herd immunity to kick in. Only when that happens and the cases slow down on their own can it consider a more extensive reopening like a City A region.

Testing is undoubtedly an important part of the epidemic management process, since it enables the authorities to get an accurate description of the spread of the disease. As we have already discussed, limited testing capacity is giving us a partial or distorted picture in many regions. There is a widespread media perception that extensive testing is one of the prerequisites for any kind of reopening process [7], [8]. Much criticism has also been levelled at certain countries for having inadequate testing programs (we elaborate on the blame aspects later). However, we would like to emphasize that testing is as yet a diagnostic tool and not a preventive one. Currently, it can show us how the disease is behaving but cannot slow its spread in any way. Test-induced slowing can come only when capacity expands to such a level that potential super-spreaders, such as grocers and food workers, can be tested preventively every single day. We hope that such a development may prove possible in the near future – many universities, for example, are making reopening arrangements with provision for very frequent testing of the entire community.

During reopening it is vital to get a true picture of the disease evolution so that we can gauge the effect of any relaxation of restrictions – whether it keeps the outbreak under control as in City E or brings about the beginnings of a second wave as in City F. Such beginnings are heralded by a rise in the case rate. As we saw, there was no such rise in City E even though R increased after the reopening. If the rise takes place, the relaxation must immediately be rolled back to avert disaster. Hence, during reopening, the testing capacity must be high enough to detect such incipient rises (see the sketch at the end of this section). As per China's state media reports, with an aim to reopen the economy, the city of Wuhan conducted 6 million tests in one week; we present this fact without discussion or comment. A second reason why testing is still not all that it could be is the high false-negative rate during the initial stages of infection [9]. Suppose a contact tracing drive identifies Mr X as a potential case, he having been exposed to a known case yesterday. It may be that Mr X contracts the virus ten days from now, in which case he will test negative today or tomorrow but will still amount to a spreading risk ten days later if he is at large then. This also means that secondary contact tracing, i.e. finding Mr X's contacts, must go ahead irrespective of his test results. Indeed, the medical authorities are well aware of this loophole.

The US Chamber of Commerce has issued state-by-state reopening guides for small businesses, which are mandated to be followed across the US. Continued adherence to federal, state, tribal, territorial and local recommendations is of paramount importance.

Prior to resuming work, all workplaces should have a carefully charted exposure control, mitigation and recovery plan. Although essential guidance is specific to each business, certain measures can be generally adopted across all workplaces.

1) Reopening in phases – The US government has laid down guidelines to reopen the country in three phases. The first phase involves vulnerable individuals continuing to remain at home. When in public, people are expected to wear masks, maintain maximum physical separation, avoid places with more than 10 people, and limit non-essential travel. The second phase allows gatherings of 50 people, some non-essential travel, and the reopening of schools. The third phase involves relaxation of restrictions, permitting vulnerable populations to operate in public.

2) Defining new metrics – The post-corona world will witness significant changes in regulatory controls and behavioral drift in personal and professional spheres. Cleanliness standards, safety standards, and infection prevention practices with regular monitoring and inspection for their assurance are some of the new terms that will have to be a part of people's daily life for at least the next few months.

3) Organizational changes – To keep essential operations functioning, companies and organizations will have to be prepared with advanced IT systems (in case remote working continues), supplies of PPE, travel arrangements that avoid public transport, and behavioral health services, and must leave no stone unturned in overcoming biological, physical, and emotional challenges. We can see that the above guidelines broadly conform to our model predictions.
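The rollback trigger discussed above amounts to watching a smoothed case rate for sustained growth after reopening. A minimal detection sketch follows; the 7-day window and 5 percent week-over-week tolerance are our illustrative choices, not thresholds from the paper.

    import numpy as np

    def first_incipient_rise(daily_cases, window=7, tol=1.05):
        # Return the first index at which the smoothed case rate grows
        # week-over-week by more than `tol`, i.e. a possible second wave.
        smooth = np.convolve(daily_cases, np.ones(window) / window, mode="valid")
        for t in range(window, len(smooth)):
            if smooth[t] > tol * smooth[t - window]:
                return t + window - 1  # index into the original series
        return None

    # Synthetic trajectory: decline under lockdown, a controlled plateau
    # after reopening (City E-like), then a resurgence (City F-like).
    cases = np.concatenate([np.linspace(500, 50, 60),
                            np.full(30, 50.0),
                            np.linspace(50, 400, 40)])
    day = first_incipient_rise(cases)
    print(f"roll back relaxations around day {day}" if day is not None
          else "no rise detected")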
4 METHODS OF CONTACT TRACING

As we have already mentioned, contact tracing is probably the single most important factor in determining the progression of COVID-19 in a region. We can see from the model that the faster the contact tracing takes place, the better; the more delay we have, the higher R becomes. Moreover, our model does not account for backward contact tracing; in practice, however, a sufficiently high level of detection might not be possible with forward contact tracing alone. As important as it is, contact tracing is also one of the trickiest aspects to handle, since it can interfere with people's privacy. In classical contact tracing, human tracers talk to the confirmed cases and track down their movements, as well as the persons they interacted with over the past couple of days. This method has worked well in Ithaca, USA and in Kerala, India. While it is the least invasive of privacy, it is also the most unreliable, since people might not remember their movements or their interactions correctly; it is also the slowest. A more sophisticated variant supplements human testimony with CCTV footage and credit/debit card transaction histories – an approach that is possible only in countries such as the USA where card usage predominates over cash. The most sophisticated contact tracing algorithms use artificial intelligence together with location-tracking mobile devices and apps – while they are quick and fool-proof, they automatically raise issues of privacy and security. For example, the TraceTogether app in Singapore, which worked very well during the initial phases of the outbreak, has not found popularity with many users [10]. Similarly, India's Aarogya Setu has also raised privacy concerns [11]. Americans too have expressed their aversion to contact tracing apps in a recent poll, with only 43 percent of respondents saying that they trusted companies like Google or Apple with their data.

5 ENSURING SOCIAL COMPLIANCE – A BEHAVIORAL PERSPECTIVE

As the epidemic drags on, the continued restrictions on social activity are becoming more and more unbearable. There is an increasing tendency, especially among younger people who are much less at risk of serious symptoms, to violate the restrictions and spread the disease through irresponsible actions. However, as in City F, a rise in violator behavior can completely nullify the effects of the lockdown of the past few weeks or months. Here we discuss how public health professionals and policy makers can resort to behavioral and psychological theories to ensure compliance among the common people. The most widely used model is the Health Belief Model, which has been applied successfully to public health challenges. We briefly discuss the utility of this model in the current situation.

The Health Belief Model hypothesizes that interventions will be most effective if they target key factors that influence health behaviors, such as perceived susceptibility, perceived severity, perceived benefits, perceived barriers to action, exposure to factors that prompt action, and self-efficacy. In general, this model can be used to design short- and long-term interventions. The prime components of this model which are relevant in the current scenario can be outlined as follows.

1) Conducting a health needs assessment to determine the target population – The best example is the demarcation of zones in India depending on the level of risk: a red zone is highest risk, an orange zone is average risk, and a green zone means no cases in the last 21 days. The classification is multifactorial, taking into account the incidence of cases, the doubling rate, and the extent of testing and surveillance feedback in classifying the districts.

2) Communicating the consequences involved with risky behaviors in a transparent manner – Central and state ministers, as well as public health authorities, are in constant communication with the masses.

3) Conveying information about the steps involved in performing the recommended action and focusing on the benefits of action – Famous celebrities, in addition to state and central governments, spread messages explaining the required steps cogently, ensuring maximum reach, especially among social media-addicted millennials and similar populations.

4) Being open about issues and barriers, identifying them at an early stage and working toward resolution – Activating all sorts of helpline numbers, email addresses, personal offices, etc. to address any grievances around the topic.

5) Developing skills and providing assistance that encourage self-efficacy and the possibility of positive behavior change – Adequate arrangements for people from lower socio-economic strata, stable and trustworthy financial schemes for the middle class, plans to support small businesses, and a means of bridging the affluent and the needy are some of the ways to foster positive behavior change and develop natural trust.

Other than the Health Belief Model, some theories that can be useful are:

Theory of Reasoned Action – This theory implies that an individual's behavior is based on the outcomes the individual expects as a result of that behavior. In a practical scenario, if health officials want people to follow a particular course of action, say one based on our model, they need to reinforce the advantages of the targeted behavior and strategically address the barriers. For instance, to enforce separation minima even when they are apparently proving ineffective and cases are increasing, officials can use the examples of Cities B and C to convince citizens that violations – and hence violators – can be responsible for thousands of excess deaths.

Trans-theoretical Model – This model posits that any health behavior change entails progress through six stages of change: precontemplation, contemplation, preparation, action, maintenance and termination. For instance, it was observed in March that, despite a rise in cases in New York City (NYC), people were not observing social restrictions the way they should have. We can now see that, with passing time, the behavior of the masses has transformed according to the stages of this model.

Precontemplation – In this stage people are typically not cognizant of the fact that their behavior is troublesome and may cause undesirable consequences; an actual behavior change is still a long way off. This phase coincides with the commencement of cases in NYC.

Contemplation – Recognition of the behavior as problematic begins to surface, and a shift towards behavior change begins. When cases started being reported all over the media and the major causes of spread began to surface, citizens started paying attention to their activities.
Preparation – People start taking small steps toward behavior change, in our case exhibiting hygienic practices and observing the six-feet separation minima.

Action – This stage covers the phase where people have just changed their behavior and have a positive intention to maintain that approach. In this instance, people continue to practice social restrictions and hygiene.

Maintenance – This stage focuses on maintaining and continuing the adopted approach. The majority of people in NYC are exhibiting positive behavior and maintaining it throughout the stages of reopening. This is vitally important to ensure that NYC stops at partial herd immunity like City D instead of blowing up again like City C.

Termination – There is a lack of motivation to return to the unhealthy behaviors, and some sections of people across the country and the world will continue practicing good hygiene (though not social restrictions!) in their day-to-day lives.

Social Ecological Theory – This theory highlights the multiple levels of influence that mold a decision. In our case, suppose the decision is to maintain sufficient physical separation once offices open up. To follow this successfully, a complex interplay of individual, relationship, community and societal factors comes into action, and law enforcement authorities need to take this into consideration. A group of individuals, motivated by one another to follow the guidelines, builds a good connection within society, and in turn there is a high probability of building a healthy network within a defined area. A negative interplay at the different levels of motivation may, in turn, prove disastrous and cause all efforts to go down the drain. A perfect illustration of this in the present condition is how various NGOs are working in conjunction with public health authorities to bring about change at the individual level through door-to-door campaigning. This propels the behavior of even the most potentially recalcitrant population in the most desirable direction, i.e. wearing masks and gloves, adopting hand hygiene, being cognizant of symptoms arising in any member of the family, and following quarantine rules in case of travel from other states.

6 SOCIAL ATTITUDES AND BEHAVIOUR

In this section we address another important issue related to the coronavirus: the widely heterogeneous case profiles of different regions have often led to "corona contests" among them. Far too often, the residents of better-off regions are seen heaping scorn on worse-hit regions. We have selected a tiny handful of representative media articles, castigating the approaches of India, the USA and Sweden, to show the breadth and vitriol of such commentary [12][13][14][15][16]. A feature common to almost all opinion pieces like these is that their authors do not have the slightest knowledge of the issues involved, either epidemiological or economic.

Before embarking on criticisms, we should note that policy decisions need to be taken in real time, as the situation evolves. The authorities do NOT have the benefit of hindsight to decide on their course of action. Since the virus is a new one, there is no precedent which can act as a model. Even among emerging infectious diseases, this latest one is particularly unpredictable, since minuscule changes in parameters can cause dramatic changes in the system's behavior. This phenomenon is best illustrated by the notional cities discussed previously. For example, to get from City A to City B, all we did was increase by 50 percent the fraction of people who escaped the contact tracers' net; the result was a 30-times (not 30 percent!) increase in the total number of cases. Similarly, the difference between Cities B and D is an 11-day delay in imposing the lockdown in D (recall that the first seven days in the plots are the seeding period, so they do not count). Eleven days out of a 200-plus-day run might not sound like a lot, but it was enough to create tens of thousands of additional cases and risk overstressing healthcare systems, while at the same time shortening the epidemic duration by a factor of three.

Further uncertainty comes from the fact that the parameter values are changing constantly. It is well known that the reported fraction of asymptomatic carriers has increased continuously over the last three months or so. Considering the sensitivity of this or any other model to parameter values, such changes can completely invalidate the results of a model, as well as any decision made on their basis. Identifying potential exposures is much easier in a small city than in a large or densely populated one. It is also more effective if the cases come mostly from the sophisticated social class, who can use mobile phone contact tracing apps or otherwise keep (at least mental) records of their movements and of the people they interacted with. If, however, there is an outbreak among the unsophisticated class, then even the most skillful contact tracer might run up against a wall of zero or false information. In such cases the authorities are left with few options for proceeding in a conducive manner.

India went into lockdown on 25 March 2020. At the time, the official figures stated that there were only 571 cases, which made the decision appear premature to many people. Indeed, a seven-day delay of the lockdown was suggested so that migrant workers would have been able to return to their homes. However, when the lockdown was imposed, testing had been woefully inadequate, with a nationwide total of just 22,694 tests conducted up to that date. If we use the extrapolation technique of inferring case counts from death counts, then, using the same 1 percent mortality rate and 20-day interval from infection to death, we find almost 40,000 assumed cases on the day the lockdown began (a worked sketch of this extrapolation follows this paragraph). If we go by this figure, then the lockdown was not really early, and possibly should have been enforced earlier still in trouble zones such as Mumbai. Certainly, if the figure of 40,000 cases is true, then one further week of normal life (with huge crowds in trains and railway stations) might have been disastrous. From the vantage point of today, alternative arrangements should definitely have been made much earlier for the rehabilitation of the migrant workers. However, these arrangements would have involved considerable complexity in the prevailing situation, and were certainly not as easy as a one-week delay in announcing the lockdown.
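The death-based extrapolation above takes one line to state: infections on day t are approximately the deaths on day t + 20 divided by the mortality rate. A minimal sketch, using the 1 percent rate and 20-day interval quoted in the text; the 400-death input is simply the figure implied by the authors' own 40,000-case estimate, not a quoted statistic.

    def infer_cases(deaths_20_days_later, mortality_rate=0.01):
        # Infections on day t, assuming deaths(t + 20) = infections(t) * IFR.
        return deaths_20_days_later / mortality_rate

    print(infer_cases(400))  # -> 40000.0 cases implied on lockdown day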
Sweden, which has adopted a controlled herd immunity strategy, has been accused of playing with fire. It is also possible that the Swedish authorities are aware that they do not have the contact tracing capacity required to perform like City A and hence are attempting something like City D – a faster end to the epidemic than City B at the expense of a higher case count. To make a comprehensive analysis of their policy, it is crucial to know not only the last intricate detail of the epidemiological aspects but also the details of the economic considerations; that is almost impossible. On a different note, however, we have seen reports [17], [18] stating that the virus has entered old age homes and similar establishments, causing hundreds of deaths there. Assuming that these reports are not overturned in the course of time, allowing the ingress of the virus into high-risk areas is an indefensible action, whatever the overall epidemiological strategy.

Finally, extremely important public health factors such as the racial dependence of susceptibility and/or transmissibility have just started coming to the surface. Another complete grey area is the mutations which this new and vicious virus is undergoing, and what effect they might have on the spreading dynamics. Some reports also suggest that changes in genetic composition due to mutation might be the reason behind the huge differences in crude infection rates between countries [19][20]. In the absence of a clear picture, any public health measure is all the more likely to be a random guess, with non-zero probabilities of both success and failure. Not everything about corona is random or outside one's control, though. Amongst the European countries, we can see that Germany, Austria, Switzerland, Denmark, Norway and Finland have definitely managed the epidemic while their neighbors have not, which rules out some hidden luck factor. The same has happened with Kerala and Karnataka in India. This has been feasible only due to governmental awareness and hard work, and people's cooperation. Similarly, there are some governments which have been clearly guilty of negligence or hubris in their management of the disease. It would also be worthwhile to observe and take lessons from some of the places recently identified as potential hotspots of this pandemic, such as Alabama, Arkansas, Florida and Texas. Lastly, our conclusion best resonates with the message that coronavirus is not some kind of race but a public health disaster, and we should adopt a unified approach to the fight against it.

CONCLUSION

Here we summarize the take-home messages of this paper:

• A city can reopen only if it is past the peak of cases. Reopening must be accompanied by robust contact tracing. The US CDC has laid down a set of reopening guidelines which are compatible with our model and its solutions.

• Socio-behavioral theories can be incorporated for the effective execution of interventional strategies.

• Efficiency of contact tracing comes at the expense of people's privacy – balancing the two is a delicate optimization problem.

• In some regions, restrictions such as masks and six-feet separation minima must be maintained for a very long time to come. Public health authorities can ensure compliance by resorting to socio-behavioral theories and approaches.

• In deploying advanced contact tracing techniques, significant consideration has to be given to ensuring high data security and to laying down privacy regulations that are convincing to the users.

• Control the spread by swift identification and isolation of cases, accompanied by tracing and quarantine for at least 2 weeks.

• Empowerment of individuals and communities by the government can facilitate efficient capacity building.

• Multidisciplinary coordination and strong leadership to mobilize communities and take quick decisions, coupled with thoughtful development of operation plans, are likely to prove considerably efficient in handling this pandemic to the best of our capacity.

References

[1] B. Shayak and M. M. Sharma, "Retarded logistic equation as a universal dynamic model for the spread of COVID-19," medRxiv 2020.06.09.20126573, 2020, doi: 10.1101/2020.06.09.20126573.

[2] E. O. Nsoesie, B. Rader, Y. L. Barnoon, L. Goodwin, and J. S. Brownstein, "Analysis of hospital traffic and search engine data in Wuhan China indicates early disease activity in the Fall of 2019," Harvard, 2020. [Online]. Available: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42669767.

[3] CDC, "Forecasting COVID-19 in the US," 2020. https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html.

[4] "Microsoft coronavirus webpage." https://www.bing.com/covid.

[5] "COVID-19 in India." [Online]. Available: https://www.covid19india.org/.

[6] L. Star and S. Moghadas, "The Role of Mathematical Modelling in Public Health Planning and Decision Making," Natl. Collab. Cent. Infect. Dis., pp. 285–299, 2010.

[7] Livemint, "Many states are far short of COVID-19 testing levels." https://www.statnews.com/2020/04/27/coronavirus-many-states-short-of-testing-levels-needed-for-safereopening/.
[8] Harvard Business Review, "A Plan to Safely Reopen the U.S. Despite Inadequate Testing." https://hbr.org/2020/05/a-plan-to-safely-reopen-the-u-s-despite-inadequate-testing.

[9] L. M. Kucirka, S. A. Lauer, O. Laeyendecker, D. Boon, and J. Lessler, "Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction-Based SARS-CoV-2 Tests by Time Since Exposure," Ann. Intern. Med., vol. 173, no. 4, pp. 262–267, 2020, doi: 10.7326/M20-1495.

[10] M. Lee, "Given low adoption rate of TraceTogether, experts suggest merging with SafeEntry or other apps," Today, 2020. https://www.todayonline.com/singapore/given-low-adoption-rate-tracetogether-experts-suggest-merging-safeentry-or-other-apps.

[11] A. Zargar, "Privacy, security concerns as India forces virus-tracking app on millions," CBS News.

[12] K. Bajpai, "Five lessons of COVID." https://timesofindia.indiatimes.com/blogs/toi-editpage/five-lessons-of-covid-factors-that-are-negative-for-india-are-having-greater-impact-than-mitigating-ones/.

[13] K. Grimes, "Is politics the reason why Gov. Newsom is keeping California locked down?," California Globe.

[14] R. Guha, "What Modi got wrong on COVID-19 and how he can fix it." https://www.ndtv.com/opinion/5-lessons-for-modi-on-covid-19-by-ramachandra-guha-2227259.

[15] K. Weintraub, "Sweden sticks with controversial COVID approach." [Online]. Available: https://www.webmd.com/lung/news/20200501/sweden-sticks-with-controversial-covid19-approach.

[16] The Island Now, "Cuomo has failed in his handling of coronavirus." https://theislandnow.com/opinions-100/readers-write-cuomo-has-failed-in-handling-of-coronavirus/.

[17] "Are care homes the dark side of Sweden's coronavirus strategy?" https://www.euronews.com/2020/05/19/are-care-homes-the-dark-side-of-sweden-s-coronavirus-strategy.

[18] "What's going wrong in Sweden's care homes."

[19] L. van Dorp et al., "Emergence of genomic diversity and recurrent mutations in SARS-CoV-2," Infect. Genet. Evol., vol. 83, p. 104351, 2020, doi: 10.1016/j.meegid.2020.104351.

[20] H. Ellyatt, "Coronavirus no longer exists clinically – controversy," CNBC. https://www.cnbc.com/2020/06/02/claim-coronavirus-no-longer-exists-provokes-controversy.html.
Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets

Jitin Krishnan, Department of Computer Science, George Mason University, Fairfax, VA, jkrishn2@gmu.edu
Hemant Purohit, Department of Information Sciences & Technology, George Mason University, Fairfax, VA, hpurohit@gmu.edu
Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA, rangwala@gmu.edu

ABSTRACT

State-of-the-art models for cross-lingual language understanding such as XLM-R [7] have shown great performance on benchmark data sets. However, they typically require some fine-tuning or customization to adapt to downstream NLP tasks in a domain. In this work, we study the unsupervised cross-lingual text classification task in the context of the crisis domain, where rapidly filtering relevant data regardless of language is critical for improving the situational awareness of emergency services. Specifically, we address two research questions: a) Can a custom neural network model over XLM-R, trained only in English for such a classification task, transfer knowledge to multilingual data and vice-versa? b) By employing an attention mechanism, does the model attend to words relevant to the task regardless of the language? To this end, we present an attention realignment mechanism that utilizes a parallel language classifier to minimize linguistic differences between the source and target languages. Additionally, we pseudo-label the tweets from the target language, which are then augmented with the tweets in the source language for retraining the model. We conduct experiments using Twitter posts (tweets) labelled as a 'request' in the open source data set by Appen¹, consisting of multilingual tweets for crisis response. Experimental results show that attention realignment and pseudo-labelling improve the performance of unsupervised cross-lingual classification. We also present an interpretability analysis by evaluating the performance of attention layers on original versus translated messages.

KEYWORDS

Social Media, Crisis Management, Text Classification, Unsupervised Cross-Lingual Adaptation, Interpretability

ACM Reference Format:
Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. In Proceedings of the KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 7 pages.

¹ https://appen.com/datasets/combined-disaster-response-data/

1 INTRODUCTION

Social media platforms such as Twitter provide valuable information to aid emergency response organizations in gaining real-time situational awareness during the sudden onset of crisis situations [4]. Extracting critical information about affected individuals, infrastructure damage, medical emergencies, or food and shelter needs can help emergency managers make time-critical decisions and allocate resources efficiently [15, 21, 22, 30, 31, 36]. Researchers have designed numerous classification models to help towards this humanitarian goal of converting real-time social media streams into actionable knowledge [1, 22, 26, 28, 29]. Recently, with the advent of multilingual models such as multilingual BERT [9] and XLM [20], researchers have started adapting them to multilingual disaster tweets [6, 25]. Since XLM-R [7] has been shown to be the best-performing model in cross-lingual language understanding, we restrict our work to this model to explore the aspects of cross-lingual transfer of knowledge and interpretability.

Figure 1: Problem: Unsupervised cross-lingual tweet classification, e.g., train a model using English tweets, predict labels for multilingual tweets, and vice-versa.

In this work, we address two questions. The first is to examine whether XLM-R is effective in capturing multilingual knowledge, by constructing a custom model over it to analyze whether a model trained using English-only tweets will generalize to multilingual data and vice-versa. Social media streams are generally different from other text, given the user-generated content. For example, tweets are usually short, with possible errors and ambiguity in the behavioral expressions. These properties in turn make the classification task, or extracting representations, more challenging. The second question
is to examine whether word translations will be equally attended by the attention layers. For instance, the words with higher attention weights in a sentence in Haitian Creole such as "Tanpri nou bezwen tant avek dlo nou zon silo mesi" should align with the words in its corresponding translated tweet in English, "Please, we need tents and water. We are in Silo, Thank you!". Our core idea is that if 'dlo' in the Haitian tweet has a higher weight, so should its English translation 'water'. This word-level language-agnostic property can make machine learning models more interpretable. It also brings several benefits to downstream tasks such as knowledge graph construction using keywords extracted from tweets. In situations where data is available in only one language, this similarity in attention would still allow us to extract relevant phrases in cross-lingual settings. To the best of our knowledge, aligning attention in a cross-lingual setting has not been attempted before in the crisis analytics domain. In this work, we restrict our classification experiments to tweets containing the 'request' intent; we will expand to other behaviors, tasks, and datasets in the future.

Contributions: We propose a novel attention realignment method which promotes the task classifier to be more language agnostic, and which in turn tests how effectively the XLM-R model captures multilingual knowledge for crisis tweets, together with a pseudo-labelling procedure that further enhances the model's generalizability. Further, incorporating the attention-based mechanism allows us to perform an interpretability analysis on the model, by comparing how words are attended in the original versus translated tweets.

2 RELATED WORK AND BACKGROUND

There are numerous prior works (cf. surveys [4, 14]) that focus specifically on disaster-related data, performing classification and other rapid assessments during the onset of a new disaster event. A crisis period is an important but challenging situation in which collecting labeled data during an ongoing event is very expensive. This problem has led to several works on domain adaptation techniques in which machine learning models can learn and generalize to unseen crisis events [3, 10, 18, 23]. In the context of crisis data, Nguyen et al. [28] designed a convolutional neural network model which does not require any feature engineering, and Alam et al. [1] designed a CNN architecture with adversarial training on graph embeddings. Krishnan et al. [19] showed that sharing a common layer across multiple tasks can improve the performance of tasks with limited labels.

In the multilingual or cross-lingual direction, many works [8, 17] have tried to align word embeddings (such as fastText [27]) from different languages into the same space, so that a word and its translations have the same vector. These models have been superseded by models such as multilingual BERT [9] and XLM-R [7], which produce contextual embeddings and can be pretrained on several languages together to achieve impressive performance gains on multilingual use-cases.

The attention mechanism [2, 24] is one of the most widely used methods in deep learning; it constructs a context vector by weighing the entire input sequence, improving over previous sequence-to-sequence models [13, 34, 35]. As the model produces weights associated with each word in a sentence, this allows interpretability to be evaluated by comparing the words that are given priority in original versus translated tweets.

With more and more machine learning systems being adopted across diverse application domains, transparency in decision-making inevitably becomes an essential criterion, especially in high-risk scenarios [12] where trust is of utmost importance. With deep neural networks, including natural language systems, shown to be easily fooled [16], there have been many promising ideas that empower machine learning systems with the ability to explain their predictions [5, 32]. Gilpin et al. [11] present a survey of interpretability in machine learning, providing a taxonomy of research that addresses various aspects of this problem. Similar to the work by Ross et al. [33], we employ an attention-based approach to evaluate model interpretability applied to the crisis domain.

3 METHODOLOGY

3.1 Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification

Consider tweets in language A and their corresponding translated tweets in language B. The task of unsupervised cross-lingual classification is to train a classifier using data only from the source language and predict the labels for the data in the target language. This experimental set-up is represented as A → B for training a model using A and testing on B, or B → A for training a model using B and testing on A. X refers to the data and Y refers to the ground-truth labels. The multilingual dataset used in our experiments consists of original multilingual (ml) tweets and their translated (en) tweets in English. To summarize:

Experiment A (en → ml):
Input: X_en, Y_en, X_ml
Output: Y_ml^pred ← predict(X_ml)

Experiment B (ml → en):
Input: X_ml, Y_ml, X_en
Output: Y_en^pred ← predict(X_en)
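The two experiments reduce to one training harness with the roles of the two sets swapped. The sketch below is ours, with a generic build_classifier() placeholder standing in for the paper's actual model:

    def run_experiment(X_source, y_source, X_target, build_classifier):
        # Unsupervised cross-lingual set-up: fit on labelled source-language
        # data only, then predict labels for the unlabelled target language.
        model = build_classifier()
        model.fit(X_source, y_source)
        return model.predict(X_target)

    # Experiment A (en -> ml): y_ml_pred = run_experiment(X_en, y_en, X_ml, build_classifier)
    # Experiment B (ml -> en): y_en_pred = run_experiment(X_ml, y_ml, X_en, build_classifier)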
3.2 Overview

In the following sections, we propose two methodologies to enhance cross-lingual classification: 1) attention realignment and 2) pseudo-labelling. Attention realignment utilizes a language classifier, trained in parallel, to realign the attention layer of the task classifier so that the weights are geared more towards task-specific words regardless of the language. Pseudo-labelling further enhances the classifier by adding high-quality seeds from the target language that are pseudo-labelled by the task classifier.

3.3 Attention Realignment by Parallel Language Classifier

As depicted in Fig. 2, the model on the left side is the task classifier and the model on the right side is a language classifier that is trained in parallel. The purpose of this language classifier is to pick up aspects that are missed by the XLM-R model. These could be tweet-specific, crisis-specific, or other linguistic nuances that can separate original tweets from translated tweets. Note that, semantically, translated words are expected to have similar XLM-R representations.
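Structurally, the two branches of Figure 2 can be sketched as below. This is a minimal reading of the architecture, assuming precomputed XLM-R token embeddings as input and a BiLSTM-with-attention encoder per branch; layer sizes are illustrative (T_x = 30 follows Table 3, and 768 is the usual XLM-R base hidden size).

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    T_x, d_model = 30, 768  # sequence length; XLM-R embedding size

    def branch(x, name):
        # BiLSTM encoder with a simple additive attention head; returns
        # the attended sentence vector and the attention weights alpha.
        h = layers.Bidirectional(layers.LSTM(64, return_sequences=True),
                                 name=f"{name}_bilstm")(x)
        scores = layers.Dense(1, name=f"{name}_scores")(h)        # (batch, T_x, 1)
        alpha = layers.Softmax(axis=1, name=f"{name}_alpha")(scores)
        context = tf.reduce_sum(alpha * h, axis=1)                # weighted sum over time
        return context, alpha

    inputs = layers.Input(shape=(T_x, d_model))  # frozen XLM-R embeddings
    task_vec, alpha_task = branch(inputs, "task")
    lang_vec, alpha_lang = branch(inputs, "lang")
    task_out = layers.Dense(1, activation="sigmoid", name="request")(task_vec)
    lang_out = layers.Dense(1, activation="sigmoid", name="language")(lang_vec)
    model = Model(inputs, [task_out, lang_out])  # the two classifiers in parallel

In this sketch, the attention vectors alpha_task and alpha_lang returned by each branch are the quantities that the realignment operations and loss terms below operate on.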

Figure 2: Attention Realignment with Pseudo-Labelling over the XLM-R model

    Notation    Definition
    en          Tweets translated to English ('message' column in the dataset)
    ml          Multilingual tweets ('original' column in the dataset)
    α           Attention layer
    T           A component that uses task-specific data, i.e., + and − 'request' tweets
    L           A component that uses language-specific data, i.e., en and ml tweets
    a_BiLSTM    Activation from the BiLSTM layer
    β, γ, ζ     Hyperparameters

    Table 1: Notations

Attention realignment is a mechanism we introduce to promote the task classifier to be more language independent. The main idea is that the words that are given higher attention in a language classifier should be less important in a task classifier. For example, 'dlo' in Haitian and 'water' in English should have the same vector representation in language-agnostic models, while the sentence structure, grammar, and other nuances can vary. We enforce this rule by constructing two operations:

(1) Attention Difference: When a sentence goes through model M1, it also goes through model M2. For the same sentence, this returns two attention layer weights: one from the task classifier (α_T) and the other from the language classifier (α_T′). Directly subtracting α_T′ from α_T poses two issues: 1) we do not know whether they are comparable, and 2) α_T′ may have negative values. A simple solution is to normalize both vectors and clip α_T′ so that it lies between 0 and 1. An attention subtraction step is thus:

    α_T/∥α_T∥ − γ_T · clip(α_T′/∥α_T′∥, 0, 1)    (1)

where γ_T is a hyperparameter that tunes the amount of subtraction needed for the task classifier.
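A direct transcription of Equation (1) in numpy, under our reading of the notation: L2-normalize both attention vectors, clip the language classifier's vector to [0, 1], and subtract a γ-scaled portion of it. The default γ = 0.01 follows Table 3.

    import numpy as np

    def attention_difference(alpha_task, alpha_lang, gamma=0.01):
        # Eq. (1): suppress, in the task attention, the positions that the
        # parallel language classifier attends to.
        a_t = alpha_task / np.linalg.norm(alpha_task)
        a_l = np.clip(alpha_lang / np.linalg.norm(alpha_lang), 0.0, 1.0)
        return a_t - gamma * a_l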
Similarly, for the language classifier,

    $\vec{\alpha}'_L = \frac{\vec{\alpha}'_L}{\lVert \vec{\alpha}'_L \rVert} - \gamma_L \, \mathrm{clip}\left( \frac{\vec{\alpha}_L}{\lVert \vec{\alpha}_L \rVert}, 0, 1 \right)$    (2)

    $L$                       A component that uses language-specific data, i.e., $en$ and $ml$ tweets
    $a_{BiLSTM}$              Activation from the BiLSTM layer
    $\beta, \gamma, \zeta$    Hyperparameters

                              Table 1: Notations
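To make the subtraction step concrete, here is a minimal NumPy sketch of Eqs. (1)–(2); the function and variable names are ours, not the authors' released code:

```python
import numpy as np

def attention_subtraction(alpha, alpha_other, gamma):
    """Eqs. (1)-(2): normalize both attention vectors so they become
    comparable, clip the other classifier's vector to [0, 1] so it has
    no negative components, then subtract a gamma-scaled copy of it."""
    alpha = alpha / np.linalg.norm(alpha)
    other = np.clip(alpha_other / np.linalg.norm(alpha_other), 0.0, 1.0)
    return alpha - gamma * other

# Eq. (1) realigns the task attention; Eq. (2) is the same operation
# with the roles of the two attention vectors swapped.
alpha_T       = np.array([0.10, 0.60, 0.30])  # task attention over 3 tokens
alpha_T_prime = np.array([0.50, 0.20, 0.30])  # language attention, same tweet
alpha_T = attention_subtraction(alpha_T, alpha_T_prime, gamma=0.01)
```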
    (2) Attention Loss: Along with the attention difference, the model can also be trained by inserting an additional loss term that penalizes the similarity between the attention weights of the two classifiers. We use the Frobenius norm:

    $L_{At} = \lVert \vec{\alpha}_T^{\top} \vec{\alpha}'_T \rVert_F^2$    (3)

    $L_{Al} = \lVert \vec{\alpha}_L^{\top} \vec{\alpha}'_L \rVert_F^2$    (4)

for the task and language classifiers respectively. The resulting loss function for joint training is:

    $L(\theta) = \zeta_T \left( CE_T + \beta_T L_{At} \right) + \zeta_L \left( CE_L + \beta_L L_{Al} \right)$    (5)
where $\beta$ is the hyperparameter that tunes the attention-loss weight, $\zeta$ is the hyperparameter that tunes the joint-training loss, and $CE$ denotes the binary cross-entropy loss:

    $CE = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$    (6)

It is important to note that the Frobenius norm is not taken simply between the attention weights of the two models, but between the attention weights produced by the two models on the same input tweet. For a given tweet, the task classifier attends more to task-specific words and the language classifier to language-specific words, so the mechanism makes sure that the two stay distinct.
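One way the loss terms could be wired together in TensorFlow/Keras is sketched below. Reading Eqs. (3)–(4) as a per-tweet squared dot product between the two attention vectors, and the grouping of terms in Eq. (5), follow our reconstruction of the equations; the default values come from Table 3.

```python
import tensorflow as tf

def attention_loss(alpha, alpha_prime):
    """Eqs. (3)-(4): penalize overlap between the two classifiers'
    attention vectors for the same tweet (shape: batch x tokens)."""
    dot = tf.reduce_sum(alpha * alpha_prime, axis=-1)  # per-tweet similarity
    return tf.reduce_mean(tf.square(dot))

def joint_loss(ce_task, ce_lang, l_at, l_al,
               zeta_t=1.0, zeta_l=0.1, beta_t=0.01, beta_l=0.01):
    """Eq. (5); ce_task/ce_lang are the binary cross-entropy losses of
    Eq. (6), e.g., from tf.keras.losses.binary_crossentropy."""
    return zeta_t * (ce_task + beta_t * l_at) + zeta_l * (ce_lang + beta_l * l_al)
```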
3.4    Pseudo-Labelling
To enhance the model further, we pseudo-label the data in the target language. For example, if we are training a model using the English tweets, we use the original tweets before translation for pseudo-labelling. The idea is simply to gather high-quality seeds from the target language to retrain the model. Note that we still do not use any target labels here, in keeping with the unsupervised goal. Thus, for retraining model M1 for $en \to ml$, the new dataset consists of $X_{en}^{+}$ and $X_{ml}^{pseudo+}$ as positive examples, and $X_{en}^{-}$ and $X_{ml}^{pseudo-}$ as negative examples.
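A minimal sketch of the seed-selection step, assuming a trained Keras-style model whose `predict` returns the positive-class probability; the $p > 0.7$ threshold is the one used for model M2 in Section 5, while the symmetric cut-off for negative seeds is our assumption:

```python
import numpy as np

def select_seeds(model, target_x, threshold=0.7):
    """Pseudo-label unlabelled target-language tweets (NumPy array of
    encoded inputs) and keep only high-confidence seeds; no gold target
    labels are used, preserving the unsupervised goal."""
    probs = model.predict(target_x).ravel()   # P(positive) per tweet
    pos = target_x[probs > threshold]         # X_ml^{pseudo+}
    neg = target_x[probs < 1.0 - threshold]   # X_ml^{pseudo-}
    return pos, neg
```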

3.5    XLM-R Usage
The recommended feature usage of XLM-R (https://github.com/facebookresearch/XLM) is either fine-tuning on the task or aggregating features from all of its 25 layers. We employ the latter to extract the multilingual embeddings for the tweets.
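As an illustration of layer aggregation, the sketch below extracts XLM-R features with the HuggingFace transformers port rather than the facebookresearch/XLM release cited above; averaging the 25 hidden states (embedding layer plus 24 transformer layers) is our choice of aggregation, as the paper does not specify one.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFXLMRobertaModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = TFXLMRobertaModel.from_pretrained("xlm-roberta-large",
                                          output_hidden_states=True)

def tweet_embeddings(text):
    """Return one embedding per token, averaged over all 25 hidden
    states of XLM-R large (embeddings + 24 transformer layers)."""
    inputs = tokenizer(text, return_tensors="tf")
    outputs = model(inputs)
    stacked = tf.stack(outputs.hidden_states)   # (25, 1, seq, 1024)
    return tf.reduce_mean(stacked, axis=0)[0]   # (seq, 1024)
```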
4    DATASET & EXPERIMENTAL SETUP
We use the open-source dataset from Appen (https://appen.com/datasets/combined-disaster-response-data/), consisting of multilingual crisis-response tweets. The dataset statistics for tweets with ‘request’ behavior labels are shown in Table 2. For all experiments, the dataset is balanced within each split.

                 Train     Validation     Test
    Positive     3554      418            496
    Negative     17473     2152           2128

          Table 2: Dataset Statistics for both $en$ and $ml$

Each experiment is denoted as $A \to B$, where $A$ is the data used to train the model and $B$ is the data used to test it. For example, $en \to ml$ means we train the model using English tweets and test it on multilingual tweets.

Models are implemented in Keras; the details are shown in Table 3. Hyperparameters $\beta_T$, $\beta_L$, $\gamma_T$, and $\gamma_L$ are not exhaustively tuned; we leave this exploration for future work.

    $T_x$                                     30
    Deep Learning Library                     Keras
    Optimizer                                 Adam [$lr = 0.005$, $beta_1 = 0.9$, $beta_2 = 0.999$, $decay = 0.01$]
    Maximum Epoch                             100
    Dropout                                   0.2
    Early Stopping Patience                   10
    Batch Size                                32
    $\zeta_T$                                 1
    $\zeta_L$                                 0.1
    $\beta_T, \beta_L, \gamma_T, \gamma_L$    0.01

                  Table 3: Implementation Details
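Read as a Keras training configuration, Table 3 might translate into something like the following sketch; the stand-in `build_model` and the dummy data arrays are hypothetical placeholders, since the paper's training script and full architecture are not reproduced here.

```python
import numpy as np
import tensorflow as tf

def build_model(max_len=30, hidden=128):
    """A stand-in classifier head (the paper's full XLM-R + BiLSTM +
    attention stack is not reproduced here); inputs are pre-extracted
    per-token embeddings."""
    inp = tf.keras.Input(shape=(max_len, 1024))
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden))(inp)
    x = tf.keras.layers.Dropout(0.2)(x)            # dropout from Table 3
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)

model = build_model(max_len=30)                    # T_x = 30 tokens per tweet
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005,
                                                 beta_1=0.9, beta_2=0.999),
              loss="binary_crossentropy",          # Eq. (6)
              metrics=["accuracy"])
# Table 3 also lists a learning-rate decay of 0.01; recent Keras versions
# express this with a LearningRateSchedule rather than a constructor kwarg.
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
x_train = np.random.rand(64, 30, 1024).astype("float32")   # dummy stand-ins
y_train = np.random.randint(0, 2, size=(64, 1)).astype("float32")
model.fit(x_train, y_train, validation_split=0.1,
          epochs=100, batch_size=32, callbacks=[early_stop])
```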
5    RESULTS & DISCUSSION
Table 4 shows the cross-lingual performance comparison of all the models.

                   Baseline    Model M1    Model M2
    $en \to ml$    59.98       62.53       66.79
                   (80.57)     (77.02)     (82.39)
    $ml \to en$    60.93       65.69       70.95
                   (70.07)     (63.50)     (73.84)

Table 4: Performance Comparison (Accuracy in %) for $Source \to Target$ ($Source \to Source$). Baseline = XLM-R + BiLSTM + Attention. Model M1 = Baseline + Attention Realignment. Model M2 = Model M1 + Pseudo-Labelling.

The three models are described below:

    (1) Baseline: The baseline model consists of embeddings retrieved from XLM-R, followed by BiLSTM and attention layers. This is a traditional sequence (text) classifier enhanced with an attention mechanism: activations from the BiLSTM layers are weighted by the attention layer to construct the context vector, which is then passed through a dense layer and a softmax function to produce the classification output.
    (2) Model M1: Adding attention realignment to the baseline model produces model M1. Attention realignment is achieved through a language classifier that is trained in parallel, with the goal of making the task classifier more language agnostic.
    The attention weights of the task and language classifiers are manipulated by each other during training through a subtraction step (attention difference) as well as a loss component (attention loss); see Section 3.3.
    (3) Model M2: Adding the pseudo-labelling procedure to model M1 produces model M2. Using model M1, which is trained to be language agnostic, tweets from the target language are pseudo-labelled. High-quality seeds are selected (model M1 confidence $p > 0.7$) and added to the original training dataset to retrain the task classifier.

Results show that, for cross-lingual evaluation on $en \to ml$, model M1 outperforms the baseline by +4.3% and model M2 by +11.4%. On $ml \to en$, model M1 outperforms the baseline by +7.8% and model M2 by +16.5%. This shows that both models are effective for cross-lingual crisis tweet classification. An interesting observation is that attention realignment alone decreased classification performance in the same language, and pseudo-labelling brings it back up; these same-language scores are shown in brackets in Table 4. A deeper investigation in this direction on various other tasks can shed more light on the impact of the realignment mechanism.

5.1    Interpretability: Attention Visualization
We follow an attention architecture similar to that of [18]. The context vector is constructed as the dot product between the attention weights and the word activations; this represents the interpretable layer in our architecture. The attention weights capture the importance of each word in the classification process. Two examples are shown in Figure 3.

[Figure 3: Attention visualization example for ‘request’ tweets: words and their attention weights for two tweets in Haitian Creole and their translations in English (the darker the shade, the higher the attention).]

In the first example, both $en \to en$ and $ml \to ml$ give attention to the word ‘hungry’ (i.e., ‘grangou’ in Haitian Creole). Note that these two results come from models trained in the same language in which they are tested, so near-ideal performance is expected. For the baseline model in the cross-lingual set-up $en \to ml$, although it correctly predicts the label, the attention weights are more spread apart. In model M2, with attention realignment and pseudo-labelling, although with some spread,
the attention weights are shifted more toward ‘grangou’. Similarly, in example 2, the attention weights in the baseline model are more spread apart, while the cross-lingual performance of model M2 aligns more closely with $en \to en$ and $ml \to ml$. These examples show the importance of having interpretability as a key criterion in cross-lingual crisis tweet classification; it can also serve downstream tasks such as extracting relevant keywords for knowledge graph construction.
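To recover per-word importances for a visualization like Figure 3, one could read the attention weights out of the model at inference time; a small self-contained sketch follows (the example tokens and weights are made up for illustration):

```python
import numpy as np

def visualize_attention(tokens, alpha):
    """Print each token with a bar proportional to its attention weight;
    in Figure 3 the same weights drive the shading of each word."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()   # normalize for display
    for tok, w in sorted(zip(tokens, alpha), key=lambda p: -p[1]):
        print(f"{tok:>12s}  {'#' * int(20 * w):<20s}  {w:.2f}")

# Hypothetical task-classifier weights for a Haitian Creole tweet:
visualize_attention(["nou", "grangou", "anpil"], [0.15, 0.65, 0.20])
```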
6    CONCLUSION
We presented a novel approach to the unsupervised cross-lingual crisis tweet classification problem, combining an attention realignment mechanism with a pseudo-labelling procedure (over the state-of-the-art multilingual model XLM-R) to make the task classifier more language agnostic. Performance evaluation showed that models M1 and M2 outperformed the baseline by +4.3% and +11.4% respectively for cross-lingual text classification from English to multilingual data. We also presented an interpretability analysis comparing the attention layers of the models; it shows the importance of incorporating a word-level language-agnostic characteristic into the learning process when training data is available in only one language. Extensive hyperparameter tuning and expanding the idea to other tasks (including cross-task/multi-task settings) are left as future work. Another future direction is to incorporate human-engineered knowledge from multilingual knowledge graphs such as BabelNet into our model architecture, which could improve the learning of similar concepts across languages that are critical to crisis response agencies.

Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification

7    ACKNOWLEDGEMENT
The authors would like to thank the U.S. National Science Foundation (grants IIS-1815459 and IIS-1657379) for partially supporting this research.
REFERENCES
[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151 (2018).
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 120–128.
[4] Carlos Castillo. 2016. Big Crisis Data: Social Media in Disasters and Time-Critical Situations. Cambridge University Press.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172–2180.
[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 292–298.
[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv preprint arXiv:1710.04087 (2017).
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).
[11] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 80–89.
[12] David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017).
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR) 47, 4 (2015), 1–38.
[15] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).
[16] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 (2017).
[17] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018).
[18] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift. https://arxiv.org/pdf/2002.10937.pdf (2020).
[19] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services. arXiv preprint arXiv:2003.04991 (2020).
[20] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).
[21] Kathy Lee, Ankit Agrawal, and Alok Choudhary. 2013. Real-time disease surveillance using Twitter data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1474–1477.
[22] Hongmin Li, Doina Caragea, Cornelia Caragea, and Nic Herndon. 2018. Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management 26, 1 (2018), 16–27.
[23] Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[25] Guoqin Ma. 2019. Tweets Classification with BERT in the Field of Disaster Management. https://pdfs.semanticscholar.org/d226/185fa1e14118d746cf0b04dc5be8f545ec24.pdf.
[26] Reza Mazloom, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. 2019. A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11, 2 (2019), 1–19.
[27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[28] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).
[29] Ferda Ofli, Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard, Matthew Parkan, et al. 2016. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data 4, 1 (2016), 47–59.
[30] Bahman Pedrood and Hemant Purohit. 2018. Mining help intent on Twitter during disasters via transfer learning with sparse coding. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 141–153.
[31] Hemant Purohit, Carlos Castillo, Fernando Diaz, Amit Sheth, and Patrick Meier. 2013. Emergency-relief coordination on social media: Automatically matching resource requests and offers. First Monday 19, 1 (Dec. 2013).
[32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[33] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717 (2017).
[34] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[36] István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. 2013. Aid is out there: Looking for help from tweets during a large scale disaster. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1619–1629.